From f83f362d40ccceb647f7d80eb92206733d76a36b Mon Sep 17 00:00:00 2001
From: Jinliang Zheng
Date: Tue, 15 Apr 2025 17:02:32 +0800
Subject: [PATCH 001/285] mm: fix ratelimit_pages update error in
 dirty_ratio_handler()

In dirty_ratio_handler(), vm_dirty_bytes must be set to zero before
calling writeback_set_ratelimit(), as global_dirty_limits() always
prioritizes the value of vm_dirty_bytes.

It's domain_dirty_limits() that's relevant here, not node_dirty_ok:

  dirty_ratio_handler
    writeback_set_ratelimit
      global_dirty_limits(&dirty_thresh)  <- ratelimit_pages based on dirty_thresh
        domain_dirty_limits
          if (bytes)                      <- bytes = vm_dirty_bytes  <-------+
            thresh = f1(bytes)            <- prioritizes vm_dirty_bytes      |
          else                                                               |
            thresh = f2(ratio)                                               |
      ratelimit_pages = f3(dirty_thresh)                                     |
    vm_dirty_bytes = 0                    <- it's too late! -----------------+

This causes ratelimit_pages to still be calculated from the stale
vm_dirty_bytes value, which is wrong at this point.

The impact is difficult to observe from userspace directly, because
ratelimit_pages is not exported through any procfs/sysfs interface.
However, it has a real effect on the balancing of dirty pages.  For
example:

1. By default, vm_dirty_ratio=40 and vm_dirty_bytes=0.

2. echo 8192 > dirty_bytes: now vm_dirty_bytes=8192, vm_dirty_ratio=0,
   and ratelimit_pages is calculated based on vm_dirty_bytes.

3. echo 20 > dirty_ratio: because vm_dirty_bytes has not yet been reset
   to zero when writeback_set_ratelimit() -> global_dirty_limits() ->
   domain_dirty_limits() is called, ratelimit_pages is still calculated
   based on vm_dirty_bytes instead of vm_dirty_ratio.  This does not
   match the user's actual intent.

Link: https://lkml.kernel.org/r/20250415090232.7544-1-alexjlzheng@tencent.com
Fixes: 9d823e8f6b1b ("writeback: per task dirty rate limit")
Signed-off-by: Jinliang Zheng
Reviewed-by: MengEn Sun
Cc: Andrea Righi
Cc: Fengguang Wu
Cc: Jinliang Zheng
Cc: Matthew Wilcox (Oracle)
Cc:
Signed-off-by: Andrew Morton
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c81624bc3969..20e1d76f1eba 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -520,8 +520,8 @@ static int dirty_ratio_handler(const struct ctl_table *table, int write, void *b
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
-		writeback_set_ratelimit();
 		vm_dirty_bytes = 0;
+		writeback_set_ratelimit();
 	}
 	return ret;
 }

From 4e92030c05dc351e62955ac1aba7233157f49b78 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 2 Apr 2025 19:16:55 +0100
Subject: [PATCH 002/285] mm: set the pte dirty if the folio is already dirty

Patch series "Add folio_mk_pte()", v2.

Today if you have a folio and want to create a PTE that points to the
first page in it, you have to convert from a folio to a page.  That's
zero-cost today but will be more expensive in the future.

I didn't want to add folio_mk_pte() to each architecture, and I didn't
want to lose any optimisations that architectures have from their own
implementation of mk_pte().  Fortunately, most architectures have by now
turned their mk_pte() into a fairly bland variant of pfn_pte(), but s390
has a special optimisation that needs to be moved into generic code in
the first patch.

At the end of this patch set, we have mk_pte() and folio_mk_pte() in
mm.h and each architecture only has to implement pfn_pte().  We've also
eliminated mk_huge_pte(), mk_huge_pmd() and mk_pmd().
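To make that end state concrete, here is a minimal sketch; the helper
bodies mirror the generic definitions this series adds to mm.h, and the
caller lines in the comment are illustrative only:

	/* Generic helpers as they exist at the end of the series;
	 * each architecture now only implements pfn_pte(). */
	static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
	{
		return pfn_pte(page_to_pfn(page), pgprot);
	}

	static inline pte_t folio_mk_pte(struct folio *folio, pgprot_t pgprot)
	{
		return pfn_pte(folio_pfn(folio), pgprot);
	}

	/* Caller conversion this enables (illustrative):
	 *   before: entry = mk_pte(&folio->page, vma->vm_page_prot);
	 *   after:  entry = folio_mk_pte(folio, vma->vm_page_prot);
	 */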
This patch (of 11):

If the first access to a folio is a read that is then followed by a
write, we can save a page fault.  s390 implemented this in their
mk_pte() in commit abf09bed3cce ("s390/mm: implement software dirty
bits"), but other architectures can also benefit from this.

Link: https://lkml.kernel.org/r/20250402181709.2386022-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250402181709.2386022-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: David Hildenbrand
Reviewed-by: Alexander Gordeev	# for s390
Cc: Zi Yan
Cc: Andreas Larsson
Cc: Anton Ivanov
Cc: Dave Hansen
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Johannes Berg
Cc: Muchun Song
Cc: Richard Weinberger
Cc:
Signed-off-by: Andrew Morton
---
 arch/s390/include/asm/pgtable.h | 7 +------
 mm/memory.c                     | 2 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index f8a6b54986ec..49833002232b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1450,12 +1450,7 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)

 static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 {
-	unsigned long physpage = page_to_phys(page);
-	pte_t __pte = mk_pte_phys(physpage, pgprot);
-
-	if (pte_write(__pte) && PageDirty(page))
-		__pte = pte_mkdirty(__pte);
-	return __pte;
+	return mk_pte_phys(page_to_phys(page), pgprot);
 }

 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
diff --git a/mm/memory.c b/mm/memory.c
index 49199410805c..da4778fb3a38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5245,6 +5245,8 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,

 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	else if (pte_write(entry) && folio_test_dirty(folio))
+		entry = pte_mkdirty(entry);
 	if (unlikely(vmf_orig_pte_uffd_wp(vmf)))
 		entry = pte_mkuffd_wp(entry);
 	/* copy-on-write page */

From cb5b13cd6c9237fe5ac978b22453eb3fa098a8d6 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 2 Apr 2025 19:16:56 +0100
Subject: [PATCH 003/285] mm: introduce a common definition of mk_pte()

Most architectures simply call pfn_pte().  Centralise that as the normal
definition and remove the definition of mk_pte() from the architectures
which have either that exact definition or something similar.

Link: https://lkml.kernel.org/r/20250402181709.2386022-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Geert Uytterhoeven	# m68k
Acked-by: David Hildenbrand
Reviewed-by: Alexander Gordeev	# s390
Cc: Zi Yan
Cc: Andreas Larsson
Cc: Anton Ivanov
Cc: Dave Hansen
Cc: "David S.
Miller" Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- arch/alpha/include/asm/pgtable.h | 7 ------- arch/arc/include/asm/pgtable-levels.h | 1 - arch/arm/include/asm/pgtable.h | 1 - arch/arm64/include/asm/pgtable.h | 6 ------ arch/csky/include/asm/pgtable.h | 5 ----- arch/hexagon/include/asm/pgtable.h | 3 --- arch/loongarch/include/asm/pgtable.h | 6 ------ arch/m68k/include/asm/mcf_pgtable.h | 6 ------ arch/m68k/include/asm/motorola_pgtable.h | 6 ------ arch/m68k/include/asm/sun3_pgtable.h | 6 ------ arch/microblaze/include/asm/pgtable.h | 8 -------- arch/mips/include/asm/pgtable.h | 6 ------ arch/nios2/include/asm/pgtable.h | 6 ------ arch/openrisc/include/asm/pgtable.h | 2 -- arch/parisc/include/asm/pgtable.h | 6 ------ arch/powerpc/include/asm/pgtable.h | 3 +-- arch/riscv/include/asm/pgtable.h | 2 -- arch/s390/include/asm/pgtable.h | 5 ----- arch/sh/include/asm/pgtable_32.h | 8 -------- arch/sparc/include/asm/pgtable_64.h | 1 - arch/xtensa/include/asm/pgtable.h | 6 ------ include/linux/mm.h | 9 +++++++++ 22 files changed, 10 insertions(+), 99 deletions(-) diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index 02e8817a8921..2676017f42f1 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -192,13 +192,6 @@ extern unsigned long __zero_page(void); #define pte_pfn(pte) (pte_val(pte) >> PFN_PTE_SHIFT) #define pte_page(pte) pfn_to_page(pte_pfn(pte)) -#define mk_pte(page, pgprot) \ -({ \ - pte_t pte; \ - \ - pte_val(pte) = (page_to_pfn(page) << 32) | pgprot_val(pgprot); \ - pte; \ -}) extern inline pte_t pfn_pte(unsigned long physpfn, pgprot_t pgprot) { pte_t pte; pte_val(pte) = (PHYS_TWIDDLE(physpfn) << 32) | pgprot_val(pgprot); return pte; } diff --git a/arch/arc/include/asm/pgtable-levels.h b/arch/arc/include/asm/pgtable-levels.h index 86e148226463..55dbd2719e35 100644 --- a/arch/arc/include/asm/pgtable-levels.h +++ b/arch/arc/include/asm/pgtable-levels.h @@ -177,7 +177,6 @@ #define set_pte(ptep, pte) ((*(ptep)) = (pte)) #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) #define pfn_pte(pfn, prot) __pte(__pfn_to_phys(pfn) | pgprot_val(prot)) -#define mk_pte(page, prot) pfn_pte(page_to_pfn(page), prot) #ifdef CONFIG_ISA_ARCV2 #define pmd_leaf(x) (pmd_val(x) & _PAGE_HW_SZ) diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 6b986ef6042f..7f1c3b4e3e04 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -168,7 +168,6 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd) #define pfn_pte(pfn,prot) __pte(__pfn_to_phys(pfn) | pgprot_val(prot)) #define pte_page(pte) pfn_to_page(pte_pfn(pte)) -#define mk_pte(page,prot) pfn_pte(page_to_pfn(page), prot) #define pte_clear(mm,addr,ptep) set_pte_ext(ptep, __pte(0), 0) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index d3b538be1500..f03c6c2b0944 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -813,12 +813,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) /* use ONLY for statically allocated translation tables */ #define pte_offset_kimg(dir,addr) ((pte_t *)__phys_to_kimg(pte_offset_phys((dir), (addr)))) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. 
- */ -#define mk_pte(page,prot) pfn_pte(page_to_pfn(page),prot) - #if CONFIG_PGTABLE_LEVELS > 2 #define pmd_ERROR(e) \ diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h index a397e1718ab6..b8378431aeff 100644 --- a/arch/csky/include/asm/pgtable.h +++ b/arch/csky/include/asm/pgtable.h @@ -249,11 +249,6 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot) return __pgprot(prot); } -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { return __pte((pte_val(pte) & _PAGE_CHG_MASK) | diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h index 8c5b7a1c3d90..9fbdfdbc539f 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -238,9 +238,6 @@ static inline int pte_present(pte_t pte) return pte_val(pte) & _PAGE_PRESENT; } -/* mk_pte - make a PTE out of a page pointer and protection bits */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - /* pte_page - returns a page (frame pointer/descriptor?) based on a PTE */ #define pte_page(x) pfn_to_page(pte_pfn(x)) diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index da346733a1da..9ba3a4ebcd98 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -426,12 +426,6 @@ static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a) return false; } -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { return __pte((pte_val(pte) & _PAGE_CHG_MASK) | diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h index 48f87a8a8832..f5c596b211d4 100644 --- a/arch/m68k/include/asm/mcf_pgtable.h +++ b/arch/m68k/include/asm/mcf_pgtable.h @@ -96,12 +96,6 @@ #define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte_val(pte) = (pte_val(pte) & CF_PAGE_CHG_MASK) | pgprot_val(newprot); diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h index 9866c7acdabe..040ac3bad713 100644 --- a/arch/m68k/include/asm/motorola_pgtable.h +++ b/arch/m68k/include/asm/motorola_pgtable.h @@ -81,12 +81,6 @@ extern unsigned long mm_cachebits; #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd)) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. 
- */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte_val(pte) = (pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot); diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h index 30081aee8164..73745dc0ec0e 100644 --- a/arch/m68k/include/asm/sun3_pgtable.h +++ b/arch/m68k/include/asm/sun3_pgtable.h @@ -76,12 +76,6 @@ #ifndef __ASSEMBLY__ -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte_val(pte) = (pte_val(pte) & SUN3_PAGE_CHG_MASK) | pgprot_val(newprot); diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h index e4ea2ec3642f..b1bb2c65dd04 100644 --- a/arch/microblaze/include/asm/pgtable.h +++ b/arch/microblaze/include/asm/pgtable.h @@ -285,14 +285,6 @@ static inline pte_t mk_pte_phys(phys_addr_t physpage, pgprot_t pgprot) return pte; } -#define mk_pte(page, pgprot) \ -({ \ - pte_t pte; \ - pte_val(pte) = (((page - mem_map) << PAGE_SHIFT) + memory_start) | \ - pgprot_val(pgprot); \ - pte; \ -}) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte_val(pte) = (pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot); diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index c29a551eb0ca..d69cfa5a8ac6 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -504,12 +504,6 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma, return true; } -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - #if defined(CONFIG_XPA) static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index eab87c6beacb..f490f2fa0dca 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -217,12 +217,6 @@ static inline void pte_clear(struct mm_struct *mm, set_pte(ptep, null); } -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, prot) (pfn_pte(page_to_pfn(page), prot)) - /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. 
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h index 60c6ce7ff2dc..71bfb8c8c482 100644 --- a/arch/openrisc/include/asm/pgtable.h +++ b/arch/openrisc/include/asm/pgtable.h @@ -299,8 +299,6 @@ static inline pte_t __mk_pte(void *page, pgprot_t pgprot) return pte; } -#define mk_pte(page, pgprot) __mk_pte(page_address(page), (pgprot)) - #define mk_pte_phys(physpage, pgprot) \ ({ \ pte_t __pte; \ diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h index babf65751e81..e85963fefa63 100644 --- a/arch/parisc/include/asm/pgtable.h +++ b/arch/parisc/include/asm/pgtable.h @@ -338,10 +338,6 @@ static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; re #endif -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ #define __mk_pte(addr,pgprot) \ ({ \ pte_t __pte; \ @@ -351,8 +347,6 @@ static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; re __pte; \ }) -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { pte_t pte; diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 2f72ad885332..93d77ad5a92f 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -53,9 +53,8 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, #define MAX_PTRS_PER_PGD PTRS_PER_PGD #endif -/* Keep these as a macros to avoid include dependency mess */ +/* Keep this as a macro to avoid include dependency mess */ #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) static inline unsigned long pte_pfn(pte_t pte) { diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 428e48e5f57d..f19240fd018e 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -343,8 +343,6 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot) return __pte((pfn << _PAGE_PFN_SHIFT) | prot_val); } -#define mk_pte(page, prot) pfn_pte(page_to_pfn(page), prot) - #define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 49833002232b..3ef5d2198480 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1448,11 +1448,6 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot) return pte_mkyoung(__pte); } -static inline pte_t mk_pte(struct page *page, pgprot_t pgprot) -{ - return mk_pte_phys(page_to_phys(page), pgprot); -} - #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1)) #define p4d_index(address) (((address) >> P4D_SHIFT) & (PTRS_PER_P4D-1)) #define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1)) diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h index f939f1215232..db2e48366e0d 100644 --- a/arch/sh/include/asm/pgtable_32.h +++ b/arch/sh/include/asm/pgtable_32.h @@ -380,14 +380,6 @@ PTE_BIT_FUNC(low, mkspecial, |= _PAGE_SPECIAL); #define pgprot_noncached pgprot_writecombine -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. 
- * - * extern pte_t mk_pte(struct page *page, pgprot_t pgprot) - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte.pte_low &= _PAGE_CHG_MASK; diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index dc28f2c4eee3..d9c903576084 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -225,7 +225,6 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot) BUILD_BUG_ON(_PAGE_SZBITS_4U != 0UL || _PAGE_SZBITS_4V != 0UL); return __pte(paddr | pgprot_val(prot)); } -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index 1647a7cc3fbf..cb1725c40e36 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -269,17 +269,11 @@ static inline pte_t pte_mkwrite_novma(pte_t pte) ((__pgprot((pgprot_val(prot) & ~_PAGE_CA_MASK) | \ _PAGE_CA_BYPASS))) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ - #define PFN_PTE_SHIFT PAGE_SHIFT #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) #define pte_same(a,b) (pte_val(a) == pte_val(b)) #define pte_page(x) pfn_to_page(pte_pfn(x)) #define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) -#define mk_pte(page, prot) pfn_pte(page_to_pfn(page), prot) static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { diff --git a/include/linux/mm.h b/include/linux/mm.h index bf55206935c4..aa944eaad0ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2004,6 +2004,15 @@ static inline struct folio *pfn_folio(unsigned long pfn) return page_folio(pfn_to_page(pfn)); } +#ifndef mk_pte +#ifdef CONFIG_MMU +static inline pte_t mk_pte(struct page *page, pgprot_t pgprot) +{ + return pfn_pte(page_to_pfn(page), pgprot); +} +#endif +#endif + static inline bool folio_has_pincount(const struct folio *folio) { if (IS_ENABLED(CONFIG_64BIT)) From aec4417168594f89ea0854c7b06dc70cbeaa219c Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:16:57 +0100 Subject: [PATCH 004/285] sparc32: remove custom definition of mk_pte() Instead of defining pfn_pte() in terms of mk_pte(), make pfn_pte() the base implementation. That lets us use the generic definition of mk_pte(). Link: https://lkml.kernel.org/r/20250402181709.2386022-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Zi Yan Cc: "David S. 
Miller" Cc: Andreas Larsson Cc: Alexander Gordeev Cc: Anton Ivanov Cc: Dave Hansen Cc: David Hildenbrand Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- arch/sparc/include/asm/pgtable_32.h | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index 62bcafe38b1f..1454ebe91539 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -255,7 +255,11 @@ static inline pte_t pte_mkyoung(pte_t pte) } #define PFN_PTE_SHIFT (PAGE_SHIFT - 4) -#define pfn_pte(pfn, prot) mk_pte(pfn_to_page(pfn), prot) + +static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) +{ + return __pte((pfn << PFN_PTE_SHIFT) | pgprot_val(pgprot)); +} static inline unsigned long pte_pfn(pte_t pte) { @@ -272,15 +276,6 @@ static inline unsigned long pte_pfn(pte_t pte) #define pte_page(pte) pfn_to_page(pte_pfn(pte)) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -static inline pte_t mk_pte(struct page *page, pgprot_t pgprot) -{ - return __pte((page_to_pfn(page) << (PAGE_SHIFT-4)) | pgprot_val(pgprot)); -} - static inline pte_t mk_pte_phys(unsigned long page, pgprot_t pgprot) { return __pte(((page) >> 4) | pgprot_val(pgprot)); From a03079e4eeb1031dd669394b34e341cd3ff0e24b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:16:58 +0100 Subject: [PATCH 005/285] x86: remove custom definition of mk_pte() Move the shadow stack check to pfn_pte() which lets us use the common definition of mk_pte(). Link: https://lkml.kernel.org/r/20250402181709.2386022-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Dave Hansen Cc: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: David Hildenbrand Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable.h | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 7bd6bd6df4a1..2ce98b547a25 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -784,6 +784,9 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot) static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot) { phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; + /* This bit combination is used to mark shadow stacks */ + WARN_ON_ONCE((pgprot_val(pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == + _PAGE_DIRTY); pfn ^= protnone_mask(pgprot_val(pgprot)); pfn &= PTE_PFN_MASK; return __pte(pfn | check_pgprot(pgprot)); @@ -1080,22 +1083,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) */ #define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd)) -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. 
- * - * (Currently stuck as a macro because of indirect forward reference - * to linux/mm.h:page_to_nid()) - */ -#define mk_pte(page, pgprot) \ -({ \ - pgprot_t __pgprot = pgprot; \ - \ - WARN_ON_ONCE((pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \ - _PAGE_DIRTY); \ - pfn_pte(page_to_pfn(page), __pgprot); \ -}) - static inline int pmd_bad(pmd_t pmd) { return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != From 669eec68f680be2b530e2ea96cb2f519cb0b3336 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:16:59 +0100 Subject: [PATCH 006/285] um: remove custom definition of mk_pte() Move the pfn_pte() definitions from the 2level and 4level files to the generic pgtable.h and delete the custom definition of mk_pte() so that we use the central definition. Link: https://lkml.kernel.org/r/20250402181709.2386022-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Zi Yan Cc: Richard Weinberger Cc: Anton Ivanov Cc: Johannes Berg Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Dave Hansen Cc: David Hildenbrand Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- arch/um/include/asm/pgtable-2level.h | 1 - arch/um/include/asm/pgtable-4level.h | 9 --------- arch/um/include/asm/pgtable.h | 18 ++++++++---------- 3 files changed, 8 insertions(+), 20 deletions(-) diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h index ab0c8dd86564..14ec16f92ce4 100644 --- a/arch/um/include/asm/pgtable-2level.h +++ b/arch/um/include/asm/pgtable-2level.h @@ -37,7 +37,6 @@ static inline void pgd_mkuptodate(pgd_t pgd) { } #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval)) #define pte_pfn(x) phys_to_pfn(pte_val(x)) -#define pfn_pte(pfn, prot) __pte(pfn_to_phys(pfn) | pgprot_val(prot)) #define pfn_pmd(pfn, prot) __pmd(pfn_to_phys(pfn) | pgprot_val(prot)) #endif diff --git a/arch/um/include/asm/pgtable-4level.h b/arch/um/include/asm/pgtable-4level.h index 0d279caee93c..7a271b7b83d2 100644 --- a/arch/um/include/asm/pgtable-4level.h +++ b/arch/um/include/asm/pgtable-4level.h @@ -102,15 +102,6 @@ static inline unsigned long pte_pfn(pte_t pte) return phys_to_pfn(pte_val(pte)); } -static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot) -{ - pte_t pte; - phys_t phys = pfn_to_phys(page_nr); - - pte_set_val(pte, phys, pgprot); - return pte; -} - static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) { return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)); diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h index 5601ca98e8a6..ca2a519d53ab 100644 --- a/arch/um/include/asm/pgtable.h +++ b/arch/um/include/asm/pgtable.h @@ -260,19 +260,17 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b) return !((pte_val(pte_a) ^ pte_val(pte_b)) & ~_PAGE_NEEDSYNC); } -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. 
- */ - #define __virt_to_page(virt) phys_to_page(__pa(virt)) #define virt_to_page(addr) __virt_to_page((const unsigned long) addr) -#define mk_pte(page, pgprot) \ - ({ pte_t pte; \ - \ - pte_set_val(pte, page_to_phys(page), (pgprot)); \ - pte;}) +static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) +{ + pte_t pte; + + pte_set_val(pte, pfn_to_phys(pfn), pgprot); + + return pte; +} static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { From 4ec492a628d897806bb6dc13b1c257c4e06eb1cf Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:00 +0100 Subject: [PATCH 007/285] mm: make mk_pte() definition unconditional All architectures now use the common mk_pte() definition, so we can remove the condition. Link: https://lkml.kernel.org/r/20250402181709.2386022-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index aa944eaad0ec..3a55903d68e2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2004,14 +2004,12 @@ static inline struct folio *pfn_folio(unsigned long pfn) return page_folio(pfn_to_page(pfn)); } -#ifndef mk_pte #ifdef CONFIG_MMU static inline pte_t mk_pte(struct page *page, pgprot_t pgprot) { return pfn_pte(page_to_pfn(page), pgprot); } #endif -#endif static inline bool folio_has_pincount(const struct folio *folio) { From deb8d4d28e4d05c4ecfc6e242c0a53d49e119224 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:01 +0100 Subject: [PATCH 008/285] mm: add folio_mk_pte() Remove a cast from folio to page in four callers of mk_pte(). Link: https://lkml.kernel.org/r/20250402181709.2386022-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- include/linux/mm.h | 15 +++++++++++++++ mm/memory.c | 6 +++--- mm/userfaultfd.c | 2 +- 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3a55903d68e2..cbad8c663c4d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2009,6 +2009,21 @@ static inline pte_t mk_pte(struct page *page, pgprot_t pgprot) { return pfn_pte(page_to_pfn(page), pgprot); } + +/** + * folio_mk_pte - Create a PTE for this folio + * @folio: The folio to create a PTE for + * @pgprot: The page protection bits to use + * + * Create a page table entry for the first page of this folio. + * This is suitable for passing to set_ptes(). + * + * Return: A page table entry suitable for mapping this folio. 
+ */ +static inline pte_t folio_mk_pte(struct folio *folio, pgprot_t pgprot) +{ + return pfn_pte(folio_pfn(folio), pgprot); +} #endif static inline bool folio_has_pincount(const struct folio *folio) diff --git a/mm/memory.c b/mm/memory.c index da4778fb3a38..a9e631927478 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -929,7 +929,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma rss[MM_ANONPAGES]++; /* All done, just insert the new page copy in the child */ - pte = mk_pte(&new_folio->page, dst_vma->vm_page_prot); + pte = folio_mk_pte(new_folio, dst_vma->vm_page_prot); pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma); if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte))) /* Uffd-wp needs to be delivered to dest pte as well */ @@ -3523,7 +3523,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) inc_mm_counter(mm, MM_ANONPAGES); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); - entry = mk_pte(&new_folio->page, vma->vm_page_prot); + entry = folio_mk_pte(new_folio, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (unlikely(unshare)) { if (pte_soft_dirty(vmf->orig_pte)) @@ -5013,7 +5013,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) */ __folio_mark_uptodate(folio); - entry = mk_pte(&folio->page, vma->vm_page_prot); + entry = folio_mk_pte(folio, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry), vma); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e0db855c89b4..bc473ad21202 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1063,7 +1063,7 @@ static int move_present_pte(struct mm_struct *mm, folio_move_anon_rmap(src_folio, dst_vma); src_folio->index = linear_page_index(dst_vma, dst_addr); - orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot); + orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot); /* Set soft dirty bit so userspace can notice the pte was moved */ #ifdef CONFIG_MEM_SOFT_DIRTY orig_dst_pte = pte_mksoft_dirty(orig_dst_pte); From e06fa168c342c8b937cd72c89582e6af008b204b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:02 +0100 Subject: [PATCH 009/285] hugetlb: simplify make_huge_pte() mk_huge_pte() is a bad API. Despite its name, it creates a normal PTE which is later transformed into a huge PTE by arch_make_huge_pte(). So replace the page argument with a folio argument and call folio_mk_pte() instead. Then, because we now know this is a regular PTE rather than a huge one, use pte_mkdirty() instead of huge_pte_mkdirty() (and similar functions). Link: https://lkml.kernel.org/r/20250402181709.2386022-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Zi Yan Cc: Muchun Song Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: David Hildenbrand Cc: "David S. 
Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- mm/hugetlb.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7ae38bfb9096..a44d4b0d844c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5472,18 +5472,16 @@ const struct vm_operations_struct hugetlb_vm_ops = { .pagesize = hugetlb_vm_op_pagesize, }; -static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, +static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio, bool try_mkwrite) { - pte_t entry; + pte_t entry = folio_mk_pte(folio, vma->vm_page_prot); unsigned int shift = huge_page_shift(hstate_vma(vma)); if (try_mkwrite && (vma->vm_flags & VM_WRITE)) { - entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page, - vma->vm_page_prot))); + entry = pte_mkwrite_novma(pte_mkdirty(entry)); } else { - entry = huge_pte_wrprotect(mk_huge_pte(page, - vma->vm_page_prot)); + entry = pte_wrprotect(entry); } entry = pte_mkyoung(entry); entry = arch_make_huge_pte(entry, shift, vma->vm_flags); @@ -5538,7 +5536,7 @@ static void hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr, struct folio *new_folio, pte_t old, unsigned long sz) { - pte_t newpte = make_huge_pte(vma, &new_folio->page, true); + pte_t newpte = make_huge_pte(vma, new_folio, true); __folio_mark_uptodate(new_folio); hugetlb_add_new_anon_rmap(new_folio, vma, addr); @@ -6288,7 +6286,7 @@ retry_avoidcopy: spin_lock(vmf->ptl); vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h)); if (likely(vmf->pte && pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) { - pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare); + pte_t newpte = make_huge_pte(vma, new_folio, !unshare); /* Break COW or unshare */ huge_ptep_clear_flush(vma, vmf->address, vmf->pte); @@ -6568,7 +6566,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, hugetlb_add_new_anon_rmap(folio, vma, vmf->address); else hugetlb_add_file_rmap(folio); - new_pte = make_huge_pte(vma, &folio->page, vma->vm_flags & VM_SHARED); + new_pte = make_huge_pte(vma, folio, vma->vm_flags & VM_SHARED); /* * If this pte was previously wr-protected, keep it wr-protected even * if populated. @@ -7053,7 +7051,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY * with wp flag set, don't set pte write bit. */ - _dst_pte = make_huge_pte(dst_vma, &folio->page, + _dst_pte = make_huge_pte(dst_vma, folio, !wp_enabled && !(is_continue && !vm_shared)); /* * Always mark UFFDIO_COPY page dirty; note that this may not be From 7b7aa8a4adb62e3c3312d1c6086891014addc567 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:03 +0100 Subject: [PATCH 010/285] mm: remove mk_huge_pte() The only remaining user of mk_huge_pte() is the debug code, so remove the API and replace its use with pfn_pte() which lets us remove the conversion to a page first. We should always call arch_make_huge_pte() to turn this PTE into a huge PTE before operating on it with huge_pte_mkdirty() etc. Link: https://lkml.kernel.org/r/20250402181709.2386022-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: David Hildenbrand Cc: "David S. 
Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- include/asm-generic/hugetlb.h | 5 ----- mm/debug_vm_pgtable.c | 18 +++++------------- 2 files changed, 5 insertions(+), 18 deletions(-) diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h index 2afc95bf1655..3e0a8fe9b108 100644 --- a/include/asm-generic/hugetlb.h +++ b/include/asm-generic/hugetlb.h @@ -5,11 +5,6 @@ #include #include -static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot) -{ - return mk_pte(page, pgprot); -} - static inline unsigned long huge_pte_write(pte_t pte) { return pte_write(pte); diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index bc748f700a9e..7731b238b534 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -910,26 +910,18 @@ static void __init swap_migration_tests(struct pgtable_debug_args *args) #ifdef CONFIG_HUGETLB_PAGE static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { - struct page *page; pte_t pte; pr_debug("Validating HugeTLB basic\n"); - /* - * Accessing the page associated with the pfn is safe here, - * as it was previously derived from a real kernel symbol. - */ - page = pfn_to_page(args->fixed_pmd_pfn); - pte = mk_huge_pte(page, args->page_prot); + pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot); + pte = arch_make_huge_pte(pte, PMD_SHIFT, VM_ACCESS_FLAGS); +#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB + WARN_ON(!pte_huge(pte)); +#endif WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte))); WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte)))); WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte)))); - -#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB - pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot); - - WARN_ON(!pte_huge(arch_make_huge_pte(pte, PMD_SHIFT, VM_ACCESS_FLAGS))); -#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ } #else /* !CONFIG_HUGETLB_PAGE */ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { } From e3981db444a0a18d350d9f92e3f2e8d489b54211 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:04 +0100 Subject: [PATCH 011/285] mm: add folio_mk_pmd() Removes five conversions from folio to page. Also removes both callers of mk_pmd() that aren't part of mk_huge_pmd(), getting us a step closer to removing the confusion between mk_pmd(), mk_huge_pmd() and pmd_mkhuge(). Link: https://lkml.kernel.org/r/20250402181709.2386022-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: David Hildenbrand Cc: "David S. 
Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- fs/dax.c | 3 +-- include/linux/mm.h | 17 +++++++++++++++++ mm/huge_memory.c | 11 +++++------ mm/khugepaged.c | 2 +- mm/memory.c | 2 +- 5 files changed, 25 insertions(+), 10 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 676303419e9e..5087ca3b1f7b 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1422,8 +1422,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); mm_inc_nr_ptes(vma->vm_mm); } - pmd_entry = mk_pmd(&zero_folio->page, vmf->vma->vm_page_prot); - pmd_entry = pmd_mkhuge(pmd_entry); + pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot); set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry); spin_unlock(ptl); trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry); diff --git a/include/linux/mm.h b/include/linux/mm.h index cbad8c663c4d..733dd7100ca1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2024,7 +2024,24 @@ static inline pte_t folio_mk_pte(struct folio *folio, pgprot_t pgprot) { return pfn_pte(folio_pfn(folio), pgprot); } + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/** + * folio_mk_pmd - Create a PMD for this folio + * @folio: The folio to create a PMD for + * @pgprot: The page protection bits to use + * + * Create a page table entry for the first page of this folio. + * This is suitable for passing to set_pmd_at(). + * + * Return: A page table entry suitable for mapping this folio. + */ +static inline pmd_t folio_mk_pmd(struct folio *folio, pgprot_t pgprot) +{ + return pmd_mkhuge(pfn_pmd(folio_pfn(folio), pgprot)); +} #endif +#endif /* CONFIG_MMU */ static inline bool folio_has_pincount(const struct folio *folio) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 47d76d03ce30..1cd975503131 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1203,7 +1203,7 @@ static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd, { pmd_t entry; - entry = mk_huge_pmd(&folio->page, vma->vm_page_prot); + entry = folio_mk_pmd(folio, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); @@ -1309,8 +1309,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, struct folio *zero_folio) { pmd_t entry; - entry = mk_pmd(&zero_folio->page, vma->vm_page_prot); - entry = pmd_mkhuge(entry); + entry = folio_mk_pmd(zero_folio, vma->vm_page_prot); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, haddr, pmd, entry); mm_inc_nr_ptes(mm); @@ -2653,12 +2652,12 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm folio_move_anon_rmap(src_folio, dst_vma); src_folio->index = linear_page_index(dst_vma, dst_addr); - _dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot); + _dst_pmd = folio_mk_pmd(src_folio, dst_vma->vm_page_prot); /* Follow mremap() behavior and treat the entry dirty after the move */ _dst_pmd = pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma); } else { src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd); - _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot); + _dst_pmd = folio_mk_pmd(src_folio, dst_vma->vm_page_prot); } set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd); @@ -4680,7 +4679,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) entry = pmd_to_swp_entry(*pvmw->pmd); folio_get(folio); - pmde = mk_huge_pmd(new, 
READ_ONCE(vma->vm_page_prot)); + pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot)); if (pmd_swp_soft_dirty(*pvmw->pmd)) pmde = pmd_mksoft_dirty(pmde); if (is_writable_migration_entry(entry)) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index cc945c6ab3bd..b8838ba8207a 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1239,7 +1239,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, __folio_mark_uptodate(folio); pgtable = pmd_pgtable(_pmd); - _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot); + _pmd = folio_mk_pmd(folio, vma->vm_page_prot); _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma); spin_lock(pmd_ptl); diff --git a/mm/memory.c b/mm/memory.c index a9e631927478..86e7e66e3c5b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5188,7 +5188,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) flush_icache_pages(vma, page, HPAGE_PMD_NR); - entry = mk_huge_pmd(page, vma->vm_page_prot); + entry = folio_mk_pmd(folio, vma->vm_page_prot); if (write) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); From 5071ea3d7b3d1e9660524374083a929a6885d78a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 19:17:05 +0100 Subject: [PATCH 012/285] arch: remove mk_pmd() There are now no callers of mk_huge_pmd() and mk_pmd(). Remove them. Link: https://lkml.kernel.org/r/20250402181709.2386022-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Zi Yan Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Anton Ivanov Cc: Dave Hansen Cc: David Hildenbrand Cc: "David S. Miller" Cc: Geert Uytterhoeven Cc: Johannes Berg Cc: Muchun Song Cc: Richard Weinberger Cc: Signed-off-by: Andrew Morton --- arch/arc/include/asm/hugepage.h | 2 -- arch/arc/include/asm/pgtable-levels.h | 1 - arch/arm/include/asm/pgtable-3level.h | 1 - arch/arm64/include/asm/pgtable.h | 1 - arch/loongarch/include/asm/pgtable.h | 1 - arch/loongarch/mm/pgtable.c | 9 --------- arch/mips/include/asm/pgtable.h | 3 --- arch/mips/mm/pgtable-32.c | 10 ---------- arch/mips/mm/pgtable-64.c | 9 --------- arch/powerpc/include/asm/book3s/64/pgtable.h | 1 - arch/powerpc/mm/book3s64/pgtable.c | 5 ----- arch/riscv/include/asm/pgtable-64.h | 2 -- arch/s390/include/asm/pgtable.h | 1 - arch/sparc/include/asm/pgtable_64.h | 1 - arch/x86/include/asm/pgtable.h | 2 -- include/linux/huge_mm.h | 2 -- 16 files changed, 51 deletions(-) diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h index 8a2441670a8f..7765dc105d54 100644 --- a/arch/arc/include/asm/hugepage.h +++ b/arch/arc/include/asm/hugepage.h @@ -40,8 +40,6 @@ static inline pmd_t pte_pmd(pte_t pte) #define pmd_young(pmd) pte_young(pmd_pte(pmd)) #define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd)) -#define mk_pmd(page, prot) pte_pmd(mk_pte(page, prot)) - #define pmd_trans_huge(pmd) (pmd_val(pmd) & _PAGE_HW_SZ) #define pfn_pmd(pfn, prot) (__pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))) diff --git a/arch/arc/include/asm/pgtable-levels.h b/arch/arc/include/asm/pgtable-levels.h index 55dbd2719e35..d1ce4b0f1071 100644 --- a/arch/arc/include/asm/pgtable-levels.h +++ b/arch/arc/include/asm/pgtable-levels.h @@ -142,7 +142,6 @@ #define pmd_pfn(pmd) ((pmd_val(pmd) & PMD_MASK) >> PAGE_SHIFT) #define pfn_pmd(pfn,prot) __pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) -#define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot) #endif diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index fa5939eb9864..7b71a3d414b7 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ 
b/arch/arm/include/asm/pgtable-3level.h @@ -209,7 +209,6 @@ PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF); #define pmd_pfn(pmd) (((pmd_val(pmd) & PMD_MASK) & PHYS_MASK) >> PAGE_SHIFT) #define pfn_pmd(pfn,prot) (__pmd(((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))) -#define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot) /* No hardware dirty/accessed bits -- generic_pmdp_establish() fits */ #define pmdp_establish generic_pmdp_establish diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index f03c6c2b0944..2a77f11b78d5 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -609,7 +609,6 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd) #define __phys_to_pmd_val(phys) __phys_to_pte_val(phys) #define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT) #define pfn_pmd(pfn,prot) __pmd(__phys_to_pmd_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) -#define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot) #define pud_young(pud) pte_young(pud_pte(pud)) #define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index 9ba3a4ebcd98..a3f17914dbab 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -255,7 +255,6 @@ static inline void pmd_clear(pmd_t *pmdp) #define pmd_page_vaddr(pmd) pmd_val(pmd) -extern pmd_t mk_pmd(struct page *page, pgprot_t prot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); #define pte_page(x) pfn_to_page(pte_pfn(x)) diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c index 22a94bb3e6e8..352d9b2e02ab 100644 --- a/arch/loongarch/mm/pgtable.c +++ b/arch/loongarch/mm/pgtable.c @@ -135,15 +135,6 @@ void kernel_pte_init(void *addr) } while (p != end); } -pmd_t mk_pmd(struct page *page, pgprot_t prot) -{ - pmd_t pmd; - - pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot); - - return pmd; -} - void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index d69cfa5a8ac6..4852b005a72d 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -713,9 +713,6 @@ static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd) #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ -/* Extern to avoid header file madness */ -extern pmd_t mk_pmd(struct page *page, pgprot_t prot); - static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { pmd_val(pmd) = (pmd_val(pmd) & (_PAGE_CHG_MASK | _PAGE_HUGE)) | diff --git a/arch/mips/mm/pgtable-32.c b/arch/mips/mm/pgtable-32.c index 84dd5136d53a..e2cf2166d5cb 100644 --- a/arch/mips/mm/pgtable-32.c +++ b/arch/mips/mm/pgtable-32.c @@ -31,16 +31,6 @@ void pgd_init(void *addr) } #if defined(CONFIG_TRANSPARENT_HUGEPAGE) -pmd_t mk_pmd(struct page *page, pgprot_t prot) -{ - pmd_t pmd; - - pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot); - - return pmd; -} - - void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { diff --git a/arch/mips/mm/pgtable-64.c b/arch/mips/mm/pgtable-64.c index 1e544827dea9..b24f865de357 100644 --- a/arch/mips/mm/pgtable-64.c +++ b/arch/mips/mm/pgtable-64.c @@ -90,15 +90,6 @@ void pud_init(void *addr) #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE -pmd_t mk_pmd(struct page *page, pgprot_t prot) -{ - pmd_t pmd; - - pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot); - - 
return pmd; -} - void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 6d98e6f08d4d..6ed93e290c2f 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1096,7 +1096,6 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write) #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); -extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); extern pud_t pud_modify(pud_t pud, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 8f7d41ce2ca1..0e62d25062f8 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -269,11 +269,6 @@ pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot) return __pud_mkhuge(pud_set_protbits(__pud(pudv), pgprot)); } -pmd_t mk_pmd(struct page *page, pgprot_t pgprot) -{ - return pfn_pmd(page_to_pfn(page), pgprot); -} - pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { unsigned long pmdv; diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h index 0897dd99ab8d..188fadc1c21f 100644 --- a/arch/riscv/include/asm/pgtable-64.h +++ b/arch/riscv/include/asm/pgtable-64.h @@ -262,8 +262,6 @@ static inline unsigned long _pmd_pfn(pmd_t pmd) return __page_val_to_pfn(pmd_val(pmd)); } -#define mk_pmd(page, prot) pfn_pmd(page_to_pfn(page), prot) - #define pmd_ERROR(e) \ pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e)) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 3ef5d2198480..1c661ac62ce8 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1869,7 +1869,6 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, #define pmdp_collapse_flush pmdp_collapse_flush #define pfn_pmd(pfn, pgprot) mk_pmd_phys(((pfn) << PAGE_SHIFT), (pgprot)) -#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) static inline int pmd_trans_huge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index d9c903576084..4af03e3c161b 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -233,7 +233,6 @@ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) return __pmd(pte_val(pte)); } -#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) #endif /* This one can be done with two shifts. 
*/ diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2ce98b547a25..3f59d7a16010 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1347,8 +1347,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, #define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) -#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) - #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS extern int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp, diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e893d546a49f..f190998b2ebd 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -495,8 +495,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) struct folio *mm_get_huge_zero_folio(struct mm_struct *mm); void mm_put_huge_zero_folio(struct mm_struct *mm); -#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot)) - static inline bool thp_migration_supported(void) { return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION); From c09b997342bcd1b3c2b63c6ff6fecf037a68beff Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 22:06:03 +0100 Subject: [PATCH 013/285] filemap: remove readahead_page() Patch series "Misc folio patches for 6.16". Remove a few APIs that we've converted everybody from using. I also found a few places that extract a page pointer from i_pages, which will be an invalid thing to do when we separate pages from folios. This patch (of 8): All filesystems have now been converted to call readahead_folio() so we can delete this wrapper. Link: https://lkml.kernel.org/r/20250402210612.2444135-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250402210612.2444135-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 22 +++------------------- 1 file changed, 3 insertions(+), 19 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 26baa78f1ca7..cd4bd0f8e5f6 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1308,9 +1308,9 @@ static inline bool filemap_range_needs_writeback(struct address_space *mapping, * struct readahead_control - Describes a readahead request. * * A readahead request is for consecutive pages. Filesystems which - * implement the ->readahead method should call readahead_page() or - * readahead_page_batch() in a loop and attempt to start I/O against - * each page in the request. + * implement the ->readahead method should call readahead_folio() or + * __readahead_batch() in a loop and attempt to start reads into each + * folio in the request. * * Most of the fields in this struct are private and should be accessed * by the functions below. @@ -1415,22 +1415,6 @@ static inline struct folio *__readahead_folio(struct readahead_control *ractl) return folio; } -/** - * readahead_page - Get the next page to read. - * @ractl: The current readahead request. - * - * Context: The page is locked and has an elevated refcount. The caller - * should decreases the refcount once the page has been submitted for I/O - * and unlock the page once all I/O to that page has completed. - * Return: A pointer to the next page, or %NULL if we are done. - */ -static inline struct page *readahead_page(struct readahead_control *ractl) -{ - struct folio *folio = __readahead_folio(ractl); - - return &folio->page; -} - /** * readahead_folio - Get the next folio to read. 
 * @ractl: The current readahead request.

From a55139579082e4bc9ea9b04003cb9f78c781e08b Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 2 Apr 2025 22:06:04 +0100
Subject: [PATCH 014/285] mm: remove offset_in_thp()

All callers have been converted to call offset_in_folio().

Link: https://lkml.kernel.org/r/20250402210612.2444135-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: David Hildenbrand
Signed-off-by: Andrew Morton
---
 include/linux/mm.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 733dd7100ca1..ae67d3a33792 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2445,7 +2445,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
 extern void pagefault_out_of_memory(void);

 #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
-#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
 #define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1))

 /*

From b57f4f4f186d728a1b484be1f214f17d843b6a24 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 2 Apr 2025 22:06:05 +0100
Subject: [PATCH 015/285] iov_iter: convert iter_xarray_populate_pages() to
 use folios

ITER_XARRAY is exclusively used with xarrays that contain folios, not
pages, so extract folio pointers from it, not page pointers.  Removes a
hidden call to compound_head() and a use of find_subpage().

Link: https://lkml.kernel.org/r/20250402210612.2444135-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: David Hildenbrand
Signed-off-by: Andrew Morton
---
 lib/iov_iter.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bc9391e55d57..3e340d1fcadc 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1059,22 +1059,22 @@ static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa
 		pgoff_t index, unsigned int nr_pages)
 {
 	XA_STATE(xas, xa, index);
-	struct page *page;
+	struct folio *folio;
 	unsigned int ret = 0;

 	rcu_read_lock();
-	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
-		if (xas_retry(&xas, page))
+	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
+		if (xas_retry(&xas, folio))
 			continue;

-		/* Has the page moved or been split? */
-		if (unlikely(page != xas_reload(&xas))) {
+		/* Has the folio moved or been split? */
+		if (unlikely(folio != xas_reload(&xas))) {
 			xas_reset(&xas);
 			continue;
 		}

-		pages[ret] = find_subpage(page, xas.xa_index);
-		get_page(pages[ret]);
+		pages[ret] = folio_file_page(folio, xas.xa_index);
+		folio_get(folio);
 		if (++ret == nr_pages)
 			break;
 	}

From 70d1be00b49a6bbf0b60c5b7f70d57d75bbcd4b7 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 2 Apr 2025 22:06:06 +0100
Subject: [PATCH 016/285] iov_iter: convert iov_iter_extract_xarray_pages() to
 use folios

ITER_XARRAY is exclusively used with xarrays that contain folios, not
pages, so extract folio pointers from it, not page pointers.  Removes a
use of find_subpage().
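For reference, folio_file_page() plays the role find_subpage() used to
play, but starts from a folio.  Roughly (a simplified sketch of the
helper; the old find_subpage() additionally special-cased hugetlb head
pages):

	static inline struct page *folio_file_page(struct folio *folio,
			pgoff_t index)
	{
		/* The low bits of the file index select the subpage. */
		return folio_page(folio, index & (folio_nr_pages(folio) - 1));
	}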
Link: https://lkml.kernel.org/r/20250402210612.2444135-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- lib/iov_iter.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 3e340d1fcadc..d9e19fb2dcf3 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1650,11 +1650,11 @@ static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i, iov_iter_extraction_t extraction_flags, size_t *offset0) { - struct page *page, **p; + struct page **p; + struct folio *folio; unsigned int nr = 0, offset; loff_t pos = i->xarray_start + i->iov_offset; - pgoff_t index = pos >> PAGE_SHIFT; - XA_STATE(xas, i->xarray, index); + XA_STATE(xas, i->xarray, pos >> PAGE_SHIFT); offset = pos & ~PAGE_MASK; *offset0 = offset; @@ -1665,17 +1665,17 @@ static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i, p = *pages; rcu_read_lock(); - for (page = xas_load(&xas); page; page = xas_next(&xas)) { - if (xas_retry(&xas, page)) + for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) { + if (xas_retry(&xas, folio)) continue; - /* Has the page moved or been split? */ - if (unlikely(page != xas_reload(&xas))) { + /* Has the folio moved or been split? */ + if (unlikely(folio != xas_reload(&xas))) { xas_reset(&xas); continue; } - p[nr++] = find_subpage(page, xas.xa_index); + p[nr++] = folio_file_page(folio, xas.xa_index); if (nr == maxpages) break; } From 9c532d79082f9f2e58a45f2e92020dda65b7dedd Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 22:06:07 +0100 Subject: [PATCH 017/285] filemap: remove find_subpage() All users of this function now call folio_file_page() instead. Delete it. Link: https://lkml.kernel.org/r/20250402210612.2444135-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index cd4bd0f8e5f6..0ddd4bd8cdf8 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -945,19 +945,6 @@ static inline bool folio_contains(struct folio *folio, pgoff_t index) return index - folio_index(folio) < folio_nr_pages(folio); } -/* - * Given the page we found in the page cache, return the page corresponding - * to this index in the file - */ -static inline struct page *find_subpage(struct page *head, pgoff_t index) -{ - /* HugeTLBfs wants the head page regardless */ - if (PageHuge(head)) - return head; - - return head + (index & (thp_nr_pages(head) - 1)); -} - unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start, pgoff_t end, struct folio_batch *fbatch); unsigned filemap_get_folios_contig(struct address_space *mapping, From 8dfc8cbf7b07da172b3a4c8bea064e84fbfa5d56 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 22:06:08 +0100 Subject: [PATCH 018/285] filemap: convert __readahead_batch() to use a folio Extract folios from i_mapping, not pages. Removes a hidden call to compound_head(), a use of thp_nr_pages() and an unnecessary assertion that we didn't find a tail page in the page cache. 
Link: https://lkml.kernel.org/r/20250402210612.2444135-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 0ddd4bd8cdf8..c5c9b3770d75 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1424,7 +1424,7 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac, { unsigned int i = 0; XA_STATE(xas, &rac->mapping->i_pages, 0); - struct page *page; + struct folio *folio; BUG_ON(rac->_batch_count > rac->_nr_pages); rac->_nr_pages -= rac->_batch_count; @@ -1433,13 +1433,12 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac, xas_set(&xas, rac->_index); rcu_read_lock(); - xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) { - if (xas_retry(&xas, page)) + xas_for_each(&xas, folio, rac->_index + rac->_nr_pages - 1) { + if (xas_retry(&xas, folio)) continue; - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(PageTail(page), page); - array[i++] = page; - rac->_batch_count += thp_nr_pages(page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + array[i++] = folio_page(folio, 0); + rac->_batch_count += folio_nr_pages(folio); if (i == array_sz) break; } From 41e422a898da69e903a846dfb2b0c0ff62abc3cd Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 22:06:09 +0100 Subject: [PATCH 019/285] filemap: remove readahead_page_batch() This function has no more callers; delete it. Link: https://lkml.kernel.org/r/20250402210612.2444135-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c5c9b3770d75..af25fb640463 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1447,20 +1447,6 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac, return i; } -/** - * readahead_page_batch - Get a batch of pages to read. - * @rac: The current readahead request. - * @array: An array of pointers to struct page. - * - * Context: The pages are locked and have an elevated refcount. The caller - * should decreases the refcount once the page has been submitted for I/O - * and unlock the page once all I/O to that page has completed. - * Return: The number of pages placed in the array. 0 indicates the request - * is complete. - */ -#define readahead_page_batch(rac, array) \ - __readahead_batch(rac, array, ARRAY_SIZE(array)) - /** * readahead_pos - The byte offset into the file of this readahead request. * @rac: The readahead request. From 2355153ea8185e7deaf9ce728716dadb707ef59f Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 22:06:10 +0100 Subject: [PATCH 020/285] mm: delete thp_nr_pages() All callers now use folio_nr_pages(). Delete this wrapper. 
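[Editor's note: for callers that still start from a struct page, the conversion pattern is simple (an illustrative sketch, not taken from any one caller):

	/* before: nr = thp_nr_pages(page); */
	struct folio *folio = page_folio(page);
	long nr = folio_nr_pages(folio);

page_folio() resolves the head page once up front, after which folio_nr_pages() is a plain read.]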
Link: https://lkml.kernel.org/r/20250402210612.2444135-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/mm.h | 9 --------- 1 file changed, 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ae67d3a33792..dcdb798184ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2223,15 +2223,6 @@ static inline long compound_nr(struct page *page) return folio_large_nr_pages(folio); } -/** - * thp_nr_pages - The number of regular pages in this huge page. - * @page: The head page of a huge page. - */ -static inline long thp_nr_pages(struct page *page) -{ - return folio_nr_pages((struct folio *)page); -} - /** * folio_next - Move to the next physical folio. * @folio: The folio we're currently operating on. From 56e5a103a721d0ef139bba7ff3d3ada6c8217d5b Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Wed, 2 Apr 2025 13:44:16 -0700 Subject: [PATCH 021/285] zsmalloc: prefer the original page's node for compressed data Currently, zsmalloc, zswap's and zram's backend memory allocator, does not enforce any policy for the allocation of memory for the compressed data, instead just adopting the memory policy of the task entering reclaim, or the default policy (prefer local node) if no such policy is specified. This can lead to several pathological behaviors in multi-node NUMA systems: 1. Systems with CXL-based memory tiering can encounter the following inversion with zswap/zram: the coldest pages demoted to the CXL tier can return to the high tier when they are reclaimed to compressed swap, creating memory pressure on the high tier. 2. Consider a direct reclaimer scanning nodes in order of allocation preference. If it ventures into remote nodes, the memory it compresses there should stay there. Trying to shift those contents over to the reclaiming thread's preferred node further *increases* its local pressure, provoking more spills. The remote node is also the most likely to refault this data again. This undesirable behavior was pointed out by Johannes Weiner in [1]. 3. For zswap writeback, the zswap entries are organized in node-specific LRUs, based on the node placement of the original pages, allowing for targeted zswap writeback for specific nodes. However, the compressed data of a zswap entry can be placed on a different node from the LRU it is placed on. This means that reclaim targeted at one node might not free up memory used for zswap entries in that node, but instead reclaim memory in a different node. All of these issues will be resolved if the compressed data go to the same node as the original page. This patch encourages this behavior by having zswap and zram pass the node of the original page to zsmalloc, and have zsmalloc prefer the specified node if we need to allocate new (zs)pages for the compressed data. Note that we are not strictly binding the allocation to the preferred node. We still allow the allocation to fall back to other nodes when the preferred node is full, or if we have zspages with slots available on a different node. This is OK, and still a strict improvement over the status quo: 1. On a system with demotion enabled, we will generally prefer demotions over compressed swapping, and only swap when pages have already gone to the lowest tier. This patch should achieve the desired effect for the most part. 2.
If the preferred node is out of memory, letting the compressed data go to other nodes can be better than the alternative (OOMs, keeping cold memory unreclaimed, disk swapping, etc.). 3. If the allocation goes to a separate node because we have a zspage with slots available, at least we're not creating extra immediate memory pressure (since the space is already allocated). 4. While there can be mixing, we generally reclaim pages in same-node batches, which encourages zspage grouping that is more likely to go to the right node. 5. A strict binding would require partitioning zsmalloc by node, which is more complicated, and more prone to regression, since it reduces the storage density of zsmalloc. We need to evaluate the tradeoff and benchmark carefully before adopting such an involved solution. [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/ [senozhatsky@chromium.org: coding-style fixes] Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: Gregory Price Acked-by: Dan Williams Reviewed-by: Chengming Zhou Acked-by: Sergey Senozhatsky [zram, zsmalloc] Acked-by: Johannes Weiner Acked-by: Yosry Ahmed [zswap/zsmalloc] Cc: "Huang, Ying" Cc: Joanthan Cameron Cc: Minchan Kim Cc: SeongJae Park Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 11 ++++++++--- include/linux/zpool.h | 4 ++-- include/linux/zsmalloc.h | 3 ++- mm/zpool.c | 8 +++++--- mm/zsmalloc.c | 20 +++++++++++--------- mm/zswap.c | 2 +- 6 files changed, 29 insertions(+), 19 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index fda7d8624889..0ba18277ed7b 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1694,7 +1694,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page, */ handle = zs_malloc(zram->mem_pool, PAGE_SIZE, GFP_NOIO | __GFP_NOWARN | - __GFP_HIGHMEM | __GFP_MOVABLE); + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page)); if (IS_ERR_VALUE(handle)) return PTR_ERR((void *)handle); @@ -1761,7 +1761,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index) handle = zs_malloc(zram->mem_pool, comp_len, GFP_NOIO | __GFP_NOWARN | - __GFP_HIGHMEM | __GFP_MOVABLE); + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page)); if (IS_ERR_VALUE(handle)) { zcomp_stream_put(zstrm); return PTR_ERR((void *)handle); } @@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page, * We are holding per-CPU stream mutex and entry lock so better * avoid direct reclaim. Allocation error is not fatal since * we still have the old object in the mem_pool. + * + * XXX: technically, the node we really want here is the node that holds + * the original compressed data. But that would require us to modify + * zsmalloc API to return this information. For now, we will make do with + * the node of the page allocated for recompression.
*/ handle_new = zs_malloc(zram->mem_pool, comp_len_new, GFP_NOIO | __GFP_NOWARN | - __GFP_HIGHMEM | __GFP_MOVABLE); + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page)); if (IS_ERR_VALUE(handle_new)) { zcomp_stream_put(zstrm); return PTR_ERR((void *)handle_new); diff --git a/include/linux/zpool.h b/include/linux/zpool.h index 52f30e526607..369ef068fad8 100644 --- a/include/linux/zpool.h +++ b/include/linux/zpool.h @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool); void zpool_destroy_pool(struct zpool *pool); int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp, - unsigned long *handle); + unsigned long *handle, const int nid); void zpool_free(struct zpool *pool, unsigned long handle); @@ -64,7 +64,7 @@ struct zpool_driver { void (*destroy)(void *pool); int (*malloc)(void *pool, size_t size, gfp_t gfp, - unsigned long *handle); + unsigned long *handle, const int nid); void (*free)(void *pool, unsigned long handle); void *(*obj_read_begin)(void *pool, unsigned long handle, diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index c26baf9fb331..13e9cc5490f7 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -26,7 +26,8 @@ struct zs_pool; struct zs_pool *zs_create_pool(const char *name); void zs_destroy_pool(struct zs_pool *pool); -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags); +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags, + const int nid); void zs_free(struct zs_pool *pool, unsigned long obj); size_t zs_huge_class_size(struct zs_pool *pool); diff --git a/mm/zpool.c b/mm/zpool.c index 6d6d88930932..0a71d03369f1 100644 --- a/mm/zpool.c +++ b/mm/zpool.c @@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool) * @size: The amount of memory to allocate. * @gfp: The GFP flags to use when allocating memory. * @handle: Pointer to the handle to set + * @nid: The preferred node id. * * This allocates the requested amount of memory from the pool. * The gfp flags will be used when allocating memory, if the * implementation supports it. The provided @handle will be - * set to the allocated object handle. + * set to the allocated object handle. The allocation will + * prefer the NUMA node specified by @nid. * * Implementations must guarantee this to be thread-safe. * * Returns: 0 on success, negative value on error. 
*/ int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp, - unsigned long *handle) + unsigned long *handle, const int nid) { - return zpool->driver->malloc(zpool->pool, size, gfp, handle); + return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid); } /** diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index d14a7e317ac8..513b08c7c941 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -243,9 +243,9 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc) dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES); } -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp) +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid) { - struct page *page = alloc_page(gfp); + struct page *page = alloc_pages_node(nid, gfp, 0); return page_zpdesc(page); } @@ -462,9 +462,9 @@ static void zs_zpool_destroy(void *pool) } static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp, - unsigned long *handle) + unsigned long *handle, const int nid) { - *handle = zs_malloc(pool, size, gfp); + *handle = zs_malloc(pool, size, gfp, nid); if (IS_ERR_VALUE(*handle)) return PTR_ERR((void *)*handle); @@ -1043,8 +1043,8 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage, * Allocate a zspage for the given size class */ static struct zspage *alloc_zspage(struct zs_pool *pool, - struct size_class *class, - gfp_t gfp) + struct size_class *class, + gfp_t gfp, const int nid) { int i; struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE]; @@ -1061,7 +1061,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool, for (i = 0; i < class->pages_per_zspage; i++) { struct zpdesc *zpdesc; - zpdesc = alloc_zpdesc(gfp); + zpdesc = alloc_zpdesc(gfp, nid); if (!zpdesc) { while (--i >= 0) { zpdesc_dec_zone_page_state(zpdescs[i]); @@ -1336,12 +1336,14 @@ static unsigned long obj_malloc(struct zs_pool *pool, * @pool: pool to allocate from * @size: size of block to allocate * @gfp: gfp flags when allocating object + * @nid: The preferred node id to allocate new zspage (if needed) * * On success, handle to the allocated object is returned, * otherwise an ERR_PTR(). * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail. 
*/ -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp, + const int nid) { unsigned long handle; struct size_class *class; @@ -1376,7 +1378,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) spin_unlock(&class->lock); - zspage = alloc_zspage(pool, class, gfp); + zspage = alloc_zspage(pool, class, gfp, nid); if (!zspage) { cache_free_handle(pool, handle); return (unsigned long)ERR_PTR(-ENOMEM); diff --git a/mm/zswap.c b/mm/zswap.c index 204fb59da33c..455e9425c5f5 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, zpool = pool->zpool; gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE; - alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle); + alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page)); if (alloc_ret) goto unlock; From a75ffa26122bba4a3fa9372aa13b32d7a5590425 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 2 Apr 2025 11:01:17 +0200 Subject: [PATCH 022/285] memcg, oom: do not bypass oom killer for dying tasks 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer") has added a bypass of the oom killer path for dying threads because a very specific workload (described in the changelog) could hit "no killable tasks" path. This itself is not a fatal condition, but it could be annoying if this was a common case. On the other hand the bypass has some issues on its own. Without triggering oom killer we won't be able to trigger async oom reclaim (oom_reaper) which can operate on killed tasks as well, as long as they still have their mm available. This could be the case during futex cleanup, when the memory is still needed, as pointed out by Johannes in [1]. The said case is still not fully understood but let's drop this bypass that was mostly driven by an artificial workload and allow dying tasks to go into the oom path. This will make the code easier to reason about and also help corner cases where oom_reaper could help to release memory. Link: https://lore.kernel.org/all/20241212183012.GB1026@cmpxchg.org/T/#u [1] Link: https://lkml.kernel.org/r/20250402090117.130245-1-mhocko@kernel.org Signed-off-by: Michal Hocko Signed-off-by: Johannes Weiner Suggested-by: Johannes Weiner Acked-by: Shakeel Butt Acked-by: David Rientjes Cc: Muchun Song Cc: Rik van Riel Cc: Roman Gushchin Signed-off-by: Andrew Morton --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c96c1f2b9cf5..b16b5b807d7c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1664,7 +1664,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = task_is_dying() || out_of_memory(&oc); + ret = out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); From cd348c5e6af3b4220eb37a08fdaf9c924a2e1c19 Mon Sep 17 00:00:00 2001 From: Songtang Liu Date: Wed, 2 Apr 2025 03:41:16 -0400 Subject: [PATCH 023/285] mm: page_alloc: remove redundant READ_ONCE In the current code, batch is a local variable, and it cannot be concurrently modified. It's unnecessary to use READ_ONCE here, so remove it.
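[Editor's note: as background, READ_ONCE() only matters for memory that another CPU or interrupt context can change under us; a stack-local variable is private to its frame, so wrapping it adds nothing. A hedged illustration of the distinction (field and variable names chosen for the example):

	int batch = READ_ONCE(pcp->batch);	/* shared field: annotation is meaningful */

	if (pcp->count >= batch)		/* 'batch' is now a local: a plain read suffices */
		free_high = true;
]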
Link: https://lkml.kernel.org/r/CAA=HWd1kn01ym8YuVFuAqK2Ggq3itEGkqX8T6eCXs_C7tiv-Jw@mail.gmail.com Fixes: 51a755c56dc0 ("mm: tune PCP high automatically") Signed-off-by: Songtang Liu Reviewed-by: Qi Zheng Reviewed-by: Huang Ying Reviewed-by: Anshuman Khandual Reviewed-by: Pankaj Gupta Reviewed-by: Oscar Salvador Reviewed-by: Vishal Moola (Oracle) Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8258349e49ac..1ee0e8265888 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2675,7 +2675,7 @@ static void free_frozen_page_commit(struct zone *zone, free_high = (pcp->free_count >= batch && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || - pcp->count >= READ_ONCE(batch))); + pcp->count >= batch)); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; From 737e9d021993c99377bb61e762c0d6945a37615b Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Mon, 27 Jan 2025 10:34:03 -0500 Subject: [PATCH 024/285] memory: implement memory_block_advise/probe_max_size Patch series "memory,x86,acpi: hotplug memory alignment advisement", v8. When physical address regions are not aligned to memory block size, the misaligned portion is lost (stranded capacity). Block size (min/max/selected) is architecture defined. Most architectures tend to use the minimum block size or some simplistic heuristic. On x86, memory block size increases up to 2GB, and is otherwise fitted to the alignment of non-hotplug (i.e. not special purpose memory). CXL exposes its memory for management through the ACPI CEDT (CXL Early Discovery Table) in a field called the CXL Fixed Memory Window. Per the CXL specification, this memory must be aligned to at least 256MB. When a CFMW aligns on a size less than the block size, this causes a loss of up to 2GB per CFMW on x86. It is not uncommon for CFMW to be allocated per-device - though this behavior is BIOS defined. This patch set provides 3 things: 1) implement advise/query functions in drivers/base/memory.c to report/query architecture agnostic hotplug block alignment advice. 2) update x86 memblock size logic to consider the hotplug advice 3) add code in acpi/numa/srat.c to report CFMW alignment advice The advisement interfaces are designed to be called during arch_init code prior to allocator and smp_init. start_kernel will call these through setup_arch() (via acpi and mm/init_64.c on x86), which occurs prior to mm_core_init and smp_init - so no need for atomics. There's an attempt to signal callers to advise() that query has already occurred, but this is predicated on the notion that query actually occurs (which presently only happens on the x86 arch). This is to assist debugging future users. Otherwise, the advise() call has been marked __init to help static discovery of bad call times. Once query is called the first time, it will always return the same value. Interfaces return -ENODEV and 0 respectively on systems without hotplug. This patch (of 3): Hotplug memory sources may have opinions on what the memblock size should be - usually for alignment purposes. For example, CXL memory extents can be 256MB with a matching alignment. If this size/alignment is smaller than the block size, it can result in stranded capacity. Implement memory_block_advise_max_size for use prior to allocator init, for software to advise the system on the max block size.
Implement memory_block_probe_max_size for use by arch init code to calculate the best block size. Use of advice is architecture defined. The probe value can never change after first probe. Calls to advise after probe will return -EBUSY to aid debugging. On systems without hotplug, always return -ENODEV and 0 respectively. Link: https://lkml.kernel.org/r/20250127153405.3379117-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250127153405.3379117-2-gourry@gourry.net Signed-off-by: Gregory Price Suggested-by: Ira Weiny Acked-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) Acked-by: Dan Williams Tested-by: Fan Ni Reviewed-by: Ira Weiny Acked-by: Oscar Salvador Cc: Alison Schofield Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Bruno Faccini Cc: Dave Hansen Cc: Dave Jiang Cc: Greg Kroah-Hartman Cc: Haibo Xu Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joanthan Cameron Cc: Len Brown Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Robert Richter Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- drivers/base/memory.c | 51 ++++++++++++++++++++++++++++++++++++++++++ include/linux/memory.h | 10 +++++++++ 2 files changed, 61 insertions(+) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 19469e7f88c2..ed3e69dc785c 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -110,6 +110,57 @@ static void memory_block_release(struct device *dev) kfree(mem); } + +/* Max block size to be set by memory_block_advise_max_size */ +static unsigned long memory_block_advised_size; +static bool memory_block_advised_size_queried; + +/** + * memory_block_advise_max_size() - advise memory hotplug on the max suggested + * block size, usually for alignment. + * @size: suggestion for maximum block size. must be aligned on power of 2. + * + * Early boot software (pre-allocator init) may advise archs on the max block + * size. This value can only decrease after initialization, as the intent is + * to identify the largest supported alignment for all sources. + * + * Use of this value is arch-defined, as is min/max block size. + * + * Return: 0 on success + * -EINVAL if size is 0 or not pow2 aligned + * -EBUSY if value has already been probed + */ +int __init memory_block_advise_max_size(unsigned long size) +{ + if (!size || !is_power_of_2(size)) + return -EINVAL; + + if (memory_block_advised_size_queried) + return -EBUSY; + + if (memory_block_advised_size) + memory_block_advised_size = min(memory_block_advised_size, size); + else + memory_block_advised_size = size; + + return 0; +} + +/** + * memory_block_advised_max_size() - query advised max hotplug block size. + * + * After the first call, the value can never change. Callers looking for the + * actual block size should use memory_block_size_bytes. This interface is + * intended for use by arch-init when initializing the hotplug block size. + * + * Return: advised size in bytes, or 0 if never set. 
+ */ +unsigned long memory_block_advised_max_size(void) +{ + memory_block_advised_size_queried = true; + return memory_block_advised_size; +} + unsigned long __weak memory_block_size_bytes(void) { return MIN_MEMORY_BLOCK_SIZE; diff --git a/include/linux/memory.h b/include/linux/memory.h index 12daa6ec7d09..5ec4e6d209b9 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -149,6 +149,14 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri) { return 0; } +static inline int memory_block_advise_max_size(unsigned long size) +{ + return -ENODEV; +} +static inline unsigned long memory_block_advised_max_size(void) +{ + return 0; +} #else /* CONFIG_MEMORY_HOTPLUG */ extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); @@ -181,6 +189,8 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, void memory_block_add_nid(struct memory_block *mem, int nid, enum meminit_context context); #endif /* CONFIG_NUMA */ +int memory_block_advise_max_size(unsigned long size); +unsigned long memory_block_advised_max_size(void); #endif /* CONFIG_MEMORY_HOTPLUG */ /* From b1143537098b33affd4e50cec2c10e4b3858d93c Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Mon, 27 Jan 2025 10:34:04 -0500 Subject: [PATCH 025/285] x86: probe memory block size advisement value during mm init Systems with hotplug may provide an advisement value on what the memblock size should be. Probe this value when the rest of the configuration values are considered. The new heuristic is as follows 1) set_memory_block_size_order value if already set (cmdline param) 2) minimum block size if memory is less than large block limit 3) if no hotplug advice: Max block size if system is bare-metal, otherwise use end of memory alignment. 4) if hotplug advice: lesser of advice and end of memory alignment. Convert to cpu_feature_enabled() while at it.[1] [1] https://lore.kernel.org/all/20241031103401.GBZyNdGQ-ZyXKyzC_z@fat_crate.local/ Link: https://lkml.kernel.org/r/20250127153405.3379117-3-gourry@gourry.net Signed-off-by: Gregory Price Suggested-by: Borislav Petkov Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Acked-by: Dave Hansen Acked-by: Mike Rapoport (Microsoft) Acked-by: Dan Williams Tested-by: Fan Ni Reviewed-by: Ira Weiny Acked-by: Oscar Salvador Cc: Alison Schofield Cc: Andy Lutomirski Cc: Bruno Faccini Cc: Dave Jiang Cc: Greg Kroah-Hartman Cc: Haibo Xu Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joanthan Cameron Cc: Len Brown Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Robert Richter Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- arch/x86/mm/init_64.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 7c4f6f591f2b..8cea7dcf8fa0 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1464,16 +1464,21 @@ static unsigned long probe_memory_block_size(void) } /* - * Use max block size to minimize overhead on bare metal, where - * alignment for memory hotplug isn't a concern. + * When hotplug alignment is not a concern, maximize blocksize + * to minimize overhead. Otherwise, align to the lesser of advice + * alignment and end of memory alignment. 
*/ - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + bz = memory_block_advised_max_size(); + if (!bz) { bz = MAX_BLOCK_SIZE; - goto done; + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) + goto done; + } else { + bz = max(min(bz, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE); } /* Find the largest allowed block size that aligns to memory end */ - for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) { + for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) { if (IS_ALIGNED(boot_mem_end, bz)) break; } From 6e3d1b1813c77876e3b354ba94b0f9e1cc23e19f Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Mon, 27 Jan 2025 10:34:05 -0500 Subject: [PATCH 026/285] acpi,srat: give memory block size advice based on CFMWS alignment Capacity is stranded when CFMWS regions are not aligned to block size. On x86, block size increases with capacity (2G blocks @ 64G capacity). Use CFMWS base/size to report memory block size alignment advice. Link: https://lkml.kernel.org/r/20250127153405.3379117-4-gourry@gourry.net Signed-off-by: Gregory Price Suggested-by: Dan Williams Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Acked-by: Dan Williams Tested-by: Fan Ni Acked-by: Rafael J. Wysocki Reviewed-by: Ira Weiny Acked-by: Oscar Salvador Cc: Alison Schofield Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Bruno Faccini Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Haibo Xu Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joanthan Cameron Cc: Len Brown Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Robert Richter Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- drivers/acpi/numa/srat.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 0a725e46d017..5d0cbc5c88a0 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -429,13 +430,23 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header, { struct acpi_cedt_cfmws *cfmws; int *fake_pxm = arg; - u64 start, end; + u64 start, end, align; int node; + int err; cfmws = (struct acpi_cedt_cfmws *)header; start = cfmws->base_hpa; end = cfmws->base_hpa + cfmws->window_size; + /* Align memblock size to CFMW regions if possible */ + align = 1UL << __ffs(start | end); + if (align >= SZ_256M) { + err = memory_block_advise_max_size(align); + if (err) + pr_warn("CFMWS: memblock size advise failed (%d)\n", err); + } else + pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n"); + /* * The SRAT may have already described NUMA details for all, * or a portion of, this CFMWS HPA range. Extend the memblks From b4c829fa4d56f3b566bbbb41c9a8ff0c83ae84c5 Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Mon, 31 Mar 2025 19:10:25 -0700 Subject: [PATCH 027/285] mm/compaction: use folio in hugetlb pathway Use a folio in the hugetlb pathway during the compaction migrate-able pageblock scan. This removes a call to compound_head(). 
Link: https://lkml.kernel.org/r/20250401021025.637333-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) Acked-by: Oscar Salvador Reviewed-by: Zi Yan Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 4 ++-- mm/compaction.c | 8 ++++---- mm/hugetlb.c | 3 +-- 3 files changed, 7 insertions(+), 8 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 8f3ac832ee7f..a57bed83c657 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -695,7 +695,7 @@ struct huge_bootmem_page { bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m); -int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); +int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list); int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn); void wait_for_freed_hugetlb_folios(void); struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, @@ -1083,7 +1083,7 @@ static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h, return NULL; } -static inline int isolate_or_dissolve_huge_page(struct page *page, +static inline int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list) { return -ENOMEM; diff --git a/mm/compaction.c b/mm/compaction.c index ca71fd3c3181..dd868c861774 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1001,10 +1001,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, locked = NULL; } - ret = isolate_or_dissolve_huge_page(page, &cc->migratepages); + folio = page_folio(page); + ret = isolate_or_dissolve_huge_folio(folio, &cc->migratepages); /* - * Fail isolation in case isolate_or_dissolve_huge_page() + * Fail isolation in case isolate_or_dissolve_huge_folio() * reports an error. In case of -ENOMEM, abort right away. */ if (ret < 0) { @@ -1016,12 +1017,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } - if (PageHuge(page)) { + if (folio_test_hugetlb(folio)) { /* * Hugepage was successfully isolated and placed * on the cc->migratepages list. */ - folio = page_folio(page); low_pfn += folio_nr_pages(folio) - 1; goto isolate_success_no_list; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a44d4b0d844c..a2c111447812 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2896,10 +2896,9 @@ free_new: return ret; } -int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) +int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list) { struct hstate *h; - struct folio *folio = page_folio(page); int ret = -EBUSY; /* From 2e976567233228ff928c2405f7e03ebb7fb7aa50 Mon Sep 17 00:00:00 2001 From: Ignacio Encinas Date: Mon, 31 Mar 2025 21:57:05 +0200 Subject: [PATCH 028/285] mm: annotate data race in update_hiwater_rss mm_struct.hiwater_rss can be accessed concurrently without proper synchronization as reported by KCSAN. This data race is benign as it only affects accounting information. Annotate it with data_race() to make KCSAN happy. 
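[Editor's note: for context, data_race() (from <linux/compiler.h>) generates exactly the same code as a plain access; it only tells KCSAN that the race is known and acceptable. Unlike READ_ONCE(), it implies no restriction on compiler tearing or reordering, which is why it is the right tool for accounting-only fields. The annotated read follows the usual pattern:

	/* a stale read is fine: hiwater_rss is accounting-only */
	if (data_race(mm->hiwater_rss) < _rss)
		mm->hiwater_rss = _rss;
]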
Link: https://lkml.kernel.org/r/20250331-mm-maxrss-data-race-v2-1-cf958e6205bf@iencinas.com Signed-off-by: Ignacio Encinas Reported-by: syzbot+419c4b42acc36c420ad3@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67e3390c.050a0220.1ec46.0001.GAE@google.com/ Suggested-by: Lorenzo Stoakes Acked-by: Pedro Falcato Cc: Liam Howlett Signed-off-by: Andrew Morton --- include/linux/mm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index dcdb798184ef..1690f21e7808 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -2796,7 +2797,7 @@ static inline void update_hiwater_rss(struct mm_struct *mm) { unsigned long _rss = get_mm_rss(mm); - if ((mm)->hiwater_rss < _rss) + if (data_race(mm->hiwater_rss) < _rss) (mm)->hiwater_rss = _rss; } From 26d4d18b79659054e451be9487937e31e63d0853 Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Tue, 25 Mar 2025 15:38:03 +0800 Subject: [PATCH 029/285] mm/show_mem: optimize si_meminfo_node by reducing redundant code Refactors the si_meminfo_node() function by reducing redundant code and improving readability. Moved the calculation of managed_pages inside the existing loop that processes pgdat->node_zones, eliminating the need for a separate loop. Simplified the logic by removing unnecessary preprocessor conditionals. Ensured that totalram, totalhigh, and other memory statistics are consistently set without duplication. This change results in cleaner and more efficient code without altering functionality. Link: https://lkml.kernel.org/r/20250325073803.852594-1-ye.liu@linux.dev Signed-off-by: Ye Liu Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Harry Yoo Signed-off-by: Andrew Morton --- mm/show_mem.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/mm/show_mem.c b/mm/show_mem.c index 6af13bcd2ab3..ad373b4b6e39 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -94,26 +94,20 @@ void si_meminfo_node(struct sysinfo *val, int nid) unsigned long free_highpages = 0; pg_data_t *pgdat = NODE_DATA(nid); - for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) - managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]); - val->totalram = managed_pages; - val->sharedram = node_page_state(pgdat, NR_SHMEM); - val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES); -#ifdef CONFIG_HIGHMEM for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { struct zone *zone = &pgdat->node_zones[zone_type]; - + managed_pages += zone_managed_pages(zone); if (is_highmem(zone)) { managed_highpages += zone_managed_pages(zone); free_highpages += zone_page_state(zone, NR_FREE_PAGES); } } + + val->totalram = managed_pages; + val->sharedram = node_page_state(pgdat, NR_SHMEM); + val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES); val->totalhigh = managed_highpages; val->freehigh = free_highpages; -#else - val->totalhigh = managed_highpages; - val->freehigh = free_highpages; -#endif val->mem_unit = PAGE_SIZE; } #endif From 0bf19a357e0eaf03e757ac9482c45a797e40157a Mon Sep 17 00:00:00 2001 From: Siddarth G Date: Thu, 3 Apr 2025 15:43:45 +0530 Subject: [PATCH 030/285] selftests/mm: convert page_size to unsigned long Cppcheck warning: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. This patch changes the type of page_size from 'unsigned int' to 'unsigned long' instead of using ULL suffixes.
Changing hpage_size to 'unsigned long' was considered, but since gethugepage() expects an int, this change was avoided. Link: https://lkml.kernel.org/r/20250403101345.29226-1-siddarthsgml@gmail.com Signed-off-by: Siddarth G Reported-by: David Binderman Closes: https://lore.kernel.org/all/AS8PR02MB10217315060BBFDB21F19643E9CA62@AS8PR02MB10217.eurprd02.prod.outlook.com/ Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/pagemap_ioctl.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index 57b4bba2b45f..fe5ae8b25ff6 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -34,7 +34,7 @@ #define PAGEMAP "/proc/self/pagemap" int pagemap_fd; int uffd; -unsigned int page_size; +unsigned long page_size; unsigned int hpage_size; const char *progname; @@ -184,7 +184,7 @@ void *gethugetlb_mem(int size, int *shmid) int userfaultfd_tests(void) { - int mem_size, vec_size, written, num_pages = 16; + long mem_size, vec_size, written, num_pages = 16; char *mem, *vec; mem_size = num_pages * page_size; @@ -213,7 +213,7 @@ int userfaultfd_tests(void) written = pagemap_ioctl(mem, mem_size, vec, 1, PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC, vec_size - 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); if (written < 0) - ksft_exit_fail_msg("error %d %d %s\n", written, errno, strerror(errno)); + ksft_exit_fail_msg("error %ld %d %s\n", written, errno, strerror(errno)); ksft_test_result(written == 0, "%s all new pages must not be written (dirty)\n", __func__); @@ -995,7 +995,7 @@ int unmapped_region_tests(void) { void *start = (void *)0x10000000; int written, len = 0x00040000; - int vec_size = len / page_size; + long vec_size = len / page_size; struct page_region *vec = malloc(sizeof(struct page_region) * vec_size); /* 1. Get written pages */ @@ -1051,7 +1051,7 @@ static void test_simple(void) int sanity_tests(void) { unsigned long long mem_size, vec_size; - int ret, fd, i, buf_size; + long ret, fd, i, buf_size; struct page_region *vec; char *mem, *fmem; struct stat sbuf; @@ -1160,7 +1160,7 @@ int sanity_tests(void) ret = stat(progname, &sbuf); if (ret < 0) - ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + ksft_exit_fail_msg("error %ld %d %s\n", ret, errno, strerror(errno)); fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0); if (fmem == MAP_FAILED) From cf42d4cccf0d01e375175393776a16dc47b9996f Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 27 Mar 2025 10:58:09 +0900 Subject: [PATCH 031/285] zram: modernize writeback interface The writeback interface supports a page_index=N parameter which performs writeback of the given page. Since we rarely need to writeback just one single page, the typical use case involves a number of writeback calls, each performing writeback of one page: echo page_index=100 > zram0/writeback ... echo page_index=200 > zram0/writeback echo page_index=500 > zram0/writeback ... echo page_index=700 > zram0/writeback One obvious downside of this is that it increases the number of syscalls. Less obvious, but a significantly more important downside, is that when given only one page to post-process zram cannot perform an optimal target selection. 
This becomes a critical limitation when writeback_limit is enabled, because under writeback_limit we want to guarantee the highest memory savings hence we first need to writeback pages that release the highest amount of zsmalloc pool memory. This patch adds page_indexes=LOW-HIGH parameter to the writeback interface: echo page_indexes=100-200 page_indexes=500-700 > zram0/writeback This gives zram a chance to apply an optimal target selection strategy on each iteration of the writeback loop. We also now permit multiple page_index parameters per call (previously zram would recognize only one page_index) and a mix of single pages and page ranges: echo page_index=42 page_index=99 page_indexes=100-200 \ page_indexes=500-700 > zram0/writeback Apart from that the patch also unifies parameter passing and resembles other "modern" zram device attributes (e.g. recompression), while the old interface used a mixed scheme: value-less parameters for mode and a key=value format for page_index. We still support the "old" value-less format for compatibility reasons. [senozhatsky@chromium.org: simplify parse_page_index() range checks, per Brian] Link: https://lkml.kernel.org/r/20250404015327.2427684-1-senozhatsky@chromium.org [senozhatsky@chromium.org: fix uninitialized variable in zram_writeback_slots(), per Dan] Link: https://lkml.kernel.org/r/20250409112611.1154282-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20250327015818.4148660-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Reviewed-by: Brian Geffon Cc: Minchan Kim Cc: Richard Chang Cc: Sergey Senozhatsky Cc: Dan Carpenter Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 17 ++ drivers/block/zram/zram_drv.c | 322 +++++++++++++------- 2 files changed, 233 insertions(+), 106 deletions(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 9bdb30901a93..b8d36134a151 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -369,6 +369,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 0ba18277ed7b..94e6e9b80bf0 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -734,114 +734,19 @@ static void read_from_bdev_async(struct zram *zram, struct page *page, submit_bio(bio); } -#define PAGE_WB_SIG "page_index=" - -#define PAGE_WRITEBACK 0 -#define HUGE_WRITEBACK (1<<0) -#define IDLE_WRITEBACK (1<<1) -#define INCOMPRESSIBLE_WRITEBACK (1<<2) - -static int scan_slots_for_writeback(struct zram *zram, u32 mode, - unsigned long nr_pages, - unsigned long index, - struct zram_pp_ctl *ctl) +static int zram_writeback_slots(struct zram *zram, struct zram_pp_ctl *ctl) { - for (; nr_pages != 0; index++, nr_pages--) { - bool ok = true; - - zram_slot_lock(zram, index); - if (!zram_allocated(zram, index)) - goto next; - - if (zram_test_flag(zram, index, ZRAM_WB) || - zram_test_flag(zram, index, ZRAM_SAME)) - goto next; - - if (mode & IDLE_WRITEBACK && - !zram_test_flag(zram, index, ZRAM_IDLE)) - goto next; - if (mode & HUGE_WRITEBACK && - !zram_test_flag(zram, index, ZRAM_HUGE)) - goto next; - if (mode & INCOMPRESSIBLE_WRITEBACK && - !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE)) - goto next; - - ok = place_pp_slot(zram, ctl, index); -next: - zram_slot_unlock(zram, index); - if (!ok) - break; - } - - return 0; -} - -static ssize_t writeback_store(struct device *dev, - struct device_attribute *attr, const char *buf, size_t len) -{ - struct zram *zram = dev_to_zram(dev); - unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; - struct zram_pp_ctl *ctl = NULL; - struct zram_pp_slot *pps; - unsigned long index = 0; - struct bio bio; - struct bio_vec bio_vec; - struct page *page = NULL; - ssize_t ret = len; - int mode, err; unsigned long blk_idx = 0; - - if (sysfs_streq(buf, "idle")) - mode = IDLE_WRITEBACK; - else if (sysfs_streq(buf, "huge")) - mode = HUGE_WRITEBACK; - else if (sysfs_streq(buf, "huge_idle")) - mode = IDLE_WRITEBACK | HUGE_WRITEBACK; - else if (sysfs_streq(buf, "incompressible")) - mode = INCOMPRESSIBLE_WRITEBACK; - else { - if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1)) - return -EINVAL; - - if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) || - index >= nr_pages) - return -EINVAL; - - nr_pages = 1; - mode = PAGE_WRITEBACK; - } - - down_read(&zram->init_lock); - if (!init_done(zram)) { - ret = -EINVAL; - goto release_init_lock; - } - - /* Do not permit concurrent post-processing actions. 
*/ - if (atomic_xchg(&zram->pp_in_progress, 1)) { - up_read(&zram->init_lock); - return -EAGAIN; - } - - if (!zram->backing_dev) { - ret = -ENODEV; - goto release_init_lock; - } + struct page *page = NULL; + struct zram_pp_slot *pps; + struct bio_vec bio_vec; + struct bio bio; + int ret = 0, err; + u32 index; page = alloc_page(GFP_KERNEL); - if (!page) { - ret = -ENOMEM; - goto release_init_lock; - } - - ctl = init_pp_ctl(); - if (!ctl) { - ret = -ENOMEM; - goto release_init_lock; - } - - scan_slots_for_writeback(zram, mode, nr_pages, index, ctl); + if (!page) + return -ENOMEM; while ((pps = select_pp_slot(ctl))) { spin_lock(&zram->wb_limit_lock); @@ -929,10 +834,215 @@ next: if (blk_idx) free_block_bdev(zram, blk_idx); - -release_init_lock: if (page) __free_page(page); + + return ret; +} + +#define PAGE_WRITEBACK 0 +#define HUGE_WRITEBACK (1 << 0) +#define IDLE_WRITEBACK (1 << 1) +#define INCOMPRESSIBLE_WRITEBACK (1 << 2) + +static int parse_page_index(char *val, unsigned long nr_pages, + unsigned long *lo, unsigned long *hi) +{ + int ret; + + ret = kstrtoul(val, 10, lo); + if (ret) + return ret; + if (*lo >= nr_pages) + return -ERANGE; + *hi = *lo + 1; + return 0; +} + +static int parse_page_indexes(char *val, unsigned long nr_pages, + unsigned long *lo, unsigned long *hi) +{ + char *delim; + int ret; + + delim = strchr(val, '-'); + if (!delim) + return -EINVAL; + + *delim = 0x00; + ret = kstrtoul(val, 10, lo); + if (ret) + return ret; + if (*lo >= nr_pages) + return -ERANGE; + + ret = kstrtoul(delim + 1, 10, hi); + if (ret) + return ret; + if (*hi >= nr_pages || *lo > *hi) + return -ERANGE; + *hi += 1; + return 0; +} + +static int parse_mode(char *val, u32 *mode) +{ + *mode = 0; + + if (!strcmp(val, "idle")) + *mode = IDLE_WRITEBACK; + if (!strcmp(val, "huge")) + *mode = HUGE_WRITEBACK; + if (!strcmp(val, "huge_idle")) + *mode = IDLE_WRITEBACK | HUGE_WRITEBACK; + if (!strcmp(val, "incompressible")) + *mode = INCOMPRESSIBLE_WRITEBACK; + + if (*mode == 0) + return -EINVAL; + return 0; +} + +static int scan_slots_for_writeback(struct zram *zram, u32 mode, + unsigned long lo, unsigned long hi, + struct zram_pp_ctl *ctl) +{ + u32 index = lo; + + while (index < hi) { + bool ok = true; + + zram_slot_lock(zram, index); + if (!zram_allocated(zram, index)) + goto next; + + if (zram_test_flag(zram, index, ZRAM_WB) || + zram_test_flag(zram, index, ZRAM_SAME)) + goto next; + + if (mode & IDLE_WRITEBACK && + !zram_test_flag(zram, index, ZRAM_IDLE)) + goto next; + if (mode & HUGE_WRITEBACK && + !zram_test_flag(zram, index, ZRAM_HUGE)) + goto next; + if (mode & INCOMPRESSIBLE_WRITEBACK && + !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE)) + goto next; + + ok = place_pp_slot(zram, ctl, index); +next: + zram_slot_unlock(zram, index); + if (!ok) + break; + index++; + } + + return 0; +} + +static ssize_t writeback_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct zram *zram = dev_to_zram(dev); + u64 nr_pages = zram->disksize >> PAGE_SHIFT; + unsigned long lo = 0, hi = nr_pages; + struct zram_pp_ctl *ctl = NULL; + char *args, *param, *val; + ssize_t ret = len; + int err, mode = 0; + + down_read(&zram->init_lock); + if (!init_done(zram)) { + up_read(&zram->init_lock); + return -EINVAL; + } + + /* Do not permit concurrent post-processing actions. 
*/ + if (atomic_xchg(&zram->pp_in_progress, 1)) { + up_read(&zram->init_lock); + return -EAGAIN; + } + + if (!zram->backing_dev) { + ret = -ENODEV; + goto release_init_lock; + } + + ctl = init_pp_ctl(); + if (!ctl) { + ret = -ENOMEM; + goto release_init_lock; + } + + args = skip_spaces(buf); + while (*args) { + args = next_arg(args, ¶m, &val); + + /* + * Workaround to support the old writeback interface. + * + * The old writeback interface has a minor inconsistency and + * requires key=value only for page_index parameter, while the + * writeback mode is a valueless parameter. + * + * This is not the case anymore and now all parameters are + * required to have values, however, we need to support the + * legacy writeback interface format so we check if we can + * recognize a valueless parameter as the (legacy) writeback + * mode. + */ + if (!val || !*val) { + err = parse_mode(param, &mode); + if (err) { + ret = err; + goto release_init_lock; + } + + scan_slots_for_writeback(zram, mode, lo, hi, ctl); + break; + } + + if (!strcmp(param, "type")) { + err = parse_mode(val, &mode); + if (err) { + ret = err; + goto release_init_lock; + } + + scan_slots_for_writeback(zram, mode, lo, hi, ctl); + break; + } + + if (!strcmp(param, "page_index")) { + err = parse_page_index(val, nr_pages, &lo, &hi); + if (err) { + ret = err; + goto release_init_lock; + } + + scan_slots_for_writeback(zram, mode, lo, hi, ctl); + continue; + } + + if (!strcmp(param, "page_indexes")) { + err = parse_page_indexes(val, nr_pages, &lo, &hi); + if (err) { + ret = err; + goto release_init_lock; + } + + scan_slots_for_writeback(zram, mode, lo, hi, ctl); + continue; + } + } + + err = zram_writeback_slots(zram, ctl); + if (err) + ret = err; + +release_init_lock: release_pp_ctl(zram, ctl); atomic_set(&zram->pp_in_progress, 0); up_read(&zram->init_lock); From 3a531a9939623e13cf798e9ef5e7edd3e74fe733 Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Fri, 28 Mar 2025 09:20:31 +0800 Subject: [PATCH 032/285] mm/page_alloc: simplify free_page_is_bad by removing free_page_is_bad_report Refactor free_page_is_bad() to call bad_page() directly, removing the intermediate free_page_is_bad_report(). This reduces unnecessary indirection, improving code clarity and maintainability without changing functionality. Link: https://lkml.kernel.org/r/20250328012031.1204993-1-ye.liu@linux.dev Signed-off-by: Ye Liu Reviewed-by: Anshuman Khandual Reviewed-by: Oscar Salvador Signed-off-by: Andrew Morton --- mm/page_alloc.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1ee0e8265888..b8c15d572f2d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -934,19 +934,13 @@ static const char *page_bad_reason(struct page *page, unsigned long flags) return bad_reason; } -static void free_page_is_bad_report(struct page *page) -{ - bad_page(page, - page_bad_reason(page, PAGE_FLAGS_CHECK_AT_FREE)); -} - static inline bool free_page_is_bad(struct page *page) { if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE))) return false; /* Something has gone sideways, find it */ - free_page_is_bad_report(page); + bad_page(page, page_bad_reason(page, PAGE_FLAGS_CHECK_AT_FREE)); return true; } From bb317f00b9b7f59de5a2ef512a76a1ae285bd23f Mon Sep 17 00:00:00 2001 From: Michal Clapinski Date: Fri, 4 Apr 2025 13:11:02 +0200 Subject: [PATCH 033/285] mm/compaction: remove low watermark cap for proactive compaction Patch series "mm/compaction: allow more aggressive proactive compaction", v4. 
Our goal is to keep memory usage of a VM low on the host. For that reason, we use free page reporting which by default reports free pages of order 9 and larger to the host to be freed. The feature works well only if the memory in the guest is not fragmented below pages of order 9. Proactive compaction can be reused to achieve defragmentation after some parameter tweaking. When the fragmentation score (lower is better) gets larger than the high watermark, proactive compaction kicks in. Compaction stops when the score goes below the low watermark (or no progress is made and backoff kicks in). Let's define the difference between high and low watermarks as leeway. Before these changes, the minimum possible value for low watermark was 5 and the leeway was hardcoded to 10 (so minimum possible value for high watermark was 15). To test this, I created a VM with 19GB of memory and free page reporting enabled. The VM was ~idle. I measured the memory usage from inside the guest (/proc/meminfo) and from the host (provided by the hypervisor). Before: https://drive.google.com/file/d/1Xw23lRry_PgEH3f6QRnSGvoHh2u9UHyI/view?usp=sharing After: https://drive.google.com/file/d/1wMhpIzepx6t44F70yCPA50n1S5V2AT-a/view?usp=sharing This patch (of 2): Previously a min cap of 5 was set in the commit introducing proactive compaction. This was to make sure users don't hurt themselves by setting the proactiveness to 100 and making their system unresponsive. But the compaction mechanism has a backoff mechanism that will sleep for 30s if no progress is made, so I don't see a significant risk here. My system (19GB of memory) has been perfectly fine with both watermarks hardcoded to 0. Link: https://lkml.kernel.org/r/20250404111103.1994507-1-mclapinski@google.com Link: https://lkml.kernel.org/r/20250404111103.1994507-2-mclapinski@google.com Signed-off-by: Michal Clapinski Cc: Vlastimil Babka Cc: Mel Gorman Signed-off-by: Andrew Morton --- mm/compaction.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index dd868c861774..f9ee06d55726 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2251,12 +2251,7 @@ static unsigned int fragmentation_score_wmark(bool low) { unsigned int wmark_low; - /* - * Cap the low watermark to avoid excessive compaction - * activity in case a user sets the proactiveness tunable - * close to 100 (maximum). - */ - wmark_low = max(100U - sysctl_compaction_proactiveness, 5U); + wmark_low = 100U - sysctl_compaction_proactiveness; return low ? wmark_low : min(wmark_low + 10, 100U); } From 98c9389042f4d1e6aa73fbaf79e2e962c9497fc5 Mon Sep 17 00:00:00 2001 From: Michal Clapinski Date: Fri, 4 Apr 2025 13:11:03 +0200 Subject: [PATCH 034/285] mm/compaction: reduce the difference between low and high watermarks Reduce the diff between low and high watermarks when compaction proactiveness is set to high. This allows users who set the proactiveness really high to have a more stable fragmentation score over time.
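[Editor's note: to make the new formula concrete (see the diff below: wmark_low = 100 - proactiveness, leeway = min(10, wmark_low / 2), wmark_high = min(wmark_low + leeway, 100)): proactiveness=20 gives low=80, leeway=10, high=90, unchanged from before; proactiveness=90 gives low=10, leeway=5, high=15; proactiveness=100 gives low=0, leeway=0, high=0. Only settings above 80 see a reduced leeway.]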
Link: https://lkml.kernel.org/r/20250404111103.1994507-3-mclapinski@google.com Signed-off-by: Michal Clapinski Cc: Mel Gorman Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- Documentation/admin-guide/sysctl/vm.rst | 6 ++++++ mm/compaction.c | 5 +++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 8290177b4f75..b325bfbc2611 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -131,6 +131,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. diff --git a/mm/compaction.c b/mm/compaction.c index f9ee06d55726..fe51a73b91a7 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2249,10 +2249,11 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat) static unsigned int fragmentation_score_wmark(bool low) { - unsigned int wmark_low; + unsigned int wmark_low, leeway; wmark_low = 100U - sysctl_compaction_proactiveness; - return low ? wmark_low : min(wmark_low + 10, 100U); + leeway = min(10U, wmark_low / 2); + return low ? wmark_low : min(wmark_low + leeway, 100U); } static bool should_proactive_compact_node(pg_data_t *pgdat) From aa8d89d1472b010cbcda7288a4da76e4852a260d Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 2 Apr 2025 22:33:26 -0700 Subject: [PATCH 035/285] memcg: vmalloc: simplify MEMCG_VMALLOC updates The vmalloc region can either be charged to a single memcg or none. At the moment, the kernel traverses all the pages backing the vmalloc region to update the MEMCG_VMALLOC stat. However, there is no need to look at all the pages as all those pages will be charged to a single memcg or none. Simplify the MEMCG_VMALLOC update by just looking at the first page of the vmalloc region. [shakeel.butt@linux.dev: add comment] Link: https://lkml.kernel.org/r/bmlkdbqgwboyqrnxyom7n52fjmo76ux77jhqw5odc6c6dfon3h@zdylwtmlywbt Link: https://lkml.kernel.org/r/20250403053326.26860-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Yosry Ahmed Cc: Johannes Weiner Cc: Muchun Song Cc: Roman Gushchin Signed-off-by: Andrew Morton --- mm/vmalloc.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 2d7511654831..44b735a4573a 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3371,12 +3371,13 @@ void vfree(const void *addr) if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS)) vm_reset_perms(vm); + /* All pages of vm should be charged to same memcg, so use first one.
*/ + if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES)) + mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages); for (i = 0; i < vm->nr_pages; i++) { struct page *page = vm->pages[i]; BUG_ON(!page); - if (!(vm->flags & VM_MAP_PUT_PAGES)) - mod_memcg_page_state(page, MEMCG_VMALLOC, -1); /* * High-order allocs for huge vmallocs are split, so * can be freed as an array of order-0 allocations @@ -3672,12 +3673,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, node, page_order, nr_small_pages, area->pages); atomic_long_add(area->nr_pages, &nr_vmalloc_pages); - if (gfp_mask & __GFP_ACCOUNT) { - int i; - - for (i = 0; i < area->nr_pages; i++) - mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1); - } + /* All pages of vm should be charged to same memcg, so use first one. */ + if (gfp_mask & __GFP_ACCOUNT && area->nr_pages) + mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC, + area->nr_pages); /* * If not enough pages were obtained to accomplish an From e56fa8f5e108676393a363201819a094b0991b3c Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:05 -0700 Subject: [PATCH 036/285] memcg: remove root memcg check from refill_stock refill_stock() cannot be called with the root memcg, so there is no need to check for it. Instead, add a warning if root is ever passed to it. Link: https://lkml.kernel.org/r/20250404013913.1663035-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250404013913.1663035-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Roman Gushchin Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/memcontrol.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b16b5b807d7c..ae1e953cead7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1893,13 +1893,13 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { unsigned long flags; + VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { /* * In case of unlikely failure to lock percpu stock_lock * uncharge memcg directly. */ - if (mem_cgroup_is_root(memcg)) - return; page_counter_uncharge(&memcg->memory, nr_pages); if (do_memsw_account()) page_counter_uncharge(&memcg->memsw, nr_pages); From 65d2d15f41c603dc8f11ff0d1bcda2dd847a65b5 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:06 -0700 Subject: [PATCH 037/285] memcg: decouple drain_obj_stock from local stock Currently drain_obj_stock() can potentially call __refill_stock which accesses local cpu stock and thus requires memcg stock's local_lock. However, if we look at the code paths leading to drain_obj_stock(), there is never a good reason to refill the memcg stock at all from it. At the moment, drain_obj_stock can be called from reclaim, hotplug cpu teardown, mod_objcg_state() and refill_obj_stock(). For reclaim and hotplug there is no need to refill. For the other two paths, most probably the newly switched objcg would be used in the near future and thus there is no need to refill the stock with the older objcg. In addition, __refill_stock() from drain_obj_stock() happens only in rare cases, so performance is not really an issue. Let's just uncharge directly instead of refilling, which will also decouple drain_obj_stock from local cpu stock and local_lock requirements.
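The refill-versus-uncharge distinction can be pictured with a toy userspace model (purely illustrative, not kernel code; one counter stands in for the memcg page_counter and another for the per-CPU stock):

#include <stdio.h>

static long charged_pages;	/* stands in for the memcg page_counter */
static long stock_pages;	/* stands in for the per-CPU cached stock */

/* refill: keep the pages cached locally for fast future charges */
static void refill_model(long nr)
{
	stock_pages += nr;
}

/* direct uncharge: return the pages to the counter immediately */
static void uncharge_model(long nr)
{
	charged_pages -= nr;
}

int main(void)
{
	charged_pages = 64;

	/* old drain_obj_stock() behavior: pages linger in the stock */
	refill_model(8);

	/* new behavior: pages go straight back, the stock is untouched */
	uncharge_model(8);

	printf("charged=%ld stock=%ld\n", charged_pages, stock_pages);
	return 0;
}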
Link: https://lkml.kernel.org/r/20250404013913.1663035-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ae1e953cead7..52be78515d70 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2876,7 +2876,12 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); memcg1_account_kmem(memcg, -nr_pages); - __refill_stock(memcg, nr_pages); + if (!mem_cgroup_is_root(memcg)) { + page_counter_uncharge(&memcg->memory, nr_pages); + if (do_memsw_account()) + page_counter_uncharge(&memcg->memsw, + nr_pages); + } css_put(&memcg->css); } From 89f342af660332656bbade4cc8a919a40674ac41 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:07 -0700 Subject: [PATCH 038/285] memcg: introduce memcg_uncharge At multiple places in memcontrol.c, the memory and memsw page counters are being uncharged. This is error-prone. Let's move the functionality to a newly introduced memcg_uncharge and call it from all those places. Link: https://lkml.kernel.org/r/20250404013913.1663035-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 52be78515d70..dfb3f14c1178 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1822,6 +1822,13 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, return ret; } +static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + page_counter_uncharge(&memcg->memory, nr_pages); + if (do_memsw_account()) + page_counter_uncharge(&memcg->memsw, nr_pages); +} + /* * Returns stocks cached in percpu and reset cached information. */ @@ -1834,10 +1841,7 @@ static void drain_stock(struct memcg_stock_pcp *stock) return; if (stock_pages) { - page_counter_uncharge(&old->memory, stock_pages); - if (do_memsw_account()) - page_counter_uncharge(&old->memsw, stock_pages); - + memcg_uncharge(old, stock_pages); WRITE_ONCE(stock->nr_pages, 0); } @@ -1900,9 +1904,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) * In case of unlikely failure to lock percpu stock_lock * uncharge memcg directly. 
*/ - page_counter_uncharge(&memcg->memory, nr_pages); - if (do_memsw_account()) - page_counter_uncharge(&memcg->memsw, nr_pages); + memcg_uncharge(memcg, nr_pages); return; } __refill_stock(memcg, nr_pages); @@ -2876,12 +2878,8 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); memcg1_account_kmem(memcg, -nr_pages); - if (!mem_cgroup_is_root(memcg)) { - page_counter_uncharge(&memcg->memory, nr_pages); - if (do_memsw_account()) - page_counter_uncharge(&memcg->memsw, - nr_pages); - } + if (!mem_cgroup_is_root(memcg)) + memcg_uncharge(memcg, nr_pages); css_put(&memcg->css); } @@ -4702,9 +4700,7 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug) static void uncharge_batch(const struct uncharge_gather *ug) { if (ug->nr_memory) { - page_counter_uncharge(&ug->memcg->memory, ug->nr_memory); - if (do_memsw_account()) - page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory); + memcg_uncharge(ug->memcg, ug->nr_memory); if (ug->nr_kmem) { mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem); memcg1_account_kmem(ug->memcg, -ug->nr_kmem); From cbc091441d3adce9a1fbcd0a0a84bd6abe077a5b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:08 -0700 Subject: [PATCH 039/285] memcg: manually inline __refill_stock There are no more multiple callers of __refill_stock(), so simply inline it to refill_stock(). Link: https://lkml.kernel.org/r/20250404013913.1663035-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 36 ++++++++++++++---------------------- 1 file changed, 14 insertions(+), 22 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dfb3f14c1178..03a2be6d4a67 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1871,30 +1871,10 @@ static void drain_local_stock(struct work_struct *dummy) obj_cgroup_put(old); } -/* - * Cache charges(val) to local per_cpu area. - * This will be consumed by consume_stock() function, later. 
- */ -static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; unsigned int stock_pages; - - stock = this_cpu_ptr(&memcg_stock); - if (READ_ONCE(stock->cached) != memcg) { /* reset if necessary */ - drain_stock(stock); - css_get(&memcg->css); - WRITE_ONCE(stock->cached, memcg); - } - stock_pages = READ_ONCE(stock->nr_pages) + nr_pages; - WRITE_ONCE(stock->nr_pages, stock_pages); - - if (stock_pages > MEMCG_CHARGE_BATCH) - drain_stock(stock); -} - -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) -{ unsigned long flags; VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); @@ -1907,7 +1887,19 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) memcg_uncharge(memcg, nr_pages); return; } - __refill_stock(memcg, nr_pages); + + stock = this_cpu_ptr(&memcg_stock); + if (READ_ONCE(stock->cached) != memcg) { /* reset if necessary */ + drain_stock(stock); + css_get(&memcg->css); + WRITE_ONCE(stock->cached, memcg); + } + stock_pages = READ_ONCE(stock->nr_pages) + nr_pages; + WRITE_ONCE(stock->nr_pages, stock_pages); + + if (stock_pages > MEMCG_CHARGE_BATCH) + drain_stock(stock); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); } From b6d0471117daec360bad2b881c1e2d0f4d2c4d3b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:09 -0700 Subject: [PATCH 040/285] memcg: no refilling stock from obj_cgroup_release obj_cgroup_release is called when all the references to the objcg have been released i.e. no more memory objects are pointing to it. Most probably objcg->memcg will be pointing to some ancestor memcg. In obj_cgroup_release(), the kernel calls obj_cgroup_uncharge_pages() which refills the local stock. There is no need to refill the local stock with some ancestor memcg and flush the local stock. Let's decouple obj_cgroup_release() from the local stock by uncharging instead of refilling. One additional benefit of this change is that it removes the requirement to only call obj_cgroup_put() outside of local_lock. 
Link: https://lkml.kernel.org/r/20250404013913.1663035-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 03a2be6d4a67..df52084e90f4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -129,8 +129,7 @@ bool mem_cgroup_kmem_disabled(void) return cgroup_memory_nokmem; } -static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, - unsigned int nr_pages); +static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages); static void obj_cgroup_release(struct percpu_ref *ref) { @@ -163,8 +162,16 @@ static void obj_cgroup_release(struct percpu_ref *ref) WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); nr_pages = nr_bytes >> PAGE_SHIFT; - if (nr_pages) - obj_cgroup_uncharge_pages(objcg, nr_pages); + if (nr_pages) { + struct mem_cgroup *memcg; + + memcg = get_mem_cgroup_from_objcg(objcg); + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); + memcg1_account_kmem(memcg, -nr_pages); + if (!mem_cgroup_is_root(memcg)) + memcg_uncharge(memcg, nr_pages); + mem_cgroup_put(memcg); + } spin_lock_irqsave(&objcg_lock, flags); list_del(&objcg->list); From ae51c775aa2b1fd4fbaf781740ec8762292bec3a Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:10 -0700 Subject: [PATCH 041/285] memcg: do obj_cgroup_put inside drain_obj_stock Previously we could not call obj_cgroup_put() inside the local lock because, on the put of the last reference, the release function obj_cgroup_release() may try to re-acquire the local lock. However, that chain has now been broken. Simply do obj_cgroup_put() inside drain_obj_stock() instead of returning the old objcg.
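The constraint being lifted can be sketched with a toy program (assuming a non-recursive pthread mutex as a stand-in for the per-CPU stock_lock; the refcount and helper names here are invented):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t stock_lock = PTHREAD_MUTEX_INITIALIZER;
static int objcg_refs = 1;

/*
 * With the old code, the release path could re-enter the stock code
 * and try to take stock_lock again, deadlocking on a non-recursive
 * lock. The previous patch broke that chain, so releasing here no
 * longer touches stock_lock at all.
 */
static void objcg_release_model(void)
{
	printf("released without touching stock_lock\n");
}

static void obj_cgroup_put_model(void)
{
	if (--objcg_refs == 0)
		objcg_release_model();
}

int main(void)
{
	pthread_mutex_lock(&stock_lock);
	obj_cgroup_put_model();	/* now safe while holding the lock */
	pthread_mutex_unlock(&stock_lock);
	return 0;
}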
Link: https://lkml.kernel.org/r/20250404013913.1663035-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Roman Gushchin Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 37 +++++++++++-------------------------- 1 file changed, 11 insertions(+), 26 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df52084e90f4..7988a42b29bf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1785,7 +1785,7 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { }; static DEFINE_MUTEX(percpu_charge_mutex); -static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock); +static void drain_obj_stock(struct memcg_stock_pcp *stock); static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg); @@ -1859,7 +1859,6 @@ static void drain_stock(struct memcg_stock_pcp *stock) static void drain_local_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; - struct obj_cgroup *old = NULL; unsigned long flags; /* @@ -1870,12 +1869,11 @@ static void drain_local_stock(struct work_struct *dummy) local_lock_irqsave(&memcg_stock.stock_lock, flags); stock = this_cpu_ptr(&memcg_stock); - old = drain_obj_stock(stock); + drain_obj_stock(stock); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - obj_cgroup_put(old); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1958,18 +1956,16 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { struct memcg_stock_pcp *stock; - struct obj_cgroup *old; unsigned long flags; stock = &per_cpu(memcg_stock, cpu); /* drain_obj_stock requires stock_lock */ local_lock_irqsave(&memcg_stock.stock_lock, flags); - old = drain_obj_stock(stock); + drain_obj_stock(stock); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); drain_stock(stock); - obj_cgroup_put(old); return 0; } @@ -2766,24 +2762,20 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) } /* Replace the stock objcg with objcg, return the old objcg */ -static struct obj_cgroup *replace_stock_objcg(struct memcg_stock_pcp *stock, - struct obj_cgroup *objcg) +static void replace_stock_objcg(struct memcg_stock_pcp *stock, + struct obj_cgroup *objcg) { - struct obj_cgroup *old = NULL; - - old = drain_obj_stock(stock); + drain_obj_stock(stock); obj_cgroup_get(objcg); stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; WRITE_ONCE(stock->cached_objcg, objcg); - return old; } static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) { struct memcg_stock_pcp *stock; - struct obj_cgroup *old = NULL; unsigned long flags; int *bytes; @@ -2796,7 +2788,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, * changes. 
*/ if (READ_ONCE(stock->cached_objcg) != objcg) { - old = replace_stock_objcg(stock, objcg); + replace_stock_objcg(stock, objcg); stock->cached_pgdat = pgdat; } else if (stock->cached_pgdat != pgdat) { /* Flush the existing cached vmstat data */ @@ -2837,7 +2829,6 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, __mod_objcg_mlstate(objcg, pgdat, idx, nr); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - obj_cgroup_put(old); } static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) @@ -2859,12 +2850,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) return ret; } -static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct memcg_stock_pcp *stock) { struct obj_cgroup *old = READ_ONCE(stock->cached_objcg); if (!old) - return NULL; + return; if (stock->nr_bytes) { unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT; @@ -2917,11 +2908,7 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) } WRITE_ONCE(stock->cached_objcg, NULL); - /* - * The `old' objects needs to be released by the caller via - * obj_cgroup_put() outside of memcg_stock_pcp::stock_lock. - */ - return old; + obj_cgroup_put(old); } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, @@ -2943,7 +2930,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge) { struct memcg_stock_pcp *stock; - struct obj_cgroup *old = NULL; unsigned long flags; unsigned int nr_pages = 0; @@ -2951,7 +2937,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ - old = replace_stock_objcg(stock, objcg); + replace_stock_objcg(stock, objcg); allow_uncharge = true; /* Allow uncharge when objcg changes */ } stock->nr_bytes += nr_bytes; @@ -2962,7 +2948,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, } local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - obj_cgroup_put(old); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); From 42a1910cfd23c46307a000a81127e02df247284c Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:11 -0700 Subject: [PATCH 042/285] memcg: use __mod_memcg_state in drain_obj_stock For non-PREEMPT_RT kernels, drain_obj_stock() is always called with irq disabled, so we can use __mod_memcg_state() instead of mod_memcg_state(). For PREEMPT_RT, we need to add memcg_stats_[un]lock in __mod_memcg_state(). 
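The convention this relies on can be modeled in userspace (assuming a boolean flag as a stand-in for the irq-disabled state; all names below are invented for illustration):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool irqs_off;	/* stands in for the irq-disabled state */
static long stat_counter;

/* "__" variant: caller must already have interrupts disabled */
static void __mod_stat(long val)
{
	assert(irqs_off);
	stat_counter += val;
}

/* plain variant: disables interrupts around the update itself */
static void mod_stat(long val)
{
	irqs_off = true;	/* models local_irq_save() */
	__mod_stat(val);
	irqs_off = false;	/* models local_irq_restore() */
}

int main(void)
{
	mod_stat(1);

	/* drain_obj_stock() path: irqs already off, use the __ variant */
	irqs_off = true;
	__mod_stat(-1);
	irqs_off = false;

	printf("counter=%ld\n", stat_counter);
	return 0;
}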
Link: https://lkml.kernel.org/r/20250404013913.1663035-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Sebastian Andrzej Siewior Reviewed-by: Roman Gushchin Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/memcontrol.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7988a42b29bf..33aeddfff0ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -710,10 +710,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; + memcg_stats_lock(); __this_cpu_add(memcg->vmstats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); memcg_rstat_updated(memcg, val); trace_mod_memcg_state(memcg, idx, val); + memcg_stats_unlock(); } #ifdef CONFIG_MEMCG_V1 @@ -2866,7 +2868,7 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) memcg = get_mem_cgroup_from_objcg(old); - mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); + __mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); memcg1_account_kmem(memcg, -nr_pages); if (!mem_cgroup_is_root(memcg)) memcg_uncharge(memcg, nr_pages); From bc730030f956fe83e94625c546c30ce6bae32d6b Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 3 Apr 2025 18:39:12 -0700 Subject: [PATCH 043/285] memcg: combine slab obj stock charging and accounting When handling slab objects, we use obj_cgroup_[un]charge() for (un)charging and mod_objcg_state() to account NR_SLAB_[UN]RECLAIMABLE_B. All these operations use the percpu stock for performance. However, with the calls being separate, the stock_lock is taken twice in each case. By refactoring the code, we can turn mod_objcg_state() into __account_obj_stock() which is called on a stock that's already locked and validated. On the charging side we can call this function from consume_obj_stock() when it succeeds, and refill_obj_stock() in the fallback. We just expand parameters of these functions as necessary. The uncharge side from __memcg_slab_free_hook() is just the call to refill_obj_stock(). Other callers of obj_cgroup_[un]charge() (i.e. not slab) simply pass the extra parameters as NULL/zeroes to skip the __account_obj_stock() operation. In __memcg_slab_post_alloc_hook() we now charge each object separately, but that's not a problem as we did call mod_objcg_state() for each object separately, and most allocations are non-bulk anyway. This could be improved by batching all operations until slab_pgdat(slab) changes. Some preliminary benchmarking with a kfree(kmalloc()) loop of 10M iterations with/without __GFP_ACCOUNT: Before the patch: kmalloc/kfree !memcg: 581390144 cycles kmalloc/kfree memcg: 783689984 cycles After the patch: kmalloc/kfree memcg: 658723808 cycles More than half of the overhead of __GFP_ACCOUNT relative to the non-accounted case seems eliminated.
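For reference, a microbenchmark along those lines could look roughly like the following module sketch (hypothetical code, not the benchmark actually used; it reports nanoseconds rather than cycles, and fails its own load on purpose so nothing stays resident):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

static int __init kmbench_init(void)
{
	ktime_t t0 = ktime_get();
	int i;

	for (i = 0; i < 10 * 1000 * 1000; i++)
		kfree(kmalloc(64, GFP_KERNEL | __GFP_ACCOUNT));

	pr_info("10M kmalloc/kfree: %lld ns\n",
		ktime_to_ns(ktime_sub(ktime_get(), t0)));
	return -EAGAIN;	/* measurement only: refuse to stay loaded */
}
module_init(kmbench_init);
MODULE_LICENSE("GPL");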
Link: https://lkml.kernel.org/r/20250404013913.1663035-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Signed-off-by: Vlastimil Babka Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 77 +++++++++++++++++++++++++++++-------------------- 1 file changed, 46 insertions(+), 31 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33aeddfff0ba..3bb02f672e39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2774,25 +2774,17 @@ static void replace_stock_objcg(struct memcg_stock_pcp *stock, WRITE_ONCE(stock->cached_objcg, objcg); } -static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, - enum node_stat_item idx, int nr) +static void __account_obj_stock(struct obj_cgroup *objcg, + struct memcg_stock_pcp *stock, int nr, + struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; - unsigned long flags; int *bytes; - local_lock_irqsave(&memcg_stock.stock_lock, flags); - stock = this_cpu_ptr(&memcg_stock); - /* * Save vmstat data in stock and skip vmstat array update unless - * accumulating over a page of vmstat data or when pgdat or idx - * changes. + * accumulating over a page of vmstat data or when pgdat changes. */ - if (READ_ONCE(stock->cached_objcg) != objcg) { - replace_stock_objcg(stock, objcg); - stock->cached_pgdat = pgdat; - } else if (stock->cached_pgdat != pgdat) { + if (stock->cached_pgdat != pgdat) { /* Flush the existing cached vmstat data */ struct pglist_data *oldpg = stock->cached_pgdat; @@ -2829,11 +2821,10 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, } if (nr) __mod_objcg_mlstate(objcg, pgdat, idx, nr); - - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); } -static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, + struct pglist_data *pgdat, enum node_stat_item idx) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -2845,6 +2836,9 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; + + if (pgdat) + __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } local_unlock_irqrestore(&memcg_stock.stock_lock, flags); @@ -2929,7 +2923,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, - bool allow_uncharge) + bool allow_uncharge, int nr_acct, struct pglist_data *pgdat, + enum node_stat_item idx) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -2944,6 +2939,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, } stock->nr_bytes += nr_bytes; + if (pgdat) + __account_obj_stock(objcg, stock, nr_acct, pgdat, idx); + if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) { nr_pages = stock->nr_bytes >> PAGE_SHIFT; stock->nr_bytes &= (PAGE_SIZE - 1); @@ -2955,12 +2953,13 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, obj_cgroup_uncharge_pages(objcg, nr_pages); } -int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) +static int obj_cgroup_charge_account(struct obj_cgroup *objcg, gfp_t gfp, size_t size, + struct pglist_data *pgdat, enum node_stat_item idx) { unsigned int nr_pages, nr_bytes; int ret; - if 
(consume_obj_stock(objcg, size)) + if (likely(consume_obj_stock(objcg, size, pgdat, idx))) return 0; /* @@ -2993,15 +2992,21 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) nr_pages += 1; ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages); - if (!ret && nr_bytes) - refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, false); + if (!ret && (nr_bytes || pgdat)) + refill_obj_stock(objcg, nr_bytes ? PAGE_SIZE - nr_bytes : 0, + false, size, pgdat, idx); return ret; } +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) +{ + return obj_cgroup_charge_account(objcg, gfp, size, NULL, 0); +} + void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) { - refill_obj_stock(objcg, size, true); + refill_obj_stock(objcg, size, true, 0, NULL, 0); } static inline size_t obj_full_size(struct kmem_cache *s) @@ -3053,23 +3058,32 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, return false; } - if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) - return false; - for (i = 0; i < size; i++) { slab = virt_to_slab(p[i]); if (!slab_obj_exts(slab) && alloc_slab_obj_exts(slab, s, flags, false)) { - obj_cgroup_uncharge(objcg, obj_full_size(s)); continue; } + /* + * if we fail and size is 1, memcg_alloc_abort_single() will + * just free the object, which is ok as we have not assigned + * objcg to its obj_ext yet + * + * for larger sizes, kmem_cache_free_bulk() will uncharge + * any objects that were already charged and obj_ext assigned + * + * TODO: we could batch this until slab_pgdat(slab) changes + * between iterations, with a more complicated undo + */ + if (obj_cgroup_charge_account(objcg, flags, obj_full_size(s), + slab_pgdat(slab), cache_vmstat_idx(s))) + return false; + off = obj_to_index(s, slab, p[i]); obj_cgroup_get(objcg); slab_obj_exts(slab)[off].objcg = objcg; - mod_objcg_state(objcg, slab_pgdat(slab), - cache_vmstat_idx(s), obj_full_size(s)); } return true; @@ -3078,6 +3092,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, int objects, struct slabobj_ext *obj_exts) { + size_t obj_size = obj_full_size(s); + for (int i = 0; i < objects; i++) { struct obj_cgroup *objcg; unsigned int off; @@ -3088,9 +3104,8 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, continue; obj_exts[off].objcg = NULL; - obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), - -obj_full_size(s)); + refill_obj_stock(objcg, obj_size, true, -obj_size, + slab_pgdat(slab), cache_vmstat_idx(s)); obj_cgroup_put(objcg); } } From ac26920d58224a6d5c78ad3ff0d58fd7faa775c7 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 3 Apr 2025 18:39:13 -0700 Subject: [PATCH 044/285] memcg: manually inline replace_stock_objcg The replace_stock_objcg() is being called by only refill_obj_stock, so manually inline it. 
Link: https://lkml.kernel.org/r/20250404013913.1663035-10-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Roman Gushchin Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3bb02f672e39..aebb1f2c8657 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2763,17 +2763,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) obj_cgroup_put(objcg); } -/* Replace the stock objcg with objcg, return the old objcg */ -static void replace_stock_objcg(struct memcg_stock_pcp *stock, - struct obj_cgroup *objcg) -{ - drain_obj_stock(stock); - obj_cgroup_get(objcg); - stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) - ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; - WRITE_ONCE(stock->cached_objcg, objcg); -} - static void __account_obj_stock(struct obj_cgroup *objcg, struct memcg_stock_pcp *stock, int nr, struct pglist_data *pgdat, enum node_stat_item idx) @@ -2934,7 +2923,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ - replace_stock_objcg(stock, objcg); + drain_obj_stock(stock); + obj_cgroup_get(objcg); + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) + ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; + WRITE_ONCE(stock->cached_objcg, objcg); + allow_uncharge = true; /* Allow uncharge when objcg changes */ } stock->nr_bytes += nr_bytes; From 9c1c38bcdc9215f0bd231591b662bd0fbe8b7045 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:21 +0800 Subject: [PATCH 045/285] mm: swap: rename __swap_[entry/entries]_free[_locked] to swap_[entry/entries]_put[_locked] Patch series "Minor cleanups and improvements to swap freeing code", v4. This series contains some cleanups and improvements made while learning the swapfile code. Here is a summary of the changes: 1. Function naming improvements. - Use "put" instead of "free" to name functions which only do the actual free when the count drops to zero. - Use "entry" to name functions that free only one swap slot. Use "entries" to name functions that may free multiple swap slots within one cluster. Use the "_nr" suffix to name functions that may free multiple swap slots spanning several clusters. 2. Eliminate the need to set a swap slot to the intermediate SWAP_HAS_CACHE value before the actual free by using swap_entry_range_free(). 3. Add helpers swap_entries_put_map() and swap_entries_put_cache() as general-purpose routines to free swap entries within a single cluster which will try batch-remove first and fall back to putting each entry individually with the cluster lock acquired/released only once. By using these helpers, we could remove repeated code, leverage batch-remove in more cases and avoid acquiring/releasing the cluster lock for each single swap entry. This patch (of 8): In __swap_entry_free[_locked] and __swap_entries_free, we decrease the count first and only free the swap entry if the count drops to zero. This behavior is more akin to a put() operation rather than a free() operation. Therefore, rename these functions with "put" instead of "free". Additionally, add the "_nr" suffix to swap_entries_put to indicate that the input range may span swap clusters.
Link: https://lkml.kernel.org/r/20250325162528.68385-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250325162528.68385-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index f214843612dc..ea01cf05b0cb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1355,9 +1355,9 @@ out: return NULL; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, - unsigned long offset, - unsigned char usage) +static unsigned char swap_entry_put_locked(struct swap_info_struct *si, + unsigned long offset, + unsigned char usage) { unsigned char count; unsigned char has_cache; @@ -1461,15 +1461,15 @@ put_out: return NULL; } -static unsigned char __swap_entry_free(struct swap_info_struct *si, - swp_entry_t entry) +static unsigned char swap_entry_put(struct swap_info_struct *si, + swp_entry_t entry) { struct swap_cluster_info *ci; unsigned long offset = swp_offset(entry); unsigned char usage; ci = lock_cluster(si, offset); - usage = __swap_entry_free_locked(si, offset, 1); + usage = swap_entry_put_locked(si, offset, 1); if (!usage) swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); unlock_cluster(ci); @@ -1477,8 +1477,8 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si, return usage; } -static bool __swap_entries_free(struct swap_info_struct *si, - swp_entry_t entry, int nr) +static bool swap_entries_put_nr(struct swap_info_struct *si, + swp_entry_t entry, int nr) { unsigned long offset = swp_offset(entry); unsigned int type = swp_type(entry); @@ -1509,7 +1509,7 @@ static bool __swap_entries_free(struct swap_info_struct *si, fallback: for (i = 0; i < nr; i++) { if (data_race(si->swap_map[offset + i])) { - count = __swap_entry_free(si, swp_entry(type, offset + i)); + count = swap_entry_put(si, swp_entry(type, offset + i)); if (count == SWAP_HAS_CACHE) has_cache = true; } else { @@ -1560,7 +1560,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si, ci = lock_cluster(si, offset); do { - if (!__swap_entry_free_locked(si, offset, usage)) + if (!swap_entry_put_locked(si, offset, usage)) swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); } while (++offset < end); unlock_cluster(ci); @@ -1607,7 +1607,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) swap_entry_range_free(si, ci, entry, size); else { for (int i = 0; i < size; i++, entry.val++) { - if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) + if (!swap_entry_put_locked(si, offset + i, SWAP_HAS_CACHE)) swap_entry_range_free(si, ci, entry, 1); } } @@ -1806,7 +1806,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) /* * First free all entries in the range. */ - any_only_cache = __swap_entries_free(si, entry, nr); + any_only_cache = swap_entries_put_nr(si, entry, nr); /* * Short-circuit the below loop if none of the entries had their @@ -1819,7 +1819,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) * Now go back over the range trying to reclaim the swap cache. This is * more efficient for large folios because we will only try to reclaim * the swap once per folio in the common case. 
If we do - * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the + * swap_entry_put() and __try_to_reclaim_swap() in the same loop, the * latter will get a reference and lock the folio for every individual * page but will only succeed once the swap slot for every subpage is * zero. @@ -3789,7 +3789,7 @@ outer: * into, carry if so, or else fail until a new continuation page is allocated; * when the original swap_map count is decremented from 0 with continuation, * borrow from the continuation and report whether it still holds more. - * Called while __swap_duplicate() or caller of __swap_entry_free_locked() + * Called while __swap_duplicate() or caller of swap_entry_put_locked() * holds cluster lock. */ static bool swap_count_continued(struct swap_info_struct *si, From 64944ef6a13e774546eb7635bdd26ef963335a41 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:22 +0800 Subject: [PATCH 046/285] mm: swap: enable swap_entry_range_free() to drop any kind of last ref The original VM_BUG_ON only allows swap_entry_range_free() to drop the last SWAP_HAS_CACHE ref. By allowing other kinds of last refs in the VM_BUG_ON, swap_entry_range_free() could be a more general-purpose function able to handle all kinds of last refs. Following this change, also rename swap_entry_range_free() to swap_entries_free() and update its comment accordingly. This is a preparation to use swap_entries_free() to drop more kinds of last refs other than SWAP_HAS_CACHE. [shikemeng@huaweicloud.com: add __maybe_unused attribute for swap_is_last_ref() and update comment] Link: https://lkml.kernel.org/r/20250410153908.612984-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250325162528.68385-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Tested-by: SeongJae Park Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 38 ++++++++++++++++++++++++-------------- 1 file changed, 24 insertions(+), 14 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index ea01cf05b0cb..c6b4c74622fc 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -52,9 +52,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entry_range_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); +static void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); @@ -1471,7 +1471,7 @@ static unsigned char swap_entry_put(struct swap_info_struct *si, ci = lock_cluster(si, offset); usage = swap_entry_put_locked(si, offset, 1); if (!usage) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + swap_entries_free(si, ci, swp_entry(si->type, offset), 1); unlock_cluster(ci); return usage; @@ -1501,7 +1501,7 @@ static bool swap_entries_put_nr(struct swap_info_struct *si, for (i = 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); if (!has_cache) - swap_entry_range_free(si, ci, entry, nr); + swap_entries_free(si, ci, entry, nr); unlock_cluster(ci); return has_cache; @@ -1520,12 +1520,22 @@ fallback: } /* - * Drop the last HAS_CACHE flag of swap entries, caller have to - * ensure all entries belong to the same cgroup.
+ * Check if it's the last ref of swap entry in the freeing path. + * Qualified values include 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM. */ -static void swap_entry_range_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) +static inline bool __maybe_unused swap_is_last_ref(unsigned char count) +{ + return (count == SWAP_HAS_CACHE) || (count == 1) || + (count == SWAP_MAP_SHMEM); +} + +/* + * Drop the last ref of swap entries, the caller has to ensure all entries + * belong to the same cgroup and cluster. + */ +static void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages) { unsigned long offset = swp_offset(entry); unsigned char *map = si->swap_map + offset; @@ -1538,7 +1548,7 @@ static void swap_entry_range_free(struct swap_info_struct *si, ci->count -= nr_pages; do { - VM_BUG_ON(*map != SWAP_HAS_CACHE); + VM_BUG_ON(!swap_is_last_ref(*map)); *map = 0; } while (++map < map_end); @@ -1561,7 +1571,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si, ci = lock_cluster(si, offset); do { if (!swap_entry_put_locked(si, offset, usage)) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + swap_entries_free(si, ci, swp_entry(si->type, offset), 1); } while (++offset < end); unlock_cluster(ci); } @@ -1604,11 +1614,11 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) ci = lock_cluster(si, offset); if (swap_only_has_cache(si, offset, size)) - swap_entry_range_free(si, ci, entry, size); + swap_entries_free(si, ci, entry, size); else { for (int i = 0; i < size; i++, entry.val++) { if (!swap_entry_put_locked(si, offset + i, SWAP_HAS_CACHE)) - swap_entry_range_free(si, ci, entry, 1); + swap_entries_free(si, ci, entry, 1); } } unlock_cluster(ci); From 835b868878d0127bd29b8c0009dc424a63dadffb Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:23 +0800 Subject: [PATCH 047/285] mm: swap: use swap_entries_free() to free swap entry in swap_entry_put_locked() In swap_entry_put_locked(), we will set the slot to SWAP_HAS_CACHE before using swap_entries_free() to do the actual swap entry freeing. This introduces an unnecessary intermediate state. By using swap_entries_free() in swap_entry_put_locked(), we can eliminate the need to set the slot to SWAP_HAS_CACHE. This change would make the behavior of swap_entry_put_locked() more consistent with other put() operations, which do the actual free work only after the last reference is put.
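The put semantics can be shown with a tiny userspace model (the map values and helper name are invented; the SWAP_HAS_CACHE parking mentioned in the comment is the old flow being removed):

#include <stdio.h>

static unsigned char swap_map[3] = { 2, 1, 1 };	/* toy per-slot counts */

/* toy put: drop one map reference, free the slot once it hits zero */
static void entry_put_model(int i)
{
	if (swap_map[i] == 1) {
		/*
		 * New flow: clear the slot directly. The old flow first
		 * parked it in SWAP_HAS_CACHE and freed it in a second step.
		 */
		swap_map[i] = 0;
	} else {
		swap_map[i]--;
	}
}

int main(void)
{
	entry_put_model(0);	/* 2 -> 1: still referenced, not freed */
	entry_put_model(1);	/* 1 -> 0: last ref, freed in one step */
	printf("map: %u %u %u\n", swap_map[0], swap_map[1], swap_map[2]);
	return 0;
}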
Link: https://lkml.kernel.org/r/20250325162528.68385-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Kairui Song Reviewed-by: Baoquan He Signed-off-by: Andrew Morton --- mm/swapfile.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index c6b4c74622fc..f0ba27db9b3e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1356,9 +1356,11 @@ out: } static unsigned char swap_entry_put_locked(struct swap_info_struct *si, - unsigned long offset, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned char usage) { + unsigned long offset = swp_offset(entry); unsigned char count; unsigned char has_cache; @@ -1390,7 +1392,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si, if (usage) WRITE_ONCE(si->swap_map[offset], usage); else - WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); + swap_entries_free(si, ci, entry, 1); return usage; } @@ -1469,9 +1471,7 @@ static unsigned char swap_entry_put(struct swap_info_struct *si, unsigned char usage; ci = lock_cluster(si, offset); - usage = swap_entry_put_locked(si, offset, 1); - if (!usage) - swap_entries_free(si, ci, swp_entry(si->type, offset), 1); + usage = swap_entry_put_locked(si, ci, entry, 1); unlock_cluster(ci); return usage; @@ -1570,8 +1570,8 @@ static void cluster_swap_free_nr(struct swap_info_struct *si, ci = lock_cluster(si, offset); do { - if (!swap_entry_put_locked(si, offset, usage)) - swap_entries_free(si, ci, swp_entry(si->type, offset), 1); + swap_entry_put_locked(si, ci, swp_entry(si->type, offset), + usage); } while (++offset < end); unlock_cluster(ci); } @@ -1616,10 +1616,8 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) if (swap_only_has_cache(si, offset, size)) swap_entries_free(si, ci, entry, size); else { - for (int i = 0; i < size; i++, entry.val++) { - if (!swap_entry_put_locked(si, offset + i, SWAP_HAS_CACHE)) - swap_entries_free(si, ci, entry, 1); - } + for (int i = 0; i < size; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); } unlock_cluster(ci); } From 46e0ab2c62067a8d5e62ae90088b1743af83725d Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:24 +0800 Subject: [PATCH 048/285] mm: swap: use swap_entries_free() to drop last ref count in swap_entries_put_nr() Use swap_entries_free() to directly free swap entries when the swap entries are not cached and no longer referenced, without needing to set the swap entries to the intermediate SWAP_HAS_CACHE state.
Link: https://lkml.kernel.org/r/20250325162528.68385-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index f0ba27db9b3e..f8fe507be4f3 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1498,10 +1498,11 @@ static bool swap_entries_put_nr(struct swap_info_struct *si, unlock_cluster(ci); goto fallback; } - for (i = 0; i < nr; i++) - WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); if (!has_cache) swap_entries_free(si, ci, entry, nr); + else + for (i = 0; i < nr; i++) + WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); unlock_cluster(ci); return has_cache; From f2252acf444764e7de7380201c518ee5ab6fd9da Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:25 +0800 Subject: [PATCH 049/285] mm: swap: drop last SWAP_MAP_SHMEM flag in batch in swap_entries_put_nr() SWAP_MAP_SHMEM indicates the last map from shmem. Therefore, we can drop SWAP_MAP_SHMEM in batch in a similar way to dropping the last ref count in batch. Link: https://lkml.kernel.org/r/20250325162528.68385-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index f8fe507be4f3..668684dc9efa 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -192,7 +192,7 @@ static bool swap_is_last_map(struct swap_info_struct *si, unsigned char *map_end = map + nr_pages; unsigned char count = *map; - if (swap_count(count) != 1) + if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM) return false; while (++map < map_end) { @@ -1487,7 +1487,10 @@ static bool swap_entries_put_nr(struct swap_info_struct *si, unsigned char count; int i; - if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1) + if (nr <= 1) + goto fallback; + count = swap_count(data_race(si->swap_map[offset])); + if (count != 1 && count != SWAP_MAP_SHMEM) goto fallback; /* cross into another cluster */ if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER) goto fallback; From 4d71d9062dd7b2ace56f2351b9f3f06e6c0acf81 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:26 +0800 Subject: [PATCH 050/285] mm: swap: free each cluster individually in swap_entries_put_map_nr() 1. Factor out a general swap_entries_put_map() helper to drop entries belonging to one cluster. If the entries are the last map references, free them in batch; otherwise, put the entries with the cluster lock acquired and released only once. 2. Iterate and call swap_entries_put_map() for each cluster in swap_entries_put_nr() to leverage batch-remove for the last map belonging to one cluster and reduce lock acquires/releases in the fallback case. 3. As swap_entries_put_nr() won't handle the SWAP_HAS_CACHE drop, rename it to swap_entries_put_map_nr(). 4. As we won't drop each entry individually with swap_entry_put() now, do the reclaim in free_swap_and_cache_nr(), because swap_entries_put_map_nr() is a general routine to drop references and the reclaim work should only be done in free_swap_and_cache_nr(). Remove the stale comment accordingly.
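The per-cluster splitting in points 1 and 2 can be traced with a short userspace sketch (assuming a cluster size of 512 entries for illustration; the kernel's SWAPFILE_CLUSTER depends on configuration, and the offsets below are made up):

#include <stdio.h>

#define TOY_CLUSTER 512	/* illustrative stand-in for SWAPFILE_CLUSTER */

int main(void)
{
	unsigned long offset = 1020;	/* made-up start offset */
	int nr = 10;			/* spans the boundary at 1024 */
	int cluster_rest = TOY_CLUSTER - offset % TOY_CLUSTER;

	while (nr) {
		int cluster_nr = nr < cluster_rest ? nr : cluster_rest;

		/* one swap_entries_put_map()-style call per cluster */
		printf("put %d entries at offset %lu\n", cluster_nr, offset);
		offset += cluster_nr;
		nr -= cluster_nr;
		cluster_rest = TOY_CLUSTER;
	}
	return 0;
}

This prints "put 4 entries at offset 1020" followed by "put 6 entries at offset 1024", matching the two per-cluster calls the helper would make.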
Link: https://lkml.kernel.org/r/20250325162528.68385-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 70 +++++++++++++++++++++++---------------------------- 1 file changed, 32 insertions(+), 38 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 668684dc9efa..4f4fc74239d6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1463,25 +1463,10 @@ put_out: return NULL; } -static unsigned char swap_entry_put(struct swap_info_struct *si, - swp_entry_t entry) -{ - struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); - unsigned char usage; - - ci = lock_cluster(si, offset); - usage = swap_entry_put_locked(si, ci, entry, 1); - unlock_cluster(ci); - - return usage; -} - -static bool swap_entries_put_nr(struct swap_info_struct *si, - swp_entry_t entry, int nr) +static bool swap_entries_put_map(struct swap_info_struct *si, + swp_entry_t entry, int nr) { unsigned long offset = swp_offset(entry); - unsigned int type = swp_type(entry); struct swap_cluster_info *ci; bool has_cache = false; unsigned char count; @@ -1492,14 +1477,10 @@ static bool swap_entries_put_nr(struct swap_info_struct *si, count = swap_count(data_race(si->swap_map[offset])); if (count != 1 && count != SWAP_MAP_SHMEM) goto fallback; - /* cross into another cluster */ - if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER) - goto fallback; ci = lock_cluster(si, offset); if (!swap_is_last_map(si, offset, nr, &has_cache)) { - unlock_cluster(ci); - goto fallback; + goto locked_fallback; } if (!has_cache) swap_entries_free(si, ci, entry, nr); @@ -1511,15 +1492,34 @@ static bool swap_entries_put_nr(struct swap_info_struct *si, return has_cache; fallback: - for (i = 0; i < nr; i++) { - if (data_race(si->swap_map[offset + i])) { - count = swap_entry_put(si, swp_entry(type, offset + i)); - if (count == SWAP_HAS_CACHE) - has_cache = true; - } else { - WARN_ON_ONCE(1); - } + ci = lock_cluster(si, offset); +locked_fallback: + for (i = 0; i < nr; i++, entry.val++) { + count = swap_entry_put_locked(si, ci, entry, 1); + if (count == SWAP_HAS_CACHE) + has_cache = true; } + unlock_cluster(ci); + return has_cache; + +} + +static bool swap_entries_put_map_nr(struct swap_info_struct *si, + swp_entry_t entry, int nr) +{ + int cluster_nr, cluster_rest; + unsigned long offset = swp_offset(entry); + bool has_cache = false; + + cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER; + while (nr) { + cluster_nr = min(nr, cluster_rest); + has_cache |= swap_entries_put_map(si, entry, cluster_nr); + cluster_rest = SWAPFILE_CLUSTER; + nr -= cluster_nr; + entry.val += cluster_nr; + } + return has_cache; } @@ -1818,7 +1818,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) /* * First free all entries in the range. */ - any_only_cache = swap_entries_put_nr(si, entry, nr); + any_only_cache = swap_entries_put_map_nr(si, entry, nr); /* * Short-circuit the below loop if none of the entries had their @@ -1828,13 +1828,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) goto out; /* - * Now go back over the range trying to reclaim the swap cache. This is - * more efficient for large folios because we will only try to reclaim - * the swap once per folio in the common case. If we do - * swap_entry_put() and __try_to_reclaim_swap() in the same loop, the - * latter will get a reference and lock the folio for every individual - * page but will only succeed once the swap slot for every subpage is - * zero. 
+ * Now go back over the range trying to reclaim the swap cache. */ for (offset = start_offset; offset < end_offset; offset += nr) { nr = 1; From d4f8000bd6b0cc33c9dddd369e72a13c4c080cb1 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:27 +0800 Subject: [PATCH 051/285] mm: swap: factor out helper to drop cache of entries within a single cluster Factor out helper swap_entries_put_cache() from put_swap_folio() to serve as a general-purpose routine for dropping cache flag of entries within a single cluster. Link: https://lkml.kernel.org/r/20250325162528.68385-8-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 4f4fc74239d6..953dcd99006e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1463,6 +1463,22 @@ put_out: return NULL; } +static void swap_entries_put_cache(struct swap_info_struct *si, + swp_entry_t entry, int nr) +{ + unsigned long offset = swp_offset(entry); + struct swap_cluster_info *ci; + + ci = lock_cluster(si, offset); + if (swap_only_has_cache(si, offset, nr)) + swap_entries_free(si, ci, entry, nr); + else { + for (int i = 0; i < nr; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + } + unlock_cluster(ci); +} + static bool swap_entries_put_map(struct swap_info_struct *si, swp_entry_t entry, int nr) { @@ -1607,8 +1623,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) */ void put_swap_folio(struct folio *folio, swp_entry_t entry) { - unsigned long offset = swp_offset(entry); - struct swap_cluster_info *ci; struct swap_info_struct *si; int size = 1 << swap_entry_order(folio_order(folio)); @@ -1616,14 +1630,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) if (!si) return; - ci = lock_cluster(si, offset); - if (swap_only_has_cache(si, offset, size)) - swap_entries_free(si, ci, entry, size); - else { - for (int i = 0; i < size; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - } - unlock_cluster(ci); + swap_entries_put_cache(si, entry, size); } int __swap_count(swp_entry_t entry) From ec9827cd28b13b88517812eb08b13d0ed97ae8f1 Mon Sep 17 00:00:00 2001 From: Kemeng Shi Date: Wed, 26 Mar 2025 00:25:28 +0800 Subject: [PATCH 052/285] mm: swap: replace cluster_swap_free_nr() with swap_entries_put_[map/cache]() Replace cluster_swap_free_nr() with swap_entries_put_[map/cache]() to remove repeat code and leverage batch-remove for entries with last flag. After removing cluster_swap_free_nr, only functions with "_nr" suffix could free entries spanning cross clusters. Add corresponding description in comment of swap_entries_put_map_nr() as is first function with "_nr" suffix and have a non-suffix variant function swap_entries_put_map(). 
Link: https://lkml.kernel.org/r/20250325162528.68385-9-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi Reviewed-by: Tim Chen Reviewed-by: Baoquan He Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 30 +++++++++++------------------- 1 file changed, 11 insertions(+), 19 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 953dcd99006e..b86637cfb17a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1520,6 +1520,11 @@ locked_fallback: } +/* + * Only functions with the "_nr" suffix are able to free entries spanning + * multiple clusters, so ensure the range is within a single cluster + * when freeing entries with functions without the "_nr" suffix. + */ static bool swap_entries_put_map_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { @@ -1581,21 +1586,6 @@ static void swap_entries_free(struct swap_info_struct *si, partial_free_cluster(si, ci); } -static void cluster_swap_free_nr(struct swap_info_struct *si, - unsigned long offset, int nr_pages, - unsigned char usage) -{ - struct swap_cluster_info *ci; - unsigned long end = offset + nr_pages; - - ci = lock_cluster(si, offset); - do { - swap_entry_put_locked(si, ci, swp_entry(si->type, offset), - usage); - } while (++offset < end); - unlock_cluster(ci); -} - /* * Caller has made sure that the swap device corresponding to entry * is still around or has not been recycled. @@ -1612,7 +1602,7 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) while (nr_pages) { nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - cluster_swap_free_nr(sis, offset, nr, 1); + swap_entries_put_map(sis, swp_entry(sis->type, offset), nr); offset += nr; nr_pages -= nr; } @@ -3658,11 +3648,13 @@ int swapcache_prepare(swp_entry_t entry, int nr) return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); } +/* + * Caller should ensure entries belong to the same folio so + * the entries won't cross a cluster boundary. + */ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) { - unsigned long offset = swp_offset(entry); - - cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); + swap_entries_put_cache(si, entry, nr); } struct swap_info_struct *swp_swap_info(swp_entry_t entry) From f4d1c32489117c9d38206a673d880d23d7d3bb8a Mon Sep 17 00:00:00 2001 From: SoumishDas Date: Tue, 25 Mar 2025 23:43:25 +0530 Subject: [PATCH 053/285] mm: add kernel-doc comment for free_pgd_range() Provide kernel-doc for free_pgd_range() so it's easier to understand what the function does and how it is used. Link: https://lkml.kernel.org/r/20250325181325.5774-1-soumish.das@gmail.com Signed-off-by: SoumishDas Signed-off-by: Andrew Morton --- mm/memory.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 86e7e66e3c5b..f41fac7118ba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -278,8 +278,17 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, p4d_free_tlb(tlb, p4d, start); } -/* - * This function frees user-level page tables of a process. +/** + * free_pgd_range - Unmap and free page tables in the range + * @tlb: the mmu_gather containing pending TLB flush info + * @addr: virtual address start + * @end: virtual address end + * @floor: lowest address boundary + * @ceiling: highest address boundary + * + * This function tears down all user-level page tables in the + * specified virtual address range [@addr..@end). It is part of + * the memory unmap flow.
From 87a929ae4fb4709c577e1d41d6fb13f567a4aa03 Mon Sep 17 00:00:00 2001
From: "Dmitry V. Levin"
Date: Mon, 3 Mar 2025 13:19:53 +0200
Subject: [PATCH 054/285] hexagon: add syscall_set_return_value()

Patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API", v7.

PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.

This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.

As in the case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also
does not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.

The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.

Support for changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set.  Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.

PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.

Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen.  The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was the apparent failure
to provide an API for changing the first system call argument on the riscv
architecture [1].

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
	Modify information about the system call that caused the stop.
	The "data" argument is a pointer to struct ptrace_syscall_info
	that specifies the system call information to be set.
	The "addr" argument should be set to
	sizeof(struct ptrace_syscall_info).

This patch (of 6):

hexagon is the only architecture that provides HAVE_ARCH_TRACEHOOK but
doesn't define syscall_set_return_value().  Since this function is going
to be needed on all HAVE_ARCH_TRACEHOOK architectures to implement
PTRACE_SET_SYSCALL_INFO API, add it on hexagon, too.

Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/ [1]
Link: https://lkml.kernel.org/r/20250303111953.GB24170@strace.io
Signed-off-by: Dmitry V. Levin
Cc: Alexander Gordeev Cc: Alexey Gladkov (Intel) Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Charlie Jenkins Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. Miller Cc: Dinh Nguyen Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Maciej W.
Rozycki Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Oleg Nesterov Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato Cc: Eugene Syromiatnikov Signed-off-by: Andrew Morton --- arch/hexagon/include/asm/syscall.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index f6e454f18038..951ca0ed8376 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -45,6 +45,13 @@ static inline long syscall_get_return_value(struct task_struct *task, return regs->r00; } +static inline void syscall_set_return_value(struct task_struct *task, + struct pt_regs *regs, + int error, long val) +{ + regs->r00 = (long) error ?: val; +} + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_HEXAGON; From 17fc7b8f9bce5d3d61ef347dd8cfccb6365dcaa1 Mon Sep 17 00:00:00 2001 From: "Dmitry V. Levin" Date: Mon, 3 Mar 2025 13:20:09 +0200 Subject: [PATCH 055/285] syscall.h: add syscall_set_arguments() This function is going to be needed on all HAVE_ARCH_TRACEHOOK architectures to implement PTRACE_SET_SYSCALL_INFO API. This partially reverts commit 7962c2eddbfe ("arch: remove unused function syscall_set_arguments()") by reusing some of old syscall_set_arguments() implementations. [nathan@kernel.org: fix compile time fortify checks] Link: https://lkml.kernel.org/r/20250408213131.GA2872426@ax162 Link: https://lkml.kernel.org/r/20250303112009.GC24170@strace.io Signed-off-by: Dmitry V. Levin Signed-off-by: Nathan Chancellor Tested-by: Charlie Jenkins Reviewed-by: Charlie Jenkins Acked-by: Helge Deller # parisc Reviewed-by: Maciej W. Rozycki [mips] Cc: Alexander Gordeev Cc: Alexey Gladkov (Intel) Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. Miller Cc: Dinh Nguyen Cc: Eugene Syromiatnikov Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: "H. 
Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Oleg Nesterov Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- arch/arc/include/asm/syscall.h | 14 +++++++++++ arch/arm/include/asm/syscall.h | 13 ++++++++++ arch/arm64/include/asm/syscall.h | 13 ++++++++++ arch/csky/include/asm/syscall.h | 13 ++++++++++ arch/hexagon/include/asm/syscall.h | 7 ++++++ arch/loongarch/include/asm/syscall.h | 8 ++++++ arch/mips/include/asm/syscall.h | 28 +++++++++++++++++++++ arch/nios2/include/asm/syscall.h | 11 ++++++++ arch/openrisc/include/asm/syscall.h | 7 ++++++ arch/parisc/include/asm/syscall.h | 12 +++++++++ arch/powerpc/include/asm/syscall.h | 10 ++++++++ arch/riscv/include/asm/syscall.h | 12 +++++++++ arch/s390/include/asm/syscall.h | 9 +++++++ arch/sh/include/asm/syscall_32.h | 12 +++++++++ arch/sparc/include/asm/syscall.h | 10 ++++++++ arch/um/include/asm/syscall-generic.h | 14 +++++++++++ arch/x86/include/asm/syscall.h | 36 +++++++++++++++++++++++++++ arch/xtensa/include/asm/syscall.h | 11 ++++++++ include/asm-generic/syscall.h | 16 ++++++++++++ 19 files changed, 256 insertions(+) diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 9709256e31c8..89c1e1736356 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -67,6 +67,20 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } } +static inline void +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, + unsigned long *args) +{ + unsigned long *inside_ptregs = ®s->r0; + unsigned int n = 6; + unsigned int i = 0; + + while (n--) { + *inside_ptregs = args[i++]; + inside_ptregs--; + } +} + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index fe4326d938c1..21927fa0ae2b 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -80,6 +80,19 @@ static inline void syscall_get_arguments(struct task_struct *task, memcpy(args, ®s->ARM_r0 + 1, 5 * sizeof(args[0])); } +static inline void syscall_set_arguments(struct task_struct *task, + struct pt_regs *regs, + const unsigned long *args) +{ + memcpy(®s->ARM_r0, args, 6 * sizeof(args[0])); + /* + * Also copy the first argument into ARM_ORIG_r0 + * so that syscall_get_arguments() would return it + * instead of the previous value. + */ + regs->ARM_ORIG_r0 = regs->ARM_r0; +} + static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. 
	 */

diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index ab8e14b96f68..76020b66286b 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -73,6 +73,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &regs->regs[1], 5 * sizeof(args[0]));
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	memcpy(&regs->regs[0], args, 6 * sizeof(args[0]));
+	/*
+	 * Also copy the first argument into orig_x0
+	 * so that syscall_get_arguments() would return it
+	 * instead of the previous value.
+	 */
+	regs->orig_x0 = regs->regs[0];
+}
+
 /*
  * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
  * AArch64 has the same system calls both on little- and big- endian.

diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index 0de5734950bf..717f44b4d26f 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -59,6 +59,19 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
 }

+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+		      const unsigned long *args)
+{
+	memcpy(&regs->a0, args, 6 * sizeof(regs->a0));
+	/*
+	 * Also copy the first argument into orig_a0
+	 * so that syscall_get_arguments() would return it
+	 * instead of the previous value.
+	 */
+	regs->orig_a0 = regs->a0;
+}
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {

diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index 951ca0ed8376..1024a6548d78 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -33,6 +33,13 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &(&regs->r00)[0], 6 * sizeof(args[0]));
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	memcpy(&(&regs->r00)[0], args, 6 * sizeof(args[0]));
+}
+
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {

diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index e286dc58476e..ff415b3c0a8e 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -61,6 +61,14 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	regs->orig_a0 = args[0];
+	memcpy(&regs->regs[5], &args[1], 5 * sizeof(long));
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_LOONGARCH64;

diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 056aa1b713e2..f1926ce30d4b 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -74,6 +74,23 @@ static inline void mips_get_syscall_arg(unsigned long *arg,
 #endif
 }

+static inline void mips_set_syscall_arg(unsigned long *arg,
+	struct task_struct *task, struct pt_regs *regs, unsigned int n)
+{
+#ifdef CONFIG_32BIT
+	switch (n) {
+	case 0: case 1: case 2: case 3:
+		regs->regs[4 + n] = *arg;
+		return;
+	case 4: case 5: case 6: case 7:
+		regs->args[n] = *arg;
+		return;
+	}
+#else
+	regs->regs[4 + n] = *arg;
+#endif
+}
+
 static inline long syscall_get_error(struct task_struct *task,
				     struct pt_regs
				     *regs)
{
@@ -120,6 +137,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
 		mips_get_syscall_arg(args++, task, regs, i++);
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	unsigned int i = 0;
+	unsigned int n = 6;
+
+	while (n--)
+		mips_set_syscall_arg(args++, task, regs, i++);
+}
+
 extern const unsigned long sys_call_table[];
 extern const unsigned long sys32_call_table[];
 extern const unsigned long sysn32_call_table[];

diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index fff52205fb65..526449edd768 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -58,6 +58,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	*args = regs->r9;
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+	struct pt_regs *regs, const unsigned long *args)
+{
+	regs->r4 = *args++;
+	regs->r5 = *args++;
+	regs->r6 = *args++;
+	regs->r7 = *args++;
+	regs->r8 = *args++;
+	regs->r9 = *args;
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_NIOS2;

diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index 903ed882bdec..e6383be2a195 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -57,6 +57,13 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(args, &regs->gpr[3], 6 * sizeof(args[0]));
 }

+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+		      const unsigned long *args)
+{
+	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_OPENRISC;

diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index 00b127a5e09b..b146d0ae4c77 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -29,6 +29,18 @@ static inline void syscall_get_arguments(struct task_struct *tsk,
 	args[0] = regs->gr[26];
 }

+static inline void syscall_set_arguments(struct task_struct *tsk,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	regs->gr[21] = args[5];
+	regs->gr[22] = args[4];
+	regs->gr[23] = args[3];
+	regs->gr[24] = args[2];
+	regs->gr[25] = args[1];
+	regs->gr[26] = args[0];
+}
+
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {

diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 3dd36c5e334a..b2715448a660 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -110,6 +110,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	}
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
+
+	/* Also copy the first argument into orig_gpr3 */
+	regs->orig_gpr3 = args[0];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	if (is_tsk_32bit_task(task))

diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index eceabf59ae48..da56417b6705 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -69,6 +69,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[5] = regs->a5;
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->orig_a0 = args[0];
+	regs->a1 = args[1];
+	regs->a2 = args[2];
+	regs->a3 = args[3];
+	regs->a4 = args[4];
+	regs->a5 = args[5];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_64BIT

diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index 0213ec800b57..b87d8bb2cbaa 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -76,6 +76,15 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[0] = regs->orig_gpr2 & mask;
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->orig_gpr2 = args[0];
+	for (int n = 1; n < 6; n++)
+		regs->gprs[2 + n] = args[n];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_COMPAT

diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index d87738eebe30..cb51a7528384 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -57,6 +57,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[0] = regs->regs[4];
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->regs[1] = args[5];
+	regs->regs[0] = args[4];
+	regs->regs[7] = args[3];
+	regs->regs[6] = args[2];
+	regs->regs[5] = args[1];
+	regs->regs[4] = args[0];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_SH;

diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 20c109ac8cc9..62a5a78804c4 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -117,6 +117,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	}
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	unsigned int i;
+
+	for (i = 0; i < 6; i++)
+		regs->u_regs[UREG_I0 + i] = args[i];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)

diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
index 172b74143c4b..2984feb9d576 100644
--- a/arch/um/include/asm/syscall-generic.h
+++ b/arch/um/include/asm/syscall-generic.h
@@ -62,6 +62,20 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	*args = UPT_SYSCALL_ARG6(r);
 }

+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	struct uml_pt_regs *r = &regs->regs;
+
+	UPT_SYSCALL_ARG1(r) = *args++;
+	UPT_SYSCALL_ARG2(r) = *args++;
+	UPT_SYSCALL_ARG3(r) = *args++;
+	UPT_SYSCALL_ARG4(r) = *args++;
+	UPT_SYSCALL_ARG5(r) = *args++;
+	UPT_SYSCALL_ARG6(r) = *args;
+}
+
 /* See arch/x86/um/asm/syscall.h for syscall_get_arch() definition.
*/ #endif /* __UM_SYSCALL_GENERIC_H */ diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 7c488ff0c764..b9c249dd9e3d 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -90,6 +90,18 @@ static inline void syscall_get_arguments(struct task_struct *task, args[5] = regs->bp; } +static inline void syscall_set_arguments(struct task_struct *task, + struct pt_regs *regs, + const unsigned long *args) +{ + regs->bx = args[0]; + regs->cx = args[1]; + regs->dx = args[2]; + regs->si = args[3]; + regs->di = args[4]; + regs->bp = args[5]; +} + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_I386; @@ -121,6 +133,30 @@ static inline void syscall_get_arguments(struct task_struct *task, } } +static inline void syscall_set_arguments(struct task_struct *task, + struct pt_regs *regs, + const unsigned long *args) +{ +# ifdef CONFIG_IA32_EMULATION + if (task->thread_info.status & TS_COMPAT) { + regs->bx = *args++; + regs->cx = *args++; + regs->dx = *args++; + regs->si = *args++; + regs->di = *args++; + regs->bp = *args; + } else +# endif + { + regs->di = *args++; + regs->si = *args++; + regs->dx = *args++; + regs->r10 = *args++; + regs->r8 = *args++; + regs->r9 = *args; + } +} + static inline int syscall_get_arch(struct task_struct *task) { /* x32 tasks should be considered AUDIT_ARCH_X86_64. */ diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index 5ee974bf8330..f9a671cbf933 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -68,6 +68,17 @@ static inline void syscall_get_arguments(struct task_struct *task, args[i] = regs->areg[reg[i]]; } +static inline void syscall_set_arguments(struct task_struct *task, + struct pt_regs *regs, + const unsigned long *args) +{ + static const unsigned int reg[] = XTENSA_SYSCALL_ARGUMENT_REGS; + unsigned int i; + + for (i = 0; i < 6; ++i) + regs->areg[reg[i]] = args[i]; +} + asmlinkage long xtensa_rt_sigreturn(void); asmlinkage long xtensa_shmat(int, char __user *, int); asmlinkage long xtensa_fadvise64_64(int, int, diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h index 182b039ce5fa..292b412f4e9a 100644 --- a/include/asm-generic/syscall.h +++ b/include/asm-generic/syscall.h @@ -117,6 +117,22 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, unsigned long *args); +/** + * syscall_set_arguments - change system call parameter value + * @task: task of interest, must be in system call entry tracing + * @regs: task_pt_regs() of @task + * @args: array of argument values to store + * + * Changes 6 arguments to the system call. + * The first argument gets value @args[0], and so on. + * + * It's only valid to call this when @task is stopped for tracing on + * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or + * %SYSCALL_WORK_SYSCALL_AUDIT. + */ +void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, + const unsigned long *args); + /** * syscall_get_arch - return the AUDIT_ARCH for the current system call * @task: task of interest, must be blocked From cc6622730be77fa88acc4fb0942cd39e6fa5ca27 Mon Sep 17 00:00:00 2001 From: "Dmitry V. 
Levin" Date: Mon, 3 Mar 2025 13:20:20 +0200 Subject: [PATCH 056/285] syscall.h: introduce syscall_set_nr() Similar to syscall_set_arguments() that complements syscall_get_arguments(), introduce syscall_set_nr() that complements syscall_get_nr(). syscall_set_nr() is going to be needed along with syscall_set_arguments() on all HAVE_ARCH_TRACEHOOK architectures to implement PTRACE_SET_SYSCALL_INFO API. Link: https://lkml.kernel.org/r/20250303112020.GD24170@strace.io Signed-off-by: Dmitry V. Levin Tested-by: Charlie Jenkins Reviewed-by: Charlie Jenkins Acked-by: Helge Deller # parisc Reviewed-by: Maciej W. Rozycki # mips Cc: Alexander Gordeev Cc: Alexey Gladkov (Intel) Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. Miller Cc: Dinh Nguyen Cc: Eugene Syromiatnikov Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Oleg Nesterov Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- arch/arc/include/asm/syscall.h | 11 +++++++++++ arch/arm/include/asm/syscall.h | 24 ++++++++++++++++++++++++ arch/arm64/include/asm/syscall.h | 16 ++++++++++++++++ arch/hexagon/include/asm/syscall.h | 7 +++++++ arch/loongarch/include/asm/syscall.h | 7 +++++++ arch/m68k/include/asm/syscall.h | 7 +++++++ arch/microblaze/include/asm/syscall.h | 7 +++++++ arch/mips/include/asm/syscall.h | 15 +++++++++++++++ arch/nios2/include/asm/syscall.h | 5 +++++ arch/openrisc/include/asm/syscall.h | 6 ++++++ arch/parisc/include/asm/syscall.h | 7 +++++++ arch/powerpc/include/asm/syscall.h | 10 ++++++++++ arch/riscv/include/asm/syscall.h | 7 +++++++ arch/s390/include/asm/syscall.h | 12 ++++++++++++ arch/sh/include/asm/syscall_32.h | 12 ++++++++++++ arch/sparc/include/asm/syscall.h | 12 ++++++++++++ arch/um/include/asm/syscall-generic.h | 5 +++++ arch/x86/include/asm/syscall.h | 7 +++++++ arch/xtensa/include/asm/syscall.h | 7 +++++++ include/asm-generic/syscall.h | 14 ++++++++++++++ 20 files changed, 198 insertions(+) diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 89c1e1736356..728d625a10f1 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -23,6 +23,17 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return -1; } +static inline void +syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr) +{ + /* + * Unlike syscall_get_nr(), syscall_set_nr() can be called only when + * the target task is stopped for tracing on entering syscall, so + * there is no need to have the same check syscall_get_nr() has. 
+ */ + regs->r8 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index 21927fa0ae2b..18b102a30741 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -68,6 +68,30 @@ static inline void syscall_set_return_value(struct task_struct *task, regs->ARM_r0 = (long) error ? error : val; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + if (nr == -1) { + task_thread_info(task)->abi_syscall = -1; + /* + * When the syscall number is set to -1, the syscall will be + * skipped. In this case the syscall return value has to be + * set explicitly, otherwise the first syscall argument is + * returned as the syscall return value. + */ + syscall_set_return_value(task, regs, -ENOSYS, 0); + return; + } + if ((IS_ENABLED(CONFIG_AEABI) && !IS_ENABLED(CONFIG_OABI_COMPAT))) { + task_thread_info(task)->abi_syscall = nr; + return; + } + task_thread_info(task)->abi_syscall = + (task_thread_info(task)->abi_syscall & ~__NR_SYSCALL_MASK) | + (nr & __NR_SYSCALL_MASK); +} + #define SYSCALL_MAX_ARGS 7 static inline void syscall_get_arguments(struct task_struct *task, diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index 76020b66286b..712daa90e643 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -61,6 +61,22 @@ static inline void syscall_set_return_value(struct task_struct *task, regs->regs[0] = val; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->syscallno = nr; + if (nr == -1) { + /* + * When the syscall number is set to -1, the syscall will be + * skipped. In this case the syscall return value has to be + * set explicitly, otherwise the first syscall argument is + * returned as the syscall return value. 
+ */ + syscall_set_return_value(task, regs, -ENOSYS, 0); + } +} + #define SYSCALL_MAX_ARGS 6 static inline void syscall_get_arguments(struct task_struct *task, diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index 1024a6548d78..70637261817a 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task, return regs->r06; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->r06 = nr; +} + static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, unsigned long *args) diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h index ff415b3c0a8e..81d2733f7b94 100644 --- a/arch/loongarch/include/asm/syscall.h +++ b/arch/loongarch/include/asm/syscall.h @@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task, return regs->regs[11]; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->regs[11] = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h index d1453e850cdd..bf84b160c2eb 100644 --- a/arch/m68k/include/asm/syscall.h +++ b/arch/m68k/include/asm/syscall.h @@ -14,6 +14,13 @@ static inline int syscall_get_nr(struct task_struct *task, return regs->orig_d0; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->orig_d0 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 5eb3f624cc59..b5b6b91fae3e 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -14,6 +14,13 @@ static inline long syscall_get_nr(struct task_struct *task, return regs->r12; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->r12 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index f1926ce30d4b..d19e67e2aa6a 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -41,6 +41,21 @@ static inline long syscall_get_nr(struct task_struct *task, return task_thread_info(task)->syscall; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + /* + * New syscall number has to be assigned to regs[2] because + * it is loaded from there unconditionally after return from + * syscall_trace_enter() invocation. + * + * Consequently, if the syscall was indirect and nr != __NR_syscall, + * then after this assignment the syscall will cease to be indirect. 
+ */ + task_thread_info(task)->syscall = regs->regs[2] = nr; +} + static inline void mips_syscall_update_nr(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 526449edd768..8e3eb1d689bb 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -15,6 +15,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return regs->r2; } +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr) +{ + regs->r2 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index e6383be2a195..5e037d9659c5 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -25,6 +25,12 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return regs->orig_gpr11; } +static inline void +syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr) +{ + regs->orig_gpr11 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index b146d0ae4c77..c11222798ab2 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -17,6 +17,13 @@ static inline long syscall_get_nr(struct task_struct *tsk, return regs->gr[20]; } +static inline void syscall_set_nr(struct task_struct *tsk, + struct pt_regs *regs, + int nr) +{ + regs->gr[20] = nr; +} + static inline void syscall_get_arguments(struct task_struct *tsk, struct pt_regs *regs, unsigned long *args) diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index b2715448a660..4b3c52ed6e9d 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -39,6 +39,16 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return -1; } +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr) +{ + /* + * Unlike syscall_get_nr(), syscall_set_nr() can be called only when + * the target task is stopped for tracing on entering syscall, so + * there is no need to have the same check syscall_get_nr() has. 
+ */ + regs->gpr[0] = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index da56417b6705..34313387f977 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -30,6 +30,13 @@ static inline int syscall_get_nr(struct task_struct *task, return regs->a7; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->a7 = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index b87d8bb2cbaa..bd4cb00ccd5e 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -24,6 +24,18 @@ static inline long syscall_get_nr(struct task_struct *task, (regs->int_code & 0xffff) : -1; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + /* + * Unlike syscall_get_nr(), syscall_set_nr() can be called only when + * the target task is stopped for tracing on entering syscall, so + * there is no need to have the same check syscall_get_nr() has. + */ + regs->int_code = (regs->int_code & ~0xffff) | (nr & 0xffff); +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index cb51a7528384..7027d87d901d 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -15,6 +15,18 @@ static inline long syscall_get_nr(struct task_struct *task, return (regs->tra >= 0) ? regs->regs[3] : -1L; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + /* + * Unlike syscall_get_nr(), syscall_set_nr() can be called only when + * the target task is stopped for tracing on entering syscall, so + * there is no need to have the same check syscall_get_nr() has. + */ + regs->regs[3] = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 62a5a78804c4..b0233924d323 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -25,6 +25,18 @@ static inline long syscall_get_nr(struct task_struct *task, return (syscall_p ? regs->u_regs[UREG_G1] : -1L); } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + /* + * Unlike syscall_get_nr(), syscall_set_nr() can be called only when + * the target task is stopped for tracing on entering syscall, so + * there is no need to have the same check syscall_get_nr() has. 
+ */ + regs->u_regs[UREG_G1] = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h index 2984feb9d576..bcd73bcfe577 100644 --- a/arch/um/include/asm/syscall-generic.h +++ b/arch/um/include/asm/syscall-generic.h @@ -21,6 +21,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return PT_REGS_SYSCALL_NR(regs); } +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr) +{ + PT_REGS_SYSCALL_NR(regs) = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index b9c249dd9e3d..c10dbb74cd00 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -38,6 +38,13 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs) return regs->orig_ax; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->orig_ax = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index f9a671cbf933..7db3b489c8ad 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -28,6 +28,13 @@ static inline long syscall_get_nr(struct task_struct *task, return regs->syscall; } +static inline void syscall_set_nr(struct task_struct *task, + struct pt_regs *regs, + int nr) +{ + regs->syscall = nr; +} + static inline void syscall_rollback(struct task_struct *task, struct pt_regs *regs) { diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h index 292b412f4e9a..c5a3ad53beec 100644 --- a/include/asm-generic/syscall.h +++ b/include/asm-generic/syscall.h @@ -37,6 +37,20 @@ struct pt_regs; */ int syscall_get_nr(struct task_struct *task, struct pt_regs *regs); +/** + * syscall_set_nr - change the system call a task is executing + * @task: task of interest, must be blocked + * @regs: task_pt_regs() of @task + * @nr: system call number + * + * Changes the system call number @task is about to execute. + * + * It's only valid to call this when @task is stopped for tracing on + * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or + * %SYSCALL_WORK_SYSCALL_AUDIT. + */ +void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr); + /** * syscall_rollback - roll back registers after an aborted system call * @task: task of interest, must be in system call exit tracing From c354ec9cee909136e168f4f4900d56a491c4d6c5 Mon Sep 17 00:00:00 2001 From: "Dmitry V. Levin" Date: Mon, 3 Mar 2025 13:20:38 +0200 Subject: [PATCH 057/285] ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op Move the code that calculates the type of the system call stop out of ptrace_get_syscall_info() into a separate function ptrace_get_syscall_info_op() which is going to be used later to implement PTRACE_SET_SYSCALL_INFO API. Link: https://lkml.kernel.org/r/20250303112038.GE24170@strace.io Signed-off-by: Dmitry V. Levin Reviewed-by: Oleg Nesterov Cc: Alexander Gordeev Cc: Alexey Gladkov (Intel) Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Charlie Jenkins Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. 
Miller Cc: Dinh Nguyen Cc: Eugene Syromiatnikov Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Maciej W. Rozycki Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- kernel/ptrace.c | 58 +++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 24 deletions(-) diff --git a/kernel/ptrace.c b/kernel/ptrace.c index d5f89f9ef29f..22e7d74cf4cd 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -921,7 +921,6 @@ ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs, unsigned long args[ARRAY_SIZE(info->entry.args)]; int i; - info->op = PTRACE_SYSCALL_INFO_ENTRY; info->entry.nr = syscall_get_nr(child, regs); syscall_get_arguments(child, regs, args); for (i = 0; i < ARRAY_SIZE(args); i++) @@ -943,7 +942,6 @@ ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs, * diverge significantly enough. */ ptrace_get_syscall_info_entry(child, regs, info); - info->op = PTRACE_SYSCALL_INFO_SECCOMP; info->seccomp.ret_data = child->ptrace_message; /* ret_data is the last field in struct ptrace_syscall_info.seccomp */ @@ -954,7 +952,6 @@ static unsigned long ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { - info->op = PTRACE_SYSCALL_INFO_EXIT; info->exit.rval = syscall_get_error(child, regs); info->exit.is_error = !!info->exit.rval; if (!info->exit.is_error) @@ -965,19 +962,8 @@ ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs, } static int -ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size, - void __user *datavp) +ptrace_get_syscall_info_op(struct task_struct *child) { - struct pt_regs *regs = task_pt_regs(child); - struct ptrace_syscall_info info = { - .op = PTRACE_SYSCALL_INFO_NONE, - .arch = syscall_get_arch(child), - .instruction_pointer = instruction_pointer(regs), - .stack_pointer = user_stack_pointer(regs), - }; - unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry); - unsigned long write_size; - /* * This does not need lock_task_sighand() to access * child->last_siginfo because ptrace_freeze_traced() @@ -988,18 +974,42 @@ ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size, case SIGTRAP | 0x80: switch (child->ptrace_message) { case PTRACE_EVENTMSG_SYSCALL_ENTRY: - actual_size = ptrace_get_syscall_info_entry(child, regs, - &info); - break; + return PTRACE_SYSCALL_INFO_ENTRY; case PTRACE_EVENTMSG_SYSCALL_EXIT: - actual_size = ptrace_get_syscall_info_exit(child, regs, - &info); - break; + return PTRACE_SYSCALL_INFO_EXIT; + default: + return PTRACE_SYSCALL_INFO_NONE; } - break; case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8): - actual_size = ptrace_get_syscall_info_seccomp(child, regs, - &info); + return PTRACE_SYSCALL_INFO_SECCOMP; + default: + return PTRACE_SYSCALL_INFO_NONE; + } +} + +static int +ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size, + void __user *datavp) +{ + struct pt_regs *regs = task_pt_regs(child); + 
struct ptrace_syscall_info info = {
+		.op = ptrace_get_syscall_info_op(child),
+		.arch = syscall_get_arch(child),
+		.instruction_pointer = instruction_pointer(regs),
+		.stack_pointer = user_stack_pointer(regs),
+	};
+	unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
+	unsigned long write_size;
+
+	switch (info.op) {
+	case PTRACE_SYSCALL_INFO_ENTRY:
+		actual_size = ptrace_get_syscall_info_entry(child, regs, &info);
+		break;
+	case PTRACE_SYSCALL_INFO_EXIT:
+		actual_size = ptrace_get_syscall_info_exit(child, regs, &info);
+		break;
+	case PTRACE_SYSCALL_INFO_SECCOMP:
+		actual_size = ptrace_get_syscall_info_seccomp(child, regs, &info);
 		break;
 	}

From 26bb32768fe6552de044f782a58b3272073fbfc0 Mon Sep 17 00:00:00 2001
From: "Dmitry V. Levin"
Date: Mon, 3 Mar 2025 13:20:44 +0200
Subject: [PATCH 058/285] ptrace: introduce PTRACE_SET_SYSCALL_INFO request

PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.

This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.

As in the case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also
does not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.

The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.

Support for changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set.  Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.

PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.

Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen.  The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was the apparent failure
to provide an API for changing the first system call argument on the riscv
architecture.

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
	Modify information about the system call that caused the stop.
	The "data" argument is a pointer to struct ptrace_syscall_info
	that specifies the system call information to be set.
	The "addr" argument should be set to
	sizeof(struct ptrace_syscall_info).

Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Link: https://lkml.kernel.org/r/20250303112044.GF24170@strace.io
Signed-off-by: Dmitry V.
Levin Reviewed-by: Alexey Gladkov Reviewed-by: Charlie Jenkins Tested-by: Charlie Jenkins Reviewed-by: Eugene Syromiatnikov Reviewed-by: Oleg Nesterov Cc: Alexander Gordeev Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. Miller Cc: Dinh Nguyen Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Maciej W. Rozycki Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- include/uapi/linux/ptrace.h | 7 ++- kernel/ptrace.c | 121 +++++++++++++++++++++++++++++++++++- 2 files changed, 126 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index 72c038fc71d0..5f8ef6156752 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -74,6 +74,7 @@ struct seccomp_metadata { }; #define PTRACE_GET_SYSCALL_INFO 0x420e +#define PTRACE_SET_SYSCALL_INFO 0x4212 #define PTRACE_SYSCALL_INFO_NONE 0 #define PTRACE_SYSCALL_INFO_ENTRY 1 #define PTRACE_SYSCALL_INFO_EXIT 2 @@ -81,7 +82,8 @@ struct seccomp_metadata { struct ptrace_syscall_info { __u8 op; /* PTRACE_SYSCALL_INFO_* */ - __u8 pad[3]; + __u8 reserved; + __u16 flags; __u32 arch; __u64 instruction_pointer; __u64 stack_pointer; @@ -98,6 +100,7 @@ struct ptrace_syscall_info { __u64 nr; __u64 args[6]; __u32 ret_data; + __u32 reserved2; } seccomp; }; }; @@ -142,6 +145,8 @@ struct ptrace_sud_config { __u64 len; }; +/* 0x4212 is PTRACE_SET_SYSCALL_INFO */ + /* * These values are stored in task->ptrace_message * by ptrace_stop to describe the current syscall-stop. diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 22e7d74cf4cd..75a84efad40f 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -944,7 +944,10 @@ ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs, ptrace_get_syscall_info_entry(child, regs, info); info->seccomp.ret_data = child->ptrace_message; - /* ret_data is the last field in struct ptrace_syscall_info.seccomp */ + /* + * ret_data is the last non-reserved field + * in struct ptrace_syscall_info.seccomp + */ return offsetofend(struct ptrace_syscall_info, seccomp.ret_data); } @@ -1016,6 +1019,118 @@ ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size, write_size = min(actual_size, user_size); return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size; } + +static int +ptrace_set_syscall_info_entry(struct task_struct *child, struct pt_regs *regs, + struct ptrace_syscall_info *info) +{ + unsigned long args[ARRAY_SIZE(info->entry.args)]; + int nr = info->entry.nr; + int i; + + /* + * Check that the syscall number specified in info->entry.nr + * is either a value of type "int" or a sign-extended value + * of type "int". 
+ */ + if (nr != info->entry.nr) + return -ERANGE; + + for (i = 0; i < ARRAY_SIZE(args); i++) { + args[i] = info->entry.args[i]; + /* + * Check that the syscall argument specified in + * info->entry.args[i] is either a value of type + * "unsigned long" or a sign-extended value of type "long". + */ + if (args[i] != info->entry.args[i]) + return -ERANGE; + } + + syscall_set_nr(child, regs, nr); + /* + * If the syscall number is set to -1, setting syscall arguments is not + * just pointless, it would also clobber the syscall return value on + * those architectures that share the same register both for the first + * argument of syscall and its return value. + */ + if (nr != -1) + syscall_set_arguments(child, regs, args); + + return 0; +} + +static int +ptrace_set_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs, + struct ptrace_syscall_info *info) +{ + /* + * info->entry is currently a subset of info->seccomp, + * info->seccomp.ret_data is currently ignored. + */ + return ptrace_set_syscall_info_entry(child, regs, info); +} + +static int +ptrace_set_syscall_info_exit(struct task_struct *child, struct pt_regs *regs, + struct ptrace_syscall_info *info) +{ + long rval = info->exit.rval; + + /* + * Check that the return value specified in info->exit.rval + * is either a value of type "long" or a sign-extended value + * of type "long". + */ + if (rval != info->exit.rval) + return -ERANGE; + + if (info->exit.is_error) + syscall_set_return_value(child, regs, rval, 0); + else + syscall_set_return_value(child, regs, 0, rval); + + return 0; +} + +static int +ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size, + const void __user *datavp) +{ + struct pt_regs *regs = task_pt_regs(child); + struct ptrace_syscall_info info; + + if (user_size < sizeof(info)) + return -EINVAL; + + /* + * The compatibility is tracked by info.op and info.flags: if user-space + * does not instruct us to use unknown extra bits from future versions + * of ptrace_syscall_info, we are not going to read them either. + */ + if (copy_from_user(&info, datavp, sizeof(info))) + return -EFAULT; + + /* Reserved for future use. */ + if (info.flags || info.reserved) + return -EINVAL; + + /* Changing the type of the system call stop is not supported yet. */ + if (ptrace_get_syscall_info_op(child) != info.op) + return -EINVAL; + + switch (info.op) { + case PTRACE_SYSCALL_INFO_ENTRY: + return ptrace_set_syscall_info_entry(child, regs, &info); + case PTRACE_SYSCALL_INFO_EXIT: + return ptrace_set_syscall_info_exit(child, regs, &info); + case PTRACE_SYSCALL_INFO_SECCOMP: + return ptrace_set_syscall_info_seccomp(child, regs, &info); + default: + /* Other types of system call stops are not supported yet. */ + return -EINVAL; + } +} #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */ int ptrace_request(struct task_struct *child, long request, @@ -1234,6 +1349,10 @@ int ptrace_request(struct task_struct *child, long request, case PTRACE_GET_SYSCALL_INFO: ret = ptrace_get_syscall_info(child, addr, datavp); break; + + case PTRACE_SET_SYSCALL_INFO: + ret = ptrace_set_syscall_info(child, addr, datavp); + break; #endif case PTRACE_SECCOMP_GET_FILTER: From bc6fa711951185fa0fdf5974c50a1c4d0cd65be3 Mon Sep 17 00:00:00 2001 From: "Dmitry V. Levin" Date: Mon, 3 Mar 2025 13:20:52 +0200 Subject: [PATCH 059/285] selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO Check whether PTRACE_SET_SYSCALL_INFO semantics implemented in the kernel matches userspace expectations. 
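For orientation, a minimal tracer-side sketch of the API being exercised
here (not part of the selftest; it assumes the patched UAPI headers and a
raw sys_ptrace() wrapper like the one defined in the test below, since
glibc's ptrace() may not know the new request yet):

	struct ptrace_syscall_info info = { 0 };

	/* at a syscall-entry stop of the tracee */
	sys_ptrace(PTRACE_GET_SYSCALL_INFO, pid, sizeof(info),
		   (uintptr_t) &info);
	if (info.op == PTRACE_SYSCALL_INFO_ENTRY) {
		info.entry.nr = __NR_getppid;	/* redirect the syscall */
		sys_ptrace(PTRACE_SET_SYSCALL_INFO, pid, sizeof(info),
			   (uintptr_t) &info);
	}

As in the man page text above, "addr" carries sizeof(struct
ptrace_syscall_info) and "data" points at the structure.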
Link: https://lkml.kernel.org/r/20250303112052.GG24170@strace.io
Signed-off-by: Dmitry V. Levin
Reviewed-by: Oleg Nesterov
Cc: Alexander Gordeev Cc: Alexey Gladkov (Intel) Cc: Andreas Larsson Cc: anton ivanov Cc: Arnd Bergmann Cc: Borislav Betkov Cc: Brian Cain Cc: Charlie Jenkins Cc: Christian Borntraeger Cc: Christian Zankel Cc: Christophe Leroy Cc: Dave Hansen Cc: Davide Berardi Cc: David S. Miller Cc: Dinh Nguyen Cc: Eugene Syromiatnikov Cc: Eugene Syromyatnikov Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: Maciej W. Rozycki Cc: Madhavan Srinivasan Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mike Frysinger Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Renzo Davoi Cc: Richard Weinberger Cc: Rich Felker Cc: Russel King Cc: Shuah Khan Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Sven Schnelle Cc: Thomas Gleinxer Cc: Vasily Gorbik Cc: Vineet Gupta Cc: WANG Xuerui Cc: Will Deacon Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/ptrace/Makefile       |   2 +-
 .../selftests/ptrace/set_syscall_info.c       | 519 ++++++++++++++++++
 2 files changed, 520 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/ptrace/set_syscall_info.c

diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index 1c631740a730..c5e0b76ba6ac 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -std=c99 -pthread -Wall $(KHDR_INCLUDES)

-TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess get_set_sud
+TEST_GEN_PROGS := get_syscall_info set_syscall_info peeksiginfo vmaccess get_set_sud

 include ../lib.mk

diff --git a/tools/testing/selftests/ptrace/set_syscall_info.c b/tools/testing/selftests/ptrace/set_syscall_info.c
new file mode 100644
index 000000000000..4198248ef874
--- /dev/null
+++ b/tools/testing/selftests/ptrace/set_syscall_info.c
@@ -0,0 +1,519 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2018-2025 Dmitry V. Levin
+ * All rights reserved.
+ *
+ * Check whether PTRACE_SET_SYSCALL_INFO semantics implemented in the kernel
+ * matches userspace expectations.
+ */
+
+#include "../kselftest_harness.h"
+#include <err.h>
+#include <fcntl.h>
+#include <signal.h>
+#include <asm/unistd.h>
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+#if defined(_MIPS_SIM) && _MIPS_SIM == _MIPS_SIM_NABI32
+/*
+ * MIPS N32 is the only architecture where __kernel_ulong_t
+ * does not match the bitness of syscall arguments.
+ */
+typedef unsigned long long kernel_ulong_t;
+#else
+typedef __kernel_ulong_t kernel_ulong_t;
+#endif
+
+struct si_entry {
+	int nr;
+	kernel_ulong_t args[6];
+};
+struct si_exit {
+	unsigned int is_error;
+	int rval;
+};
+
+static unsigned int ptrace_stop;
+static pid_t tracee_pid;
+
+static int
+kill_tracee(pid_t pid)
+{
+	if (!pid)
+		return 0;
+
+	int saved_errno = errno;
+
+	int rc = kill(pid, SIGKILL);
+
+	errno = saved_errno;
+	return rc;
+}
+
+static long
+sys_ptrace(int request, pid_t pid, unsigned long addr, unsigned long data)
+{
+	return syscall(__NR_ptrace, request, pid, addr, data);
+}
+
+#define LOG_KILL_TRACEE(fmt, ...)
\ + do { \ + kill_tracee(tracee_pid); \ + TH_LOG("wait #%d: " fmt, \ + ptrace_stop, ##__VA_ARGS__); \ + } while (0) + +static void +check_psi_entry(struct __test_metadata *_metadata, + const struct ptrace_syscall_info *info, + const struct si_entry *exp_entry, + const char *text) +{ + unsigned int i; + int exp_nr = exp_entry->nr; +#if defined __s390__ || defined __s390x__ + /* s390 is the only architecture that has 16-bit syscall numbers */ + exp_nr &= 0xffff; +#endif + + ASSERT_EQ(PTRACE_SYSCALL_INFO_ENTRY, info->op) { + LOG_KILL_TRACEE("%s: entry stop mismatch", text); + } + ASSERT_TRUE(info->arch) { + LOG_KILL_TRACEE("%s: entry stop mismatch", text); + } + ASSERT_TRUE(info->instruction_pointer) { + LOG_KILL_TRACEE("%s: entry stop mismatch", text); + } + ASSERT_TRUE(info->stack_pointer) { + LOG_KILL_TRACEE("%s: entry stop mismatch", text); + } + ASSERT_EQ(exp_nr, info->entry.nr) { + LOG_KILL_TRACEE("%s: syscall nr mismatch", text); + } + for (i = 0; i < ARRAY_SIZE(exp_entry->args); ++i) { + ASSERT_EQ(exp_entry->args[i], info->entry.args[i]) { + LOG_KILL_TRACEE("%s: syscall arg #%u mismatch", + text, i); + } + } +} + +static void +check_psi_exit(struct __test_metadata *_metadata, + const struct ptrace_syscall_info *info, + const struct si_exit *exp_exit, + const char *text) +{ + ASSERT_EQ(PTRACE_SYSCALL_INFO_EXIT, info->op) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } + ASSERT_TRUE(info->arch) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } + ASSERT_TRUE(info->instruction_pointer) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } + ASSERT_TRUE(info->stack_pointer) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } + ASSERT_EQ(exp_exit->is_error, info->exit.is_error) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } + ASSERT_EQ(exp_exit->rval, info->exit.rval) { + LOG_KILL_TRACEE("%s: exit stop mismatch", text); + } +} + +TEST(set_syscall_info) +{ + const pid_t tracer_pid = getpid(); + const kernel_ulong_t dummy[] = { + (kernel_ulong_t) 0xdad0bef0bad0fed0ULL, + (kernel_ulong_t) 0xdad1bef1bad1fed1ULL, + (kernel_ulong_t) 0xdad2bef2bad2fed2ULL, + (kernel_ulong_t) 0xdad3bef3bad3fed3ULL, + (kernel_ulong_t) 0xdad4bef4bad4fed4ULL, + (kernel_ulong_t) 0xdad5bef5bad5fed5ULL, + }; + int splice_in[2], splice_out[2]; + + ASSERT_EQ(0, pipe(splice_in)); + ASSERT_EQ(0, pipe(splice_out)); + ASSERT_EQ(sizeof(dummy), write(splice_in[1], dummy, sizeof(dummy))); + + const struct { + struct si_entry entry[2]; + struct si_exit exit[2]; + } si[] = { + /* change scno, keep non-error rval */ + { + { + { + __NR_gettid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + __NR_getppid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 0, tracer_pid }, { 0, tracer_pid } + } + }, + + /* set scno to -1, keep error rval */ + { + { + { + __NR_chdir, + { + (uintptr_t) ".", + dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + -1, + { + (uintptr_t) ".", + dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 1, -ENOSYS }, { 1, -ENOSYS } + } + }, + + /* keep scno, change non-error rval */ + { + { + { + __NR_getppid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + __NR_getppid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 0, tracer_pid }, { 0, tracer_pid + 1 } + } + }, + + /* change arg1, keep non-error rval */ + { + { + { + __NR_chdir, + { + (uintptr_t) "", + dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + 
__NR_chdir, + { + (uintptr_t) ".", + dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 0, 0 }, { 0, 0 } + } + }, + + /* set scno to -1, change error rval to non-error */ + { + { + { + __NR_gettid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + -1, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 1, -ENOSYS }, { 0, tracer_pid } + } + }, + + /* change scno, change non-error rval to error */ + { + { + { + __NR_chdir, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + __NR_getppid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 0, tracer_pid }, { 1, -EISDIR } + } + }, + + /* change scno and all args, change non-error rval */ + { + { + { + __NR_gettid, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + __NR_splice, + { + splice_in[0], 0, splice_out[1], 0, + sizeof(dummy), SPLICE_F_NONBLOCK + } + } + }, { + { 0, sizeof(dummy) }, { 0, sizeof(dummy) + 1 } + } + }, + + /* change arg1, no exit stop */ + { + { + { + __NR_exit_group, + { + dummy[0], dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + }, { + __NR_exit_group, + { + 0, dummy[1], dummy[2], + dummy[3], dummy[4], dummy[5] + } + } + }, { + { 0, 0 }, { 0, 0 } + } + }, + }; + + long rc; + unsigned int i; + + tracee_pid = fork(); + + ASSERT_LE(0, tracee_pid) { + TH_LOG("fork: %m"); + } + + if (tracee_pid == 0) { + /* get the pid before PTRACE_TRACEME */ + tracee_pid = getpid(); + ASSERT_EQ(0, sys_ptrace(PTRACE_TRACEME, 0, 0, 0)) { + TH_LOG("PTRACE_TRACEME: %m"); + } + ASSERT_EQ(0, kill(tracee_pid, SIGSTOP)) { + /* cannot happen */ + TH_LOG("kill SIGSTOP: %m"); + } + for (i = 0; i < ARRAY_SIZE(si); ++i) { + rc = syscall(si[i].entry[0].nr, + si[i].entry[0].args[0], + si[i].entry[0].args[1], + si[i].entry[0].args[2], + si[i].entry[0].args[3], + si[i].entry[0].args[4], + si[i].entry[0].args[5]); + if (si[i].exit[1].is_error) { + if (rc != -1 || errno != -si[i].exit[1].rval) + break; + } else { + if (rc != si[i].exit[1].rval) + break; + } + } + /* + * Something went wrong, but in this state tracee + * cannot reliably issue syscalls, so just crash. 
+ */ + *(volatile unsigned char *) (uintptr_t) i = 42; + /* unreachable */ + _exit(i + 1); + } + + for (ptrace_stop = 0; ; ++ptrace_stop) { + struct ptrace_syscall_info info = { + .op = 0xff /* invalid PTRACE_SYSCALL_INFO_* op */ + }; + const size_t size = sizeof(info); + const int expected_entry_size = + (void *) &info.entry.args[6] - (void *) &info; + const int expected_exit_size = + (void *) (&info.exit.is_error + 1) - + (void *) &info; + int status; + + ASSERT_EQ(tracee_pid, wait(&status)) { + /* cannot happen */ + LOG_KILL_TRACEE("wait: %m"); + } + if (WIFEXITED(status)) { + tracee_pid = 0; /* the tracee is no more */ + ASSERT_EQ(0, WEXITSTATUS(status)) { + LOG_KILL_TRACEE("unexpected exit status %u", + WEXITSTATUS(status)); + } + break; + } + ASSERT_FALSE(WIFSIGNALED(status)) { + tracee_pid = 0; /* the tracee is no more */ + LOG_KILL_TRACEE("unexpected signal %u", + WTERMSIG(status)); + } + ASSERT_TRUE(WIFSTOPPED(status)) { + /* cannot happen */ + LOG_KILL_TRACEE("unexpected wait status %#x", status); + } + + ASSERT_LT(ptrace_stop, ARRAY_SIZE(si) * 2) { + LOG_KILL_TRACEE("ptrace stop overflow"); + } + + switch (WSTOPSIG(status)) { + case SIGSTOP: + ASSERT_EQ(0, ptrace_stop) { + LOG_KILL_TRACEE("unexpected signal stop"); + } + ASSERT_EQ(0, sys_ptrace(PTRACE_SETOPTIONS, tracee_pid, + 0, PTRACE_O_TRACESYSGOOD)) { + LOG_KILL_TRACEE("PTRACE_SETOPTIONS: %m"); + } + break; + + case SIGTRAP | 0x80: + ASSERT_LT(0, ptrace_stop) { + LOG_KILL_TRACEE("unexpected syscall stop"); + } + ASSERT_LT(0, (rc = sys_ptrace(PTRACE_GET_SYSCALL_INFO, + tracee_pid, size, + (uintptr_t) &info))) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #1: %m"); + } + if (ptrace_stop & 1) { + /* entering syscall */ + const struct si_entry *exp_entry = + &si[ptrace_stop / 2].entry[0]; + const struct si_entry *set_entry = + &si[ptrace_stop / 2].entry[1]; + + /* check ptrace_syscall_info before the changes */ + ASSERT_EQ(expected_entry_size, rc) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #1" + ": entry stop mismatch"); + } + check_psi_entry(_metadata, &info, exp_entry, + "PTRACE_GET_SYSCALL_INFO #1"); + + /* apply the changes */ + info.entry.nr = set_entry->nr; + for (i = 0; i < ARRAY_SIZE(set_entry->args); ++i) + info.entry.args[i] = set_entry->args[i]; + ASSERT_EQ(0, sys_ptrace(PTRACE_SET_SYSCALL_INFO, + tracee_pid, size, + (uintptr_t) &info)) { + LOG_KILL_TRACEE("PTRACE_SET_SYSCALL_INFO: %m"); + } + + /* check ptrace_syscall_info after the changes */ + memset(&info, 0, sizeof(info)); + info.op = 0xff; + ASSERT_LT(0, (rc = sys_ptrace(PTRACE_GET_SYSCALL_INFO, + tracee_pid, size, + (uintptr_t) &info))) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO: %m"); + } + ASSERT_EQ(expected_entry_size, rc) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #2" + ": entry stop mismatch"); + } + check_psi_entry(_metadata, &info, set_entry, + "PTRACE_GET_SYSCALL_INFO #2"); + } else { + /* exiting syscall */ + const struct si_exit *exp_exit = + &si[ptrace_stop / 2 - 1].exit[0]; + const struct si_exit *set_exit = + &si[ptrace_stop / 2 - 1].exit[1]; + + /* check ptrace_syscall_info before the changes */ + ASSERT_EQ(expected_exit_size, rc) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #1" + ": exit stop mismatch"); + } + check_psi_exit(_metadata, &info, exp_exit, + "PTRACE_GET_SYSCALL_INFO #1"); + + /* apply the changes */ + info.exit.is_error = set_exit->is_error; + info.exit.rval = set_exit->rval; + ASSERT_EQ(0, sys_ptrace(PTRACE_SET_SYSCALL_INFO, + tracee_pid, size, + (uintptr_t) &info)) { + LOG_KILL_TRACEE("PTRACE_SET_SYSCALL_INFO: %m"); + } + + /* 
check ptrace_syscall_info after the changes */ + memset(&info, 0, sizeof(info)); + info.op = 0xff; + ASSERT_LT(0, (rc = sys_ptrace(PTRACE_GET_SYSCALL_INFO, + tracee_pid, size, + (uintptr_t) &info))) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #2: %m"); + } + ASSERT_EQ(expected_exit_size, rc) { + LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO #2" + ": exit stop mismatch"); + } + check_psi_exit(_metadata, &info, set_exit, + "PTRACE_GET_SYSCALL_INFO #2"); + } + break; + + default: + LOG_KILL_TRACEE("unexpected stop signal %u", + WSTOPSIG(status)); + abort(); + } + + ASSERT_EQ(0, sys_ptrace(PTRACE_SYSCALL, tracee_pid, 0, 0)) { + LOG_KILL_TRACEE("PTRACE_SYSCALL: %m"); + } + } + + ASSERT_EQ(ptrace_stop, ARRAY_SIZE(si) * 2); +} + +TEST_HARNESS_MAIN From 7eeafde0ac05fa84a74d37af0efe5c6c5270bbef Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 25 Mar 2025 17:04:16 +0900 Subject: [PATCH 060/285] zsmalloc: cleanup headers includes Remove unused headers includes from zsmalloc and move pagemap.h and migrate.h includes into zpdesc header. Link: https://lkml.kernel.org/r/20250325080427.3449359-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Signed-off-by: Andrew Morton --- mm/zpdesc.h | 3 +++ mm/zsmalloc.c | 12 +----------- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/mm/zpdesc.h b/mm/zpdesc.h index fa47fece2237..57e7a4d6c6ca 100644 --- a/mm/zpdesc.h +++ b/mm/zpdesc.h @@ -7,6 +7,9 @@ #ifndef __MM_ZPDESC_H__ #define __MM_ZPDESC_H__ +#include +#include + /* * struct zpdesc - Memory descriptor for zpool memory. * @flags: Page flags, mostly unused by zsmalloc. diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 513b08c7c941..999b513c7fdf 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -26,17 +26,10 @@ #include #include #include -#include #include #include #include #include -#include -#include -#include -#include -#include -#include #include #include #include @@ -44,11 +37,8 @@ #include #include #include -#include -#include -#include #include -#include +#include #include "zpdesc.h" #define ZSPAGE_MAGIC 0x58 From a516403787e08119b70ce8bfff985272ef318a58 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Mon, 24 Mar 2025 06:53:26 +0000 Subject: [PATCH 061/285] fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions Patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions", v2. Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose information about guard regions. This allows userspace tools, such as CRIU, to detect and handle guard regions. Currently, CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to parsing /proc/pid/pagemap. Without this change, guard regions are incorrectly reported as swap-anon regions, leading CRIU to attempt dumping them and subsequently failing. The series includes updates to the documentation and selftests to reflect the new functionality. This patch (of 3): Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose information about guard regions. This allows userspace tools, such as CRIU, to detect and handle guard regions. 
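For illustration only (not part of this patch): a minimal userspace
consumer, modelled on the selftest added later in this series, could
query guard regions as sketched below. The fallback #defines are
assumptions for headers that predate this series (the value of
MADV_GUARD_INSTALL should be verified against asm-generic/mman-common.h),
and error checking is omitted for brevity.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef PAGE_IS_GUARD
#define PAGE_IS_GUARD (1 << 8)          /* value added by this series */
#endif
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102          /* assumed; verify against your headers */
#endif

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct page_region regions[4];
        struct pm_scan_arg arg = {
                .size = sizeof(arg),
                .start = (uintptr_t)p,
                .end = (uintptr_t)p + 4 * psz,
                .category_anyof_mask = PAGE_IS_GUARD,
                .return_mask = PAGE_IS_GUARD,
                .vec = (uintptr_t)regions,
                .vec_len = 4,
        };
        int fd = open("/proc/self/pagemap", O_RDONLY);
        int i, n;

        /* install a guard region over the second page only */
        madvise(p + psz, psz, MADV_GUARD_INSTALL);

        /* the ioctl returns the number of matching regions found */
        n = ioctl(fd, PAGEMAP_SCAN, &arg);
        for (i = 0; i < n; i++)
                printf("guard region: %llx-%llx\n",
                       (unsigned long long)regions[i].start,
                       (unsigned long long)regions[i].end);
        return 0;
}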
Link: https://lkml.kernel.org/r/20250324065328.107678-1-avagin@google.com Link: https://lkml.kernel.org/r/20250324065328.107678-2-avagin@google.com Signed-off-by: Andrei Vagin Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/pagemap.rst | 1 + fs/proc/task_mmu.c | 17 ++++++++++------- include/uapi/linux/fs.h | 1 + 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index afce291649dd..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -250,6 +250,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 994cde10e3f4..b9e4fbbdf6e6 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2087,7 +2087,8 @@ static int pagemap_release(struct inode *inode, struct file *file) #define PM_SCAN_CATEGORIES (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \ PAGE_IS_FILE | PAGE_IS_PRESENT | \ PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \ - PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY) + PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY | \ + PAGE_IS_GUARD) #define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC) struct pagemap_scan_private { @@ -2128,12 +2129,14 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p, if (!pte_swp_uffd_wp_any(pte)) categories |= PAGE_IS_WRITTEN; - if (p->masks_of_interest & PAGE_IS_FILE) { - swp = pte_to_swp_entry(pte); - if (is_pfn_swap_entry(swp) && - !folio_test_anon(pfn_swap_entry_folio(swp))) - categories |= PAGE_IS_FILE; - } + swp = pte_to_swp_entry(pte); + if (is_guard_swp_entry(swp)) + categories |= PAGE_IS_GUARD; + else if ((p->masks_of_interest & PAGE_IS_FILE) && + is_pfn_swap_entry(swp) && + !folio_test_anon(pfn_swap_entry_folio(swp))) + categories |= PAGE_IS_FILE; + if (pte_swp_soft_dirty(pte)) categories |= PAGE_IS_SOFT_DIRTY; } diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index e762e1af650c..0098b0ce8ccb 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -361,6 +361,7 @@ typedef int __bitwise __kernel_rwf_t; #define PAGE_IS_PFNZERO (1 << 5) #define PAGE_IS_HUGE (1 << 6) #define PAGE_IS_SOFT_DIRTY (1 << 7) +#define PAGE_IS_GUARD (1 << 8) /* * struct page_region - Page region with flags From 267bee0cd87a98832fd9da1976f0f53788b6a2b2 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Mon, 24 Mar 2025 06:53:27 +0000 Subject: [PATCH 062/285] tools headers UAPI: sync linux/fs.h with the kernel sources Required for a new PAGEMAP_SCAN test to verify guard region reporting. 
Link: https://lkml.kernel.org/r/20250324065328.107678-3-avagin@google.com Signed-off-by: Andrei Vagin Reviewed-by: Lorenzo Stoakes Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/include/uapi/linux/fs.h | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h index 8a27bc5c7a7f..24ddf7bc4f25 100644 --- a/tools/include/uapi/linux/fs.h +++ b/tools/include/uapi/linux/fs.h @@ -40,6 +40,15 @@ #define BLOCK_SIZE_BITS 10 #define BLOCK_SIZE (1< Date: Mon, 24 Mar 2025 06:53:28 +0000 Subject: [PATCH 063/285] selftests/mm: add PAGEMAP_SCAN guard region test Add a selftest to verify the PAGEMAP_SCAN ioctl correctly reports guard regions using the newly introduced PAGE_IS_GUARD flag. Link: https://lkml.kernel.org/r/20250324065328.107678-4-avagin@google.com Signed-off-by: Andrei Vagin Reviewed-by: Lorenzo Stoakes Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/guard-regions.c | 57 ++++++++++++++++++++++ 1 file changed, 57 insertions(+) diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index eba43ead13ae..0cd9d236649d 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -2075,4 +2076,60 @@ TEST_F(guard_regions, pagemap) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); } +/* + * Assert that PAGEMAP_SCAN correctly reports guard region ranges. + */ +TEST_F(guard_regions, pagemap_scan) +{ + const unsigned long page_size = self->page_size; + struct page_region pm_regs[10]; + struct pm_scan_arg pm_scan_args = { + .size = sizeof(struct pm_scan_arg), + .category_anyof_mask = PAGE_IS_GUARD, + .return_mask = PAGE_IS_GUARD, + .vec = (long)&pm_regs, + .vec_len = ARRAY_SIZE(pm_regs), + }; + int proc_fd, i; + char *ptr; + + proc_fd = open("/proc/self/pagemap", O_RDONLY); + ASSERT_NE(proc_fd, -1); + + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + pm_scan_args.start = (long)ptr; + pm_scan_args.end = (long)ptr + 10 * page_size; + ASSERT_EQ(ioctl(proc_fd, PAGEMAP_SCAN, &pm_scan_args), 0); + ASSERT_EQ(pm_scan_args.walk_end, (long)ptr + 10 * page_size); + + /* Install a guard region in every other page. */ + for (i = 0; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(syscall(__NR_madvise, ptr_p, page_size, MADV_GUARD_INSTALL), 0); + } + + /* + * Assert ioctl() returns the count of located regions, where each + * region spans every other page within the range of 10 pages. + */ + ASSERT_EQ(ioctl(proc_fd, PAGEMAP_SCAN, &pm_scan_args), 5); + ASSERT_EQ(pm_scan_args.walk_end, (long)ptr + 10 * page_size); + + /* Re-read from pagemap, and assert guard regions are detected. */ + for (i = 0; i < 5; i++) { + long ptr_p = (long)&ptr[2 * i * page_size]; + + ASSERT_EQ(pm_regs[i].start, ptr_p); + ASSERT_EQ(pm_regs[i].end, ptr_p + page_size); + ASSERT_EQ(pm_regs[i].categories, PAGE_IS_GUARD); + } + + ASSERT_EQ(close(proc_fd), 0); + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + TEST_HARNESS_MAIN From 979f3ef0f798d9b4fda4806d37fb1a264fc38566 Mon Sep 17 00:00:00 2001 From: Gavin Shan Date: Fri, 21 Mar 2025 22:02:21 +1000 Subject: [PATCH 064/285] mm: fix parameter passed to page_mapcount_is_type() Patch series "Fix parameter passed to page_mapcount_is_type()", v2. 
Found by code inspection. There are two places where the parameter passed to page_mapcount_is_type() is (page->_mapcount), which is incorrect since it should be one more than the value, as explained in the comments to page_mapcount_is_type(): (a) page_has_type() in page-flags.h (b) __dump_folio() in mm/debug.c PATCH[1] fixes the parameter for (a) PATCH[2] fixes the parameter for (b) Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. So the tag 'Cc: stable@vger.kernel.org' isn't needed. This patch (of 2): As the comments of page_mapcount_is_type() indicate, the parameter passed to the function should be one more than page->_mapcount. However, page->_mapcount (equivalent to page->page_type) is passed to the function by commit 4ffca5a96678 ("mm: support only one page_type per page") page_type_has_type() is replaced by page_mapcount_is_type(), but the parameter isn't adjusted. Fix it by replacing page_mapcount_is_type() with page_type_has_type() in page_has_type(). Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. Link: https://lkml.kernel.org/r/20250321120222.1456770-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20250321120222.1456770-2-gshan@redhat.com Fixes: 4ffca5a96678 ("mm: support only one page_type per page") Signed-off-by: Gavin Shan Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: gehao Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e6a21b62dcce..d3909cb1e576 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -982,7 +982,7 @@ static inline bool page_mapcount_is_type(unsigned int mapcount) static inline bool page_has_type(const struct page *page) { - return page_mapcount_is_type(data_race(page->page_type)); + return page_type_has_type(data_race(page->page_type)); } #define FOLIO_TYPE_OPS(lname, fname) \ From 79049bb48a76333646d076e29d4f99fedefdaf0d Mon Sep 17 00:00:00 2001 From: Gavin Shan Date: Fri, 21 Mar 2025 22:02:22 +1000 Subject: [PATCH 065/285] mm/debug: fix parameter passed to page_mapcount_is_type() As the comments of page_mapcount_is_type() indicate, the parameter passed to the function should be one more than page->_mapcount. However, page->_mapcount is passed to the function by commit 4ffca5a96678 ("mm: support only one page_type per page") where page_type_has_type() is replaced by page_mapcount_is_type(), but the parameter isn't adjusted. Fix the parameter for page_mapcount_is_type() to be (page->__mapcount + 1). Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. 
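To make the off-by-one concrete, here is a standalone userspace mock
(the names mirror page-flags.h, but the threshold and encoding are
illustrative only; two's-complement int is assumed):

#include <stdio.h>

/*
 * Mock, not kernel code: a page_type encoding lives in the high bits
 * of _mapcount, and _mapcount is biased by -1 (an unmapped page holds
 * -1). page_mapcount_is_type(m) is page_type_has_type(m - 1), so it
 * expects the biased value _mapcount + 1, not the raw counter.
 */
static int page_type_has_type(unsigned int page_type)
{
        return (int)page_type < 0;      /* "negative" values encode a type */
}

static int page_mapcount_is_type(unsigned int mapcount)
{
        return page_type_has_type(mapcount - 1);
}

int main(void)
{
        unsigned int boundary = 0x80000000u;    /* value right at the edge */

        printf("%d\n", page_type_has_type(boundary));        /* 1 */
        printf("%d\n", page_mapcount_is_type(boundary));     /* 0: off by one */
        printf("%d\n", page_mapcount_is_type(boundary + 1)); /* 1: +1 fixes it */
        return 0;
}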
[akpm@linux-foundation.org: simplify __dump_folio(), per David] Link: https://lkml.kernel.org/r/20250321120222.1456770-3-gshan@redhat.com Fixes: 4ffca5a96678 ("mm: support only one page_type per page") Signed-off-by: Gavin Shan Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: gehao Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Signed-off-by: Andrew Morton --- mm/debug.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/debug.c b/mm/debug.c index db83e381a8ae..907382257062 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -71,10 +71,12 @@ static void __dump_folio(struct folio *folio, struct page *page, unsigned long pfn, unsigned long idx) { struct address_space *mapping = folio_mapping(folio); - int mapcount = atomic_read(&page->_mapcount); + int mapcount = atomic_read(&page->_mapcount) + 1; char *type = ""; - mapcount = page_mapcount_is_type(mapcount) ? 0 : mapcount + 1; + if (page_mapcount_is_type(mapcount)) + mapcount = 0; + pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n", folio_ref_count(folio), mapcount, mapping, folio->index + idx, pfn); From b56e64466554dfd39a7695f5025f583d8a20288d Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 21 Mar 2025 12:37:11 +0100 Subject: [PATCH 066/285] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint() Patch series "kernel/events/uprobes: uprobe_write_opcode() rewrite", v3. Currently, uprobe_write_opcode() implements COW-breaking manually, which is really far from ideal. Further, there is interest in supporting uprobes on hugetlb pages [1], and leaving at least the COW-breaking to the core will make this much easier. Also, I think the current code doesn't really handle some things properly (see patch #3) when replacing/zapping pages. Let's rewrite it, to leave COW-breaking to the fault handler, and handle registration/unregistration by temporarily unmapping the anonymous page, modifying it, and mapping it again. We still have to implement zapping of anonymous pages ourselves, unfortunately. We could look into not performing the temporary unmapping if we can perform the write atomically, which would likely also make adding hugetlb support a lot easier. But, limited (e.g., only PMD/PUD) hugetlb support could be added on top of this with some tweaking. Note that we now won't have to allocate another anonymous folio when unregistering (which will be beneficial for hugetlb as well), we can simply modify the already-mapped one from the registration (if any). When registering a uprobe, we'll first trigger a ptrace-like write fault to break COW, to then modify the already-mapped page. Briefly sanity tested with perf probes and with the bpf uprobes selftest. This patch (of 3): Pass VMA instead of MM to remove_breakpoint() and remove the "MM" argument from install_breakpoint(), because it can easily be derived from the VMA. 
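As a roadmap for the large diff in the final patch of this series, the
rewritten write path reduces to roughly the following condensed sketch
(not compilable as-is: can_zap stands in for the folio reference and
page-identity checks, and ref_ctr handling, retries, MMU notifiers and
error paths are omitted):

        if (is_register)        /* break COW up front via a write fault */
                gup_flags |= FOLL_WRITE | FOLL_SPLIT_PMD;
        ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &page, NULL);

        if (folio_walk_start(&fw, vma, vaddr, 0)) {
                /* temporarily unmap the PTE and flush the TLB ... */
                fw.pte = ptep_clear_flush(vma, vaddr, fw.ptep);
                /* ... so the mapped page can be modified atomically */
                copy_to_page(fw.page, opcode_vaddr, &opcode,
                             UPROBE_SWBP_INSN_SIZE);

                if (can_zap)    /* unregistering, page matches file copy */
                        folio_remove_rmap_pte(folio, fw.page, vma);
                else            /* keep the modified anon page mapped */
                        set_pte_at(mm, vaddr, fw.ptep, pte_mkdirty(fw.pte));
                folio_walk_end(&fw, vma);
        }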
Link: https://lkml.kernel.org/r/20250321113713.204682-1-david@redhat.com Link: https://lkml.kernel.org/r/20250321113713.204682-2-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Jiri Olsa Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Kan Liang Cc: Mark Rutland Cc: "Masami Hiramatsu (Google)" Cc: Matthew Wilcox (Oracle) Cc: Namhyung kim Cc: Russel King Cc: tongtiangen Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8d783b5882b6..8dc23ca9f66f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1134,10 +1134,10 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm) return ret; } -static int -install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long vaddr) +static int install_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma, + unsigned long vaddr) { + struct mm_struct *mm = vma->vm_mm; bool first_uprobe; int ret; @@ -1162,9 +1162,11 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, return ret; } -static int -remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr) +static int remove_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma, + unsigned long vaddr) { + struct mm_struct *mm = vma->vm_mm; + set_bit(MMF_RECALC_UPROBES, &mm->flags); return set_orig_insn(&uprobe->arch, mm, vaddr); } @@ -1296,10 +1298,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) if (is_register) { /* consult only the "caller", new consumer. */ if (consumer_filter(new, mm)) - err = install_breakpoint(uprobe, mm, vma, info->vaddr); + err = install_breakpoint(uprobe, vma, info->vaddr); } else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) { if (!filter_chain(uprobe, mm)) - err |= remove_breakpoint(uprobe, mm, info->vaddr); + err |= remove_breakpoint(uprobe, vma, info->vaddr); } unlock: @@ -1472,7 +1474,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) continue; vaddr = offset_to_vaddr(vma, uprobe->offset); - err |= remove_breakpoint(uprobe, mm, vaddr); + err |= remove_breakpoint(uprobe, vma, vaddr); } mmap_read_unlock(mm); @@ -1610,7 +1612,7 @@ int uprobe_mmap(struct vm_area_struct *vma) if (!fatal_signal_pending(current) && filter_chain(uprobe, vma->vm_mm)) { unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset); - install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); + install_breakpoint(uprobe, vma, vaddr); } put_uprobe(uprobe); } From 8a5577428e8e586a107ee5f39eb4c86d32c971cd Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 21 Mar 2025 12:37:12 +0100 Subject: [PATCH 067/285] kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and uprobe_write_opcode() We already have the VMA, no need to look it up using get_user_page_vma_remote(). We can now switch to get_user_pages_remote(). 
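The visible difference is the return convention, which is why the error
handling changes shape; schematically:

        /* before: returns the page (or an ERR_PTR) and the VMA */
        old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
        if (IS_ERR(old_page))
                return PTR_ERR(old_page);

        /* after: the caller already holds the VMA; GUP returns a count */
        ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, NULL);
        if (ret != 1)
                return ret;     /* 0 pages pinned, or a negative errno */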
Link: https://lkml.kernel.org/r/20250321113713.204682-3-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: "Masami Hiramatsu (Google)" Cc: Matthew Wilcox (Oracle) Cc: Namhyung kim Cc: Russel King Cc: tongtiangen Signed-off-by: Andrew Morton --- arch/arm/probes/uprobes/core.c | 4 ++-- include/linux/uprobes.h | 6 +++--- kernel/events/uprobes.c | 33 +++++++++++++++++---------------- 3 files changed, 22 insertions(+), 21 deletions(-) diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c index f5f790c6e5f8..885e0c5e8c20 100644 --- a/arch/arm/probes/uprobes/core.c +++ b/arch/arm/probes/uprobes/core.c @@ -26,10 +26,10 @@ bool is_swbp_insn(uprobe_opcode_t *insn) (UPROBE_SWBP_ARM_INSN & 0x0fffffff); } -int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, +int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr) { - return uprobe_write_opcode(auprobe, mm, vaddr, + return uprobe_write_opcode(auprobe, vma, vaddr, __opcode_to_mem_arm(auprobe->bpinsn)); } diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 2e46b69ff0a6..516217c39094 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -188,13 +188,13 @@ struct uprobes_state { }; extern void __init uprobes_init(void); -extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); -extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); +extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr); +extern int set_orig_insn(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr); extern bool is_swbp_insn(uprobe_opcode_t *insn); extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); -extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t); extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool); extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8dc23ca9f66f..c33d710e8db7 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -474,19 +474,19 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm, * * uprobe_write_opcode - write the opcode at a given virtual address. * @auprobe: arch specific probepoint information. - * @mm: the probed process address space. + * @vma: the probed virtual memory area. * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. 
*/ -int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, - unsigned long vaddr, uprobe_opcode_t opcode) +int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, + unsigned long vaddr, uprobe_opcode_t opcode) { + struct mm_struct *mm = vma->vm_mm; struct uprobe *uprobe; struct page *old_page, *new_page; - struct vm_area_struct *vma; int ret, is_register, ref_ctr_updated = 0; bool orig_page_huge = false; unsigned int gup_flags = FOLL_FORCE; @@ -498,9 +498,9 @@ retry: if (is_register) gup_flags |= FOLL_SPLIT_PMD; /* Read the page with vaddr into memory */ - old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma); - if (IS_ERR(old_page)) - return PTR_ERR(old_page); + ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, NULL); + if (ret != 1) + return ret; ret = verify_opcode(old_page, vaddr, &opcode); if (ret <= 0) @@ -590,30 +590,31 @@ put_old: /** * set_swbp - store breakpoint at a given address. * @auprobe: arch specific probepoint information. - * @mm: the probed process address space. + * @vma: the probed virtual memory area. * @vaddr: the virtual address to insert the opcode. * * For mm @mm, store the breakpoint instruction at @vaddr. * Return 0 (success) or a negative errno. */ -int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) +int __weak set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma, + unsigned long vaddr) { - return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN); } /** * set_orig_insn - Restore the original instruction. - * @mm: the probed process address space. + * @vma: the probed virtual memory area. * @auprobe: arch specific probepoint information. * @vaddr: the virtual address to insert the opcode. * * For mm @mm, restore the original opcode (opcode) at @vaddr. * Return 0 (success) or a negative errno. */ -int __weak -set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) +int __weak set_orig_insn(struct arch_uprobe *auprobe, + struct vm_area_struct *vma, unsigned long vaddr) { - return uprobe_write_opcode(auprobe, mm, vaddr, + return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn); } @@ -1153,7 +1154,7 @@ static int install_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma, if (first_uprobe) set_bit(MMF_HAS_UPROBES, &mm->flags); - ret = set_swbp(&uprobe->arch, mm, vaddr); + ret = set_swbp(&uprobe->arch, vma, vaddr); if (!ret) clear_bit(MMF_RECALC_UPROBES, &mm->flags); else if (first_uprobe) @@ -1168,7 +1169,7 @@ static int remove_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma, struct mm_struct *mm = vma->vm_mm; set_bit(MMF_RECALC_UPROBES, &mm->flags); - return set_orig_insn(&uprobe->arch, mm, vaddr); + return set_orig_insn(&uprobe->arch, vma, vaddr); } struct map_info { From 6e3092d788be1de0aac56b81fc17551c76645cdf Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 21 Mar 2025 12:37:13 +0100 Subject: [PATCH 068/285] kernel/events/uprobes: uprobe_write_opcode() rewrite uprobe_write_opcode() does some pretty low-level things that really, it shouldn't be doing: for example, manually breaking COW by allocating anonymous folios and replacing mapped pages. Further, it does seem to do some shaky things: for example, writing to possible COW-shared anonymous pages or zapping anonymous pages that might be pinned. We're also not taking care of uffd, uffd-wp, softdirty ... 
although rather corner cases here. Let's just get it right like ordinary ptrace writes would. Let's rewrite the code, leaving COW-breaking to core-MM, triggered by FOLL_FORCE|FOLL_WRITE (note that the code was already using FOLL_FORCE). We'll use GUP to lookup/faultin the page and break COW if required. Then, we'll walk the page tables using a folio_walk to perform our page modification atomically by temporarily unmap the PTE + flushing the TLB. Likely, we could avoid the temporary unmap in case we can just atomically write the instruction, but that will be a separate project. Unfortunately, we still have to implement the zapping logic manually, because we only want to zap in specific circumstances (e.g., page content identical). Note that we can now handle large folios (compound pages) and the shared zeropage just fine, so drop these checks. Link: https://lkml.kernel.org/r/20250321113713.204682-4-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: "Masami Hiramatsu (Google)" Cc: Matthew Wilcox (Oracle) Cc: Namhyung kim Cc: Russel King Cc: tongtiangen Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 316 ++++++++++++++++++++-------------------- 1 file changed, 160 insertions(+), 156 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c33d710e8db7..4c965ba77f9f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -29,6 +29,7 @@ #include #include #include /* check_stable_address_space */ +#include #include @@ -151,91 +152,6 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr) return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start); } -/** - * __replace_page - replace page in vma by new page. - * based on replace_page in mm/ksm.c - * - * @vma: vma that holds the pte pointing to page - * @addr: address the old @page is mapped at - * @old_page: the page we are replacing by new_page - * @new_page: the modified page we replace page by - * - * If @new_page is NULL, only unmap @old_page. - * - * Returns 0 on success, negative error code otherwise. - */ -static int __replace_page(struct vm_area_struct *vma, unsigned long addr, - struct page *old_page, struct page *new_page) -{ - struct folio *old_folio = page_folio(old_page); - struct folio *new_folio; - struct mm_struct *mm = vma->vm_mm; - DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0); - int err; - struct mmu_notifier_range range; - pte_t pte; - - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, - addr + PAGE_SIZE); - - if (new_page) { - new_folio = page_folio(new_page); - err = mem_cgroup_charge(new_folio, vma->vm_mm, GFP_KERNEL); - if (err) - return err; - } - - /* For folio_free_swap() below */ - folio_lock(old_folio); - - mmu_notifier_invalidate_range_start(&range); - err = -EAGAIN; - if (!page_vma_mapped_walk(&pvmw)) - goto unlock; - VM_BUG_ON_PAGE(addr != pvmw.address, old_page); - pte = ptep_get(pvmw.pte); - - /* - * Handle PFN swap PTES, such as device-exclusive ones, that actually - * map pages: simply trigger GUP again to fix it up. 
- */ - if (unlikely(!pte_present(pte))) { - page_vma_mapped_walk_done(&pvmw); - goto unlock; - } - - if (new_page) { - folio_get(new_folio); - folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE); - folio_add_lru_vma(new_folio, vma); - } else - /* no new page, just dec_mm_counter for old_page */ - dec_mm_counter(mm, MM_ANONPAGES); - - if (!folio_test_anon(old_folio)) { - dec_mm_counter(mm, mm_counter_file(old_folio)); - inc_mm_counter(mm, MM_ANONPAGES); - } - - flush_cache_page(vma, addr, pte_pfn(pte)); - ptep_clear_flush(vma, addr, pvmw.pte); - if (new_page) - set_pte_at(mm, addr, pvmw.pte, - mk_pte(new_page, vma->vm_page_prot)); - - folio_remove_rmap_pte(old_folio, old_page, vma); - if (!folio_mapped(old_folio)) - folio_free_swap(old_folio); - page_vma_mapped_walk_done(&pvmw); - folio_put(old_folio); - - err = 0; - unlock: - mmu_notifier_invalidate_range_end(&range); - folio_unlock(old_folio); - return err; -} - /** * is_swbp_insn - check if instruction is breakpoint instruction. * @insn: instruction to be checked. @@ -463,6 +379,95 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm, return ret; } +static bool orig_page_is_identical(struct vm_area_struct *vma, + unsigned long vaddr, struct page *page, bool *pmd_mappable) +{ + const pgoff_t index = vaddr_to_offset(vma, vaddr) >> PAGE_SHIFT; + struct folio *orig_folio = filemap_get_folio(vma->vm_file->f_mapping, + index); + struct page *orig_page; + bool identical; + + if (IS_ERR(orig_folio)) + return false; + orig_page = folio_file_page(orig_folio, index); + + *pmd_mappable = folio_test_pmd_mappable(orig_folio); + identical = folio_test_uptodate(orig_folio) && + pages_identical(page, orig_page); + folio_put(orig_folio); + return identical; +} + +static int __uprobe_write_opcode(struct vm_area_struct *vma, + struct folio_walk *fw, struct folio *folio, + unsigned long opcode_vaddr, uprobe_opcode_t opcode) +{ + const unsigned long vaddr = opcode_vaddr & PAGE_MASK; + const bool is_register = !!is_swbp_insn(&opcode); + bool pmd_mappable; + + /* For now, we'll only handle PTE-mapped folios. */ + if (fw->level != FW_LEVEL_PTE) + return -EFAULT; + + /* + * See can_follow_write_pte(): we'd actually prefer a writable PTE here, + * but the VMA might not be writable. + */ + if (!pte_write(fw->pte)) { + if (!PageAnonExclusive(fw->page)) + return -EFAULT; + if (unlikely(userfaultfd_pte_wp(vma, fw->pte))) + return -EFAULT; + /* SOFTDIRTY is handled via pte_mkdirty() below. */ + } + + /* + * We'll temporarily unmap the page and flush the TLB, such that we can + * modify the page atomically. + */ + flush_cache_page(vma, vaddr, pte_pfn(fw->pte)); + fw->pte = ptep_clear_flush(vma, vaddr, fw->ptep); + copy_to_page(fw->page, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE); + + /* + * When unregistering, we may only zap a PTE if uffd is disabled and + * there are no unexpected folio references ... + */ + if (is_register || userfaultfd_missing(vma) || + (folio_ref_count(folio) != folio_mapcount(folio) + 1 + + folio_test_swapcache(folio) * folio_nr_pages(folio))) + goto remap; + + /* + * ... and the mapped page is identical to the original page that + * would get faulted in on next access. 
+ */ + if (!orig_page_is_identical(vma, vaddr, fw->page, &pmd_mappable)) + goto remap; + + dec_mm_counter(vma->vm_mm, MM_ANONPAGES); + folio_remove_rmap_pte(folio, fw->page, vma); + if (!folio_mapped(folio) && folio_test_swapcache(folio) && + folio_trylock(folio)) { + folio_free_swap(folio); + folio_unlock(folio); + } + folio_put(folio); + + return pmd_mappable; +remap: + /* + * Make sure that our copy_to_page() changes become visible before the + * set_pte_at() write. + */ + smp_wmb(); + /* We modified the page. Make sure to mark the PTE dirty. */ + set_pte_at(vma->vm_mm, vaddr, fw->ptep, pte_mkdirty(fw->pte)); + return 0; +} + /* * NOTE: * Expect the breakpoint instruction to be the smallest size instruction for @@ -475,116 +480,115 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm, * uprobe_write_opcode - write the opcode at a given virtual address. * @auprobe: arch specific probepoint information. * @vma: the probed virtual memory area. - * @vaddr: the virtual address to store the opcode. - * @opcode: opcode to be written at @vaddr. + * @opcode_vaddr: the virtual address to store the opcode. + * @opcode: opcode to be written at @opcode_vaddr. * * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. */ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, - unsigned long vaddr, uprobe_opcode_t opcode) + const unsigned long opcode_vaddr, uprobe_opcode_t opcode) { + const unsigned long vaddr = opcode_vaddr & PAGE_MASK; struct mm_struct *mm = vma->vm_mm; struct uprobe *uprobe; - struct page *old_page, *new_page; int ret, is_register, ref_ctr_updated = 0; - bool orig_page_huge = false; unsigned int gup_flags = FOLL_FORCE; + struct mmu_notifier_range range; + struct folio_walk fw; + struct folio *folio; + struct page *page; is_register = is_swbp_insn(&opcode); uprobe = container_of(auprobe, struct uprobe, arch); -retry: + if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags))) + return -EINVAL; + + /* + * When registering, we have to break COW to get an exclusive anonymous + * page that we can safely modify. Use FOLL_WRITE to trigger a write + * fault if required. When unregistering, we might be lucky and the + * anon page is already gone. So defer write faults until really + * required. Use FOLL_SPLIT_PMD, because __uprobe_write_opcode() + * cannot deal with PMDs yet. + */ if (is_register) - gup_flags |= FOLL_SPLIT_PMD; - /* Read the page with vaddr into memory */ - ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, NULL); - if (ret != 1) - return ret; + gup_flags |= FOLL_WRITE | FOLL_SPLIT_PMD; - ret = verify_opcode(old_page, vaddr, &opcode); +retry: + ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &page, NULL); if (ret <= 0) - goto put_old; + goto out; + folio = page_folio(page); - if (is_zero_page(old_page)) { - ret = -EINVAL; - goto put_old; - } - - if (WARN(!is_register && PageCompound(old_page), - "uprobe unregister should never work on compound page\n")) { - ret = -EINVAL; - goto put_old; + ret = verify_opcode(page, opcode_vaddr, &opcode); + if (ret <= 0) { + folio_put(folio); + goto out; } /* We are going to replace instruction, update ref_ctr. */ if (!ref_ctr_updated && uprobe->ref_ctr_offset) { ret = update_ref_ctr(uprobe, mm, is_register ? 
1 : -1); - if (ret) - goto put_old; + if (ret) { + folio_put(folio); + goto out; + } ref_ctr_updated = 1; } ret = 0; - if (!is_register && !PageAnon(old_page)) - goto put_old; - - ret = anon_vma_prepare(vma); - if (ret) - goto put_old; - - ret = -ENOMEM; - new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr); - if (!new_page) - goto put_old; - - __SetPageUptodate(new_page); - copy_highpage(new_page, old_page); - copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE); - - if (!is_register) { - struct page *orig_page; - pgoff_t index; - - VM_BUG_ON_PAGE(!PageAnon(old_page), old_page); - - index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT; - orig_page = find_get_page(vma->vm_file->f_inode->i_mapping, - index); - - if (orig_page) { - if (PageUptodate(orig_page) && - pages_identical(new_page, orig_page)) { - /* let go new_page */ - put_page(new_page); - new_page = NULL; - - if (PageCompound(orig_page)) - orig_page_huge = true; - } - put_page(orig_page); - } + if (unlikely(!folio_test_anon(folio))) { + VM_WARN_ON_ONCE(is_register); + folio_put(folio); + goto out; } - ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page); - if (new_page) - put_page(new_page); -put_old: - put_page(old_page); + if (!is_register) { + /* + * In the common case, we'll be able to zap the page when + * unregistering. So trigger MMU notifiers now, as we won't + * be able to do it under PTL. + */ + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, + vaddr, vaddr + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); + } - if (unlikely(ret == -EAGAIN)) + ret = -EAGAIN; + /* Walk the page tables again, to perform the actual update. */ + if (folio_walk_start(&fw, vma, vaddr, 0)) { + if (fw.page == page) + ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr, opcode); + folio_walk_end(&fw, vma); + } + + if (!is_register) + mmu_notifier_invalidate_range_end(&range); + + folio_put(folio); + switch (ret) { + case -EFAULT: + gup_flags |= FOLL_WRITE | FOLL_SPLIT_PMD; + fallthrough; + case -EAGAIN: goto retry; + default: + break; + } +out: /* Revert back reference counter if instruction update failed. */ - if (ret && is_register && ref_ctr_updated) + if (ret < 0 && is_register && ref_ctr_updated) update_ref_ctr(uprobe, mm, -1); /* try collapse pmd for compound page */ - if (!ret && orig_page_huge) + if (ret > 0) collapse_pte_mapped_thp(mm, vaddr, false); - return ret; + return ret < 0 ? ret : 0; } /** From ee414bd97b3fa0a4f74e40004e3b4191326bd46c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 7 Apr 2025 14:01:54 -0400 Subject: [PATCH 069/285] mm: page_alloc: tighten up find_suitable_fallback() find_suitable_fallback() is not as efficient as it could be, and somewhat difficult to follow. 1. should_try_claim_block() is a loop invariant. There is no point in checking fallback areas if the caller is interested in claimable blocks but the order and the migratetype don't allow for that. 2. __rmqueue_steal() doesn't care about claimability, so it shouldn't have to run those tests. Different callers want different things from this helper: 1. __compact_finished() scans orders up until it finds a claimable block 2. __rmqueue_claim() scans orders down as long as blocks are claimable 3. __rmqueue_steal() doesn't care about claimability at all Move should_try_claim_block() out of the loop. Only test it for the two callers who care in the first place. 
Distinguish "no blocks" from "order + mt are not claimable" in the return value; __rmqueue_claim() can stop once order becomes unclaimable, __compact_finished() can keep advancing until order becomes claimable. Before: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 85,294.85 msec task-clock # 5.644 CPUs utilized ( +- 0.32% ) 15,968 context-switches # 187.209 /sec ( +- 3.81% ) 153 cpu-migrations # 1.794 /sec ( +- 3.29% ) 801,808 page-faults # 9.400 K/sec ( +- 0.10% ) 733,358,331,786 instructions # 1.87 insn per cycle ( +- 0.20% ) (64.94%) 392,622,904,199 cycles # 4.603 GHz ( +- 0.31% ) (64.84%) 148,563,488,531 branches # 1.742 G/sec ( +- 0.18% ) (63.86%) 152,143,228 branch-misses # 0.10% of all branches ( +- 1.19% ) (62.82%) 15.1128 +- 0.0637 seconds time elapsed ( +- 0.42% ) After: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 84,380.21 msec task-clock # 5.664 CPUs utilized ( +- 0.21% ) 16,656 context-switches # 197.392 /sec ( +- 3.27% ) 151 cpu-migrations # 1.790 /sec ( +- 3.28% ) 801,703 page-faults # 9.501 K/sec ( +- 0.09% ) 731,914,183,060 instructions # 1.88 insn per cycle ( +- 0.38% ) (64.90%) 388,673,535,116 cycles # 4.606 GHz ( +- 0.24% ) (65.06%) 148,251,482,143 branches # 1.757 G/sec ( +- 0.37% ) (63.92%) 149,766,550 branch-misses # 0.10% of all branches ( +- 1.22% ) (62.88%) 14.8968 +- 0.0486 seconds time elapsed ( +- 0.33% ) Link: https://lkml.kernel.org/r/20250407180154.63348-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Brendan Jackman Tested-by: Shivank Garg Reviewed-by: Vlastimil Babka Cc: Carlos Song Cc: Mel Gorman Signed-off-by: Andrew Morton --- mm/compaction.c | 4 +--- mm/internal.h | 2 +- mm/page_alloc.c | 31 +++++++++++++------------------ 3 files changed, 15 insertions(+), 22 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index fe51a73b91a7..3925cb61dbb8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2344,7 +2344,6 @@ static enum compact_result __compact_finished(struct compact_control *cc) ret = COMPACT_NO_SUITABLE_PAGE; for (order = cc->order; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &cc->zone->free_area[order]; - bool claim_block; /* Job done if page is free of the right migratetype */ if (!free_area_empty(area, migratetype)) @@ -2360,8 +2359,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) * Job done if allocation would steal freepages from * other migratetype buddy lists. */ - if (find_suitable_fallback(area, order, migratetype, - true, &claim_block) != -1) + if (find_suitable_fallback(area, order, migratetype, true) >= 0) /* * Movable pages are OK in any pageblock. If we are * stealing for a non-movable allocation, make sure diff --git a/mm/internal.h b/mm/internal.h index 5c7a2b43ad76..0cf1f534ee1a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -910,7 +910,7 @@ static inline void init_cma_pageblock(struct page *page) int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool claim_only, bool *claim_block); + int migratetype, bool claimable); static inline bool free_area_empty(struct free_area *area, int migratetype) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b8c15d572f2d..237dcca69e51 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2068,31 +2068,25 @@ static bool should_try_claim_block(unsigned int order, int start_mt) /* * Check whether there is a suitable fallback freepage with requested order. 
- * Sets *claim_block to instruct the caller whether it should convert a whole - * pageblock to the returned migratetype. - * If only_claim is true, this function returns fallback_mt only if + * If claimable is true, this function returns fallback_mt only if * we would do this whole-block claiming. This would help to reduce * fragmentation due to mixed migratetype pages in one pageblock. */ int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool only_claim, bool *claim_block) + int migratetype, bool claimable) { int i; - int fallback_mt; + + if (claimable && !should_try_claim_block(order, migratetype)) + return -2; if (area->nr_free == 0) return -1; - *claim_block = false; for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) { - fallback_mt = fallbacks[migratetype][i]; - if (free_area_empty(area, fallback_mt)) - continue; + int fallback_mt = fallbacks[migratetype][i]; - if (should_try_claim_block(order, migratetype)) - *claim_block = true; - - if (*claim_block || !only_claim) + if (!free_area_empty(area, fallback_mt)) return fallback_mt; } @@ -2189,7 +2183,6 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, int min_order = order; struct page *page; int fallback_mt; - bool claim_block; /* * Do not steal pages from freelists belonging to other pageblocks @@ -2208,11 +2201,14 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, --current_order) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &claim_block); + start_migratetype, true); + + /* No block in that order */ if (fallback_mt == -1) continue; - if (!claim_block) + /* Advanced into orders too low to claim, abort */ + if (fallback_mt == -2) break; page = get_page_from_free_area(area, fallback_mt); @@ -2240,12 +2236,11 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype) int current_order; struct page *page; int fallback_mt; - bool claim_block; for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &claim_block); + start_migratetype, false); if (fallback_mt == -1) continue; From e487a5d513cb6a0faf7e48523416434b111318b3 Mon Sep 17 00:00:00 2001 From: Li Wang Date: Mon, 7 Apr 2025 16:42:01 +0800 Subject: [PATCH 070/285] selftest/mm: make hugetlb_reparenting_test tolerant to async reparenting In cgroup v2, memory and hugetlb usage reparenting is asynchronous. This can cause test flakiness when immediately asserting usage after deleting a child cgroup. To address this, add a helper function `assert_with_retry()` that checks usage values with a timeout-based retry. This improves test stability without relying on fixed sleep delays. Also bump up the tolerance size to 7MB. To avoid False Positives: ... # Assert memory charged correctly for child only use. # actual a = 11 MB # expected a = 0 MB # fail # cleanup # [FAIL] not ok 11 hugetlb_reparenting_test.sh -cgroup-v2 # exit=1 # 0 # SUMMARY: PASS=10 SKIP=0 FAIL=1 Link: https://lkml.kernel.org/r/20250407084201.74492-1-liwang@redhat.com Signed-off-by: Li Wang Tested-by: Donet Tom Cc: Waiman Long Cc: Anshuman Khandual Cc: Dev Jain Cc: Kirill A. 
Shuemov Cc: Shuah Khan Signed-off-by: Andrew Morton --- .../selftests/mm/hugetlb_reparenting_test.sh | 96 ++++++++----------- 1 file changed, 41 insertions(+), 55 deletions(-) diff --git a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh index 0b0d4ba1af27..9245549b66cf 100755 --- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh +++ b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh @@ -36,7 +36,7 @@ else do_umount=1 fi fi -MNT='/mnt/huge/' +MNT='/mnt/huge' function get_machine_hugepage_size() { hpz=$(grep -i hugepagesize /proc/meminfo) @@ -60,6 +60,41 @@ function cleanup() { set -e } +function assert_with_retry() { + local actual_path="$1" + local expected="$2" + local tolerance=$((7 * 1024 * 1024)) + local timeout=20 + local interval=1 + local start_time + local now + local elapsed + local actual + + start_time=$(date +%s) + + while true; do + actual="$(cat "$actual_path")" + + if [[ $actual -ge $(($expected - $tolerance)) ]] && + [[ $actual -le $(($expected + $tolerance)) ]]; then + return 0 + fi + + now=$(date +%s) + elapsed=$((now - start_time)) + + if [[ $elapsed -ge $timeout ]]; then + echo "actual = $((${actual%% *} / 1024 / 1024)) MB" + echo "expected = $((${expected%% *} / 1024 / 1024)) MB" + cleanup + exit 1 + fi + + sleep $interval + done +} + function assert_state() { local expected_a="$1" local expected_a_hugetlb="$2" @@ -70,58 +105,13 @@ function assert_state() { expected_b="$3" expected_b_hugetlb="$4" fi - local tolerance=$((5 * 1024 * 1024)) - local actual_a - actual_a="$(cat "$CGROUP_ROOT"/a/memory.$usage_file)" - if [[ $actual_a -lt $(($expected_a - $tolerance)) ]] || - [[ $actual_a -gt $(($expected_a + $tolerance)) ]]; then - echo actual a = $((${actual_a%% *} / 1024 / 1024)) MB - echo expected a = $((${expected_a%% *} / 1024 / 1024)) MB - echo fail + assert_with_retry "$CGROUP_ROOT/a/memory.$usage_file" "$expected_a" + assert_with_retry "$CGROUP_ROOT/a/hugetlb.${MB}MB.$usage_file" "$expected_a_hugetlb" - cleanup - exit 1 - fi - - local actual_a_hugetlb - actual_a_hugetlb="$(cat "$CGROUP_ROOT"/a/hugetlb.${MB}MB.$usage_file)" - if [[ $actual_a_hugetlb -lt $(($expected_a_hugetlb - $tolerance)) ]] || - [[ $actual_a_hugetlb -gt $(($expected_a_hugetlb + $tolerance)) ]]; then - echo actual a hugetlb = $((${actual_a_hugetlb%% *} / 1024 / 1024)) MB - echo expected a hugetlb = $((${expected_a_hugetlb%% *} / 1024 / 1024)) MB - echo fail - - cleanup - exit 1 - fi - - if [[ -z "$expected_b" || -z "$expected_b_hugetlb" ]]; then - return - fi - - local actual_b - actual_b="$(cat "$CGROUP_ROOT"/a/b/memory.$usage_file)" - if [[ $actual_b -lt $(($expected_b - $tolerance)) ]] || - [[ $actual_b -gt $(($expected_b + $tolerance)) ]]; then - echo actual b = $((${actual_b%% *} / 1024 / 1024)) MB - echo expected b = $((${expected_b%% *} / 1024 / 1024)) MB - echo fail - - cleanup - exit 1 - fi - - local actual_b_hugetlb - actual_b_hugetlb="$(cat "$CGROUP_ROOT"/a/b/hugetlb.${MB}MB.$usage_file)" - if [[ $actual_b_hugetlb -lt $(($expected_b_hugetlb - $tolerance)) ]] || - [[ $actual_b_hugetlb -gt $(($expected_b_hugetlb + $tolerance)) ]]; then - echo actual b hugetlb = $((${actual_b_hugetlb%% *} / 1024 / 1024)) MB - echo expected b hugetlb = $((${expected_b_hugetlb%% *} / 1024 / 1024)) MB - echo fail - - cleanup - exit 1 + if [[ -n "$expected_b" && -n "$expected_b_hugetlb" ]]; then + assert_with_retry "$CGROUP_ROOT/a/b/memory.$usage_file" "$expected_b" + assert_with_retry "$CGROUP_ROOT/a/b/hugetlb.${MB}MB.$usage_file" 
"$expected_b_hugetlb" fi } @@ -174,7 +164,6 @@ size=$((${MB} * 1024 * 1024 * 25)) # 50MB = 25 * 2MB hugepages. cleanup -echo echo echo Test charge, rmdir, uncharge setup @@ -195,7 +184,6 @@ cleanup echo done echo -echo if [[ ! $cgroup2 ]]; then echo "Test parent and child hugetlb usage" setup @@ -212,7 +200,6 @@ if [[ ! $cgroup2 ]]; then assert_state 0 $(($size * 2)) 0 $size rmdir "$CGROUP_ROOT"/a/b - sleep 5 echo Assert memory reparent correctly. assert_state 0 $(($size * 2)) @@ -224,7 +211,6 @@ if [[ ! $cgroup2 ]]; then cleanup fi -echo echo echo "Test child only hugetlb usage" echo setup From e064e7384f991c7df81999cad4ce30fed7ef7d88 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 7 Apr 2025 11:01:11 +0530 Subject: [PATCH 071/285] mm/ptdump: split note_page() into level specific callbacks Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2. Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. A similar problem exists for effective_prot(), although it is restricted to x86 platform. This series splits note_page() and effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t page table entry as an argument instead and later on all subscribing platforms could derive pxd_val() from the table entries as required and proceed as before. Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. This patch (of 3): Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. Split note_page() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then subscribing platforms just derive pxd_val() from the entries as required and proceed as earlier. Also add a note_page_flush() callback for flushing the last page table page that was being handled earlier via level = -1. 
Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Cc: Catalin Marinas Cc: Will Deacon Cc: Madhavan Srinivasan Cc: Nicholas Piggin Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Ard Biesheuvel Cc: Dave Hansen Cc: Mark Rutland Cc: Ryan Roberts Signed-off-by: Andrew Morton --- arch/arm64/include/asm/ptdump.h | 16 +++++++++-- arch/arm64/mm/ptdump.c | 48 ++++++++++++++++++++++++++++++--- arch/powerpc/mm/ptdump/ptdump.c | 46 +++++++++++++++++++++++++++++-- arch/riscv/mm/ptdump.c | 46 +++++++++++++++++++++++++++++-- arch/s390/mm/dump_pagetables.c | 46 +++++++++++++++++++++++++++++-- arch/x86/mm/dump_pagetables.c | 39 ++++++++++++++++++++++++++- include/linux/ptdump.h | 9 ++++--- mm/ptdump.c | 40 ++++++++++++++++++++------- 8 files changed, 266 insertions(+), 24 deletions(-) diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h index b2931d1ae0fb..01033c1d38dc 100644 --- a/arch/arm64/include/asm/ptdump.h +++ b/arch/arm64/include/asm/ptdump.h @@ -59,7 +59,13 @@ struct ptdump_pg_state { void ptdump_walk(struct seq_file *s, struct ptdump_info *info); void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, - u64 val); + pteval_t val); +void note_page_pte(struct ptdump_state *st, unsigned long addr, pte_t pte); +void note_page_pmd(struct ptdump_state *st, unsigned long addr, pmd_t pmd); +void note_page_pud(struct ptdump_state *st, unsigned long addr, pud_t pud); +void note_page_p4d(struct ptdump_state *st, unsigned long addr, p4d_t p4d); +void note_page_pgd(struct ptdump_state *st, unsigned long addr, pgd_t pgd); +void note_page_flush(struct ptdump_state *st); #ifdef CONFIG_PTDUMP_DEBUGFS #define EFI_RUNTIME_MAP_END DEFAULT_MAP_WINDOW_64 void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name); @@ -69,7 +75,13 @@ static inline void ptdump_debugfs_register(struct ptdump_info *info, #endif /* CONFIG_PTDUMP_DEBUGFS */ #else static inline void note_page(struct ptdump_state *pt_st, unsigned long addr, - int level, u64 val) { } + int level, pteval_t val) { } +static inline void note_page_pte(struct ptdump_state *st, unsigned long addr, pte_t pte) { } +static inline void note_page_pmd(struct ptdump_state *st, unsigned long addr, pmd_t pmd) { } +static inline void note_page_pud(struct ptdump_state *st, unsigned long addr, pud_t pud) { } +static inline void note_page_p4d(struct ptdump_state *st, unsigned long addr, p4d_t p4d) { } +static inline void note_page_pgd(struct ptdump_state *st, unsigned long addr, pgd_t pgd) { } +static inline void note_page_flush(struct ptdump_state *st) { } #endif /* CONFIG_PTDUMP */ #endif /* __ASM_PTDUMP_H */ diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c index 8cec0da4cff2..ac0c20ba0cd9 100644 --- a/arch/arm64/mm/ptdump.c +++ b/arch/arm64/mm/ptdump.c @@ -189,7 +189,7 @@ static void note_prot_wx(struct ptdump_pg_state *st, unsigned long addr) } void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, - u64 val) + pteval_t val) { struct ptdump_pg_state *st = container_of(pt_st, struct ptdump_pg_state, ptdump); struct ptdump_pg_level *pg_level = st->pg_level; @@ -251,6 +251,38 @@ void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, } +void note_page_pte(struct ptdump_state *pt_st, unsigned long addr, pte_t pte) +{ + note_page(pt_st, 
addr, 4, pte_val(pte)); +} + +void note_page_pmd(struct ptdump_state *pt_st, unsigned long addr, pmd_t pmd) +{ + note_page(pt_st, addr, 3, pmd_val(pmd)); +} + +void note_page_pud(struct ptdump_state *pt_st, unsigned long addr, pud_t pud) +{ + note_page(pt_st, addr, 2, pud_val(pud)); +} + +void note_page_p4d(struct ptdump_state *pt_st, unsigned long addr, p4d_t p4d) +{ + note_page(pt_st, addr, 1, p4d_val(p4d)); +} + +void note_page_pgd(struct ptdump_state *pt_st, unsigned long addr, pgd_t pgd) +{ + note_page(pt_st, addr, 0, pgd_val(pgd)); +} + +void note_page_flush(struct ptdump_state *pt_st) +{ + pte_t pte_zero = {0}; + + note_page(pt_st, 0, -1, pte_val(pte_zero)); +} + void ptdump_walk(struct seq_file *s, struct ptdump_info *info) { unsigned long end = ~0UL; @@ -266,7 +298,12 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info) .pg_level = &kernel_pg_levels[0], .level = -1, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]){ {info->base_addr, end}, {0, 0} @@ -303,7 +340,12 @@ bool ptdump_check_wx(void) .level = -1, .check_wx = true, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]) { {_PAGE_OFFSET(vabits_actual), ~0UL}, {0, 0} diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 9dc239967b77..b2358d794855 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -298,6 +298,38 @@ static void populate_markers(void) #endif } +static void note_page_pte(struct ptdump_state *pt_st, unsigned long addr, pte_t pte) +{ + note_page(pt_st, addr, 4, pte_val(pte)); +} + +static void note_page_pmd(struct ptdump_state *pt_st, unsigned long addr, pmd_t pmd) +{ + note_page(pt_st, addr, 3, pmd_val(pmd)); +} + +static void note_page_pud(struct ptdump_state *pt_st, unsigned long addr, pud_t pud) +{ + note_page(pt_st, addr, 2, pud_val(pud)); +} + +static void note_page_p4d(struct ptdump_state *pt_st, unsigned long addr, p4d_t p4d) +{ + note_page(pt_st, addr, 1, p4d_val(p4d)); +} + +static void note_page_pgd(struct ptdump_state *pt_st, unsigned long addr, pgd_t pgd) +{ + note_page(pt_st, addr, 0, pgd_val(pgd)); +} + +static void note_page_flush(struct ptdump_state *pt_st) +{ + pte_t pte_zero = {0}; + + note_page(pt_st, 0, -1, pte_val(pte_zero)); +} + static int ptdump_show(struct seq_file *m, void *v) { struct pg_state st = { @@ -305,7 +337,12 @@ static int ptdump_show(struct seq_file *m, void *v) .marker = address_markers, .level = -1, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = ptdump_range, } }; @@ -338,7 +375,12 @@ bool ptdump_check_wx(void) .level = -1, .check_wx = true, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = ptdump_range, } }; diff --git a/arch/riscv/mm/ptdump.c 
b/arch/riscv/mm/ptdump.c index 9d5f657a251b..32922550a50a 100644 --- a/arch/riscv/mm/ptdump.c +++ b/arch/riscv/mm/ptdump.c @@ -318,6 +318,38 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, } } +static void note_page_pte(struct ptdump_state *pt_st, unsigned long addr, pte_t pte) +{ + note_page(pt_st, addr, 4, pte_val(pte)); +} + +static void note_page_pmd(struct ptdump_state *pt_st, unsigned long addr, pmd_t pmd) +{ + note_page(pt_st, addr, 3, pmd_val(pmd)); +} + +static void note_page_pud(struct ptdump_state *pt_st, unsigned long addr, pud_t pud) +{ + note_page(pt_st, addr, 2, pud_val(pud)); +} + +static void note_page_p4d(struct ptdump_state *pt_st, unsigned long addr, p4d_t p4d) +{ + note_page(pt_st, addr, 1, p4d_val(p4d)); +} + +static void note_page_pgd(struct ptdump_state *pt_st, unsigned long addr, pgd_t pgd) +{ + note_page(pt_st, addr, 0, pgd_val(pgd)); +} + +static void note_page_flush(struct ptdump_state *pt_st) +{ + pte_t pte_zero = {0}; + + note_page(pt_st, 0, -1, pte_val(pte_zero)); +} + static void ptdump_walk(struct seq_file *s, struct ptd_mm_info *pinfo) { struct pg_state st = { @@ -325,7 +357,12 @@ static void ptdump_walk(struct seq_file *s, struct ptd_mm_info *pinfo) .marker = pinfo->markers, .level = -1, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]) { {pinfo->base_addr, pinfo->end}, {0, 0} @@ -347,7 +384,12 @@ bool ptdump_check_wx(void) .level = -1, .check_wx = true, .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]) { {KERN_VIRT_START, ULONG_MAX}, {0, 0} diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c index d3e943752fa0..ac604b176660 100644 --- a/arch/s390/mm/dump_pagetables.c +++ b/arch/s390/mm/dump_pagetables.c @@ -147,11 +147,48 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, } } +static void note_page_pte(struct ptdump_state *pt_st, unsigned long addr, pte_t pte) +{ + note_page(pt_st, addr, 4, pte_val(pte)); +} + +static void note_page_pmd(struct ptdump_state *pt_st, unsigned long addr, pmd_t pmd) +{ + note_page(pt_st, addr, 3, pmd_val(pmd)); +} + +static void note_page_pud(struct ptdump_state *pt_st, unsigned long addr, pud_t pud) +{ + note_page(pt_st, addr, 2, pud_val(pud)); +} + +static void note_page_p4d(struct ptdump_state *pt_st, unsigned long addr, p4d_t p4d) +{ + note_page(pt_st, addr, 1, p4d_val(p4d)); +} + +static void note_page_pgd(struct ptdump_state *pt_st, unsigned long addr, pgd_t pgd) +{ + note_page(pt_st, addr, 0, pgd_val(pgd)); +} + +static void note_page_flush(struct ptdump_state *pt_st) +{ + pte_t pte_zero = {0}; + + note_page(pt_st, 0, -1, pte_val(pte_zero)); +} + bool ptdump_check_wx(void) { struct pg_state st = { .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]) { {.start = 0, .end = max_addr}, {.start = 0, .end = 0}, @@ -190,7 +227,12 @@ static int ptdump_show(struct seq_file 
*m, void *v) { struct pg_state st = { .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .range = (struct ptdump_range[]) { {.start = 0, .end = max_addr}, {.start = 0, .end = 0}, diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index 89079ea73e65..2e1c2d006ace 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -362,6 +362,38 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, } } +static void note_page_pte(struct ptdump_state *pt_st, unsigned long addr, pte_t pte) +{ + note_page(pt_st, addr, 4, pte_val(pte)); +} + +static void note_page_pmd(struct ptdump_state *pt_st, unsigned long addr, pmd_t pmd) +{ + note_page(pt_st, addr, 3, pmd_val(pmd)); +} + +static void note_page_pud(struct ptdump_state *pt_st, unsigned long addr, pud_t pud) +{ + note_page(pt_st, addr, 2, pud_val(pud)); +} + +static void note_page_p4d(struct ptdump_state *pt_st, unsigned long addr, p4d_t p4d) +{ + note_page(pt_st, addr, 1, p4d_val(p4d)); +} + +static void note_page_pgd(struct ptdump_state *pt_st, unsigned long addr, pgd_t pgd) +{ + note_page(pt_st, addr, 0, pgd_val(pgd)); +} + +static void note_page_flush(struct ptdump_state *pt_st) +{ + pte_t pte_zero = {0}; + + note_page(pt_st, 0, -1, pte_val(pte_zero)); +} + bool ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm, pgd_t *pgd, bool checkwx, bool dmesg) @@ -378,7 +410,12 @@ bool ptdump_walk_pgd_level_core(struct seq_file *m, struct pg_state st = { .ptdump = { - .note_page = note_page, + .note_page_pte = note_page_pte, + .note_page_pmd = note_page_pmd, + .note_page_pud = note_page_pud, + .note_page_p4d = note_page_p4d, + .note_page_pgd = note_page_pgd, + .note_page_flush = note_page_flush, .effective_prot = effective_prot, .range = ptdump_ranges }, diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h index 8dbd51ea8626..1c1eb1fae199 100644 --- a/include/linux/ptdump.h +++ b/include/linux/ptdump.h @@ -11,9 +11,12 @@ struct ptdump_range { }; struct ptdump_state { - /* level is 0:PGD to 4:PTE, or -1 if unknown */ - void (*note_page)(struct ptdump_state *st, unsigned long addr, - int level, u64 val); + void (*note_page_pte)(struct ptdump_state *st, unsigned long addr, pte_t pte); + void (*note_page_pmd)(struct ptdump_state *st, unsigned long addr, pmd_t pmd); + void (*note_page_pud)(struct ptdump_state *st, unsigned long addr, pud_t pud); + void (*note_page_p4d)(struct ptdump_state *st, unsigned long addr, p4d_t p4d); + void (*note_page_pgd)(struct ptdump_state *st, unsigned long addr, pgd_t pgd); + void (*note_page_flush)(struct ptdump_state *st); void (*effective_prot)(struct ptdump_state *st, int level, u64 val); const struct ptdump_range *range; }; diff --git a/mm/ptdump.c b/mm/ptdump.c index 106e1d66e9f9..706cfc19439b 100644 --- a/mm/ptdump.c +++ b/mm/ptdump.c @@ -18,7 +18,7 @@ static inline int note_kasan_page_table(struct mm_walk *walk, { struct ptdump_state *st = walk->private; - st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0])); + st->note_page_pte(st, addr, kasan_early_shadow_pte[0]); walk->action = ACTION_CONTINUE; @@ -42,7 +42,7 @@ static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr, st->effective_prot(st, 0, pgd_val(val)); if (pgd_leaf(val)) { - st->note_page(st, addr, 0, pgd_val(val)); + st->note_page_pgd(st, addr, val); 
walk->action = ACTION_CONTINUE; } @@ -65,7 +65,7 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr, st->effective_prot(st, 1, p4d_val(val)); if (p4d_leaf(val)) { - st->note_page(st, addr, 1, p4d_val(val)); + st->note_page_p4d(st, addr, val); walk->action = ACTION_CONTINUE; } @@ -88,7 +88,7 @@ static int ptdump_pud_entry(pud_t *pud, unsigned long addr, st->effective_prot(st, 2, pud_val(val)); if (pud_leaf(val)) { - st->note_page(st, addr, 2, pud_val(val)); + st->note_page_pud(st, addr, val); walk->action = ACTION_CONTINUE; } @@ -109,7 +109,7 @@ static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr, if (st->effective_prot) st->effective_prot(st, 3, pmd_val(val)); if (pmd_leaf(val)) { - st->note_page(st, addr, 3, pmd_val(val)); + st->note_page_pmd(st, addr, val); walk->action = ACTION_CONTINUE; } @@ -125,7 +125,7 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long addr, if (st->effective_prot) st->effective_prot(st, 4, pte_val(val)); - st->note_page(st, addr, 4, pte_val(val)); + st->note_page_pte(st, addr, val); return 0; } @@ -134,9 +134,31 @@ static int ptdump_hole(unsigned long addr, unsigned long next, int depth, struct mm_walk *walk) { struct ptdump_state *st = walk->private; + pte_t pte_zero = {0}; + pmd_t pmd_zero = {0}; + pud_t pud_zero = {0}; + p4d_t p4d_zero = {0}; + pgd_t pgd_zero = {0}; - st->note_page(st, addr, depth, 0); - + switch (depth) { + case 4: + st->note_page_pte(st, addr, pte_zero); + break; + case 3: + st->note_page_pmd(st, addr, pmd_zero); + break; + case 2: + st->note_page_pud(st, addr, pud_zero); + break; + case 1: + st->note_page_p4d(st, addr, p4d_zero); + break; + case 0: + st->note_page_pgd(st, addr, pgd_zero); + break; + default: + break; + } return 0; } @@ -162,7 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd) mmap_write_unlock(mm); /* Flush out the last page */ - st->note_page(st, 0, -1, 0); + st->note_page_flush(st); } static int check_wx_show(struct seq_file *m, void *v) From 08978fc3b0d5709583f6dc2072a8607d0b527860 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 7 Apr 2025 11:01:12 +0530 Subject: [PATCH 072/285] mm/ptdump: split effective_prot() into level specific callbacks The last argument in effective_prot() is u64, assuming the value returned by pxd_val() (at all page table levels) is 64 bit. pxd_val() is very platform-specific and its type should not be assumed in generic MM. Split effective_prot() into individual page table level specific callbacks that accept the corresponding pxd_t argument instead; the subscribing platform (only x86) then just derives pxd_val() from the entries as required and proceeds as earlier.
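As with note_page(), the replacement callbacks on x86 - the only platform subscribing to effective_prot() - are thin per-level wrappers, e.g.:

	static void effective_prot_pte(struct ptdump_state *st, pte_t pte)
	{
		effective_prot(st, 4, pte_val(pte));
	}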
Link: https://lkml.kernel.org/r/20250407053113.746295-3-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Dave Hansen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Nicholas Piggin Cc: Ryan Roberts Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/dump_pagetables.c | 32 +++++++++++++++++++++++++++++++- include/linux/ptdump.h | 6 +++++- mm/ptdump.c | 20 ++++++++++---------- 3 files changed, 46 insertions(+), 12 deletions(-) diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index 2e1c2d006ace..a4700ef6eb64 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -266,6 +266,32 @@ static void effective_prot(struct ptdump_state *pt_st, int level, u64 val) st->prot_levels[level] = effective; } +static void effective_prot_pte(struct ptdump_state *st, pte_t pte) +{ + effective_prot(st, 4, pte_val(pte)); +} + +static void effective_prot_pmd(struct ptdump_state *st, pmd_t pmd) +{ + effective_prot(st, 3, pmd_val(pmd)); +} + +static void effective_prot_pud(struct ptdump_state *st, pud_t pud) +{ + effective_prot(st, 2, pud_val(pud)); +} + +static void effective_prot_p4d(struct ptdump_state *st, p4d_t p4d) +{ + effective_prot(st, 1, p4d_val(p4d)); +} + +static void effective_prot_pgd(struct ptdump_state *st, pgd_t pgd) +{ + effective_prot(st, 0, pgd_val(pgd)); +} + + /* * This function gets called on a break in a continuous series * of PTE entries; the next one is different so we need to @@ -416,7 +442,11 @@ bool ptdump_walk_pgd_level_core(struct seq_file *m, .note_page_p4d = note_page_p4d, .note_page_pgd = note_page_pgd, .note_page_flush = note_page_flush, - .effective_prot = effective_prot, + .effective_prot_pte = effective_prot_pte, + .effective_prot_pmd = effective_prot_pmd, + .effective_prot_pud = effective_prot_pud, + .effective_prot_p4d = effective_prot_p4d, + .effective_prot_pgd = effective_prot_pgd, .range = ptdump_ranges }, .level = -1, diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h index 1c1eb1fae199..240bd3bff18d 100644 --- a/include/linux/ptdump.h +++ b/include/linux/ptdump.h @@ -17,7 +17,11 @@ struct ptdump_state { void (*note_page_p4d)(struct ptdump_state *st, unsigned long addr, p4d_t p4d); void (*note_page_pgd)(struct ptdump_state *st, unsigned long addr, pgd_t pgd); void (*note_page_flush)(struct ptdump_state *st); - void (*effective_prot)(struct ptdump_state *st, int level, u64 val); + void (*effective_prot_pte)(struct ptdump_state *st, pte_t pte); + void (*effective_prot_pmd)(struct ptdump_state *st, pmd_t pmd); + void (*effective_prot_pud)(struct ptdump_state *st, pud_t pud); + void (*effective_prot_p4d)(struct ptdump_state *st, p4d_t p4d); + void (*effective_prot_pgd)(struct ptdump_state *st, pgd_t pgd); const struct ptdump_range *range; }; diff --git a/mm/ptdump.c b/mm/ptdump.c index 706cfc19439b..9374f29cdc6f 100644 --- a/mm/ptdump.c +++ b/mm/ptdump.c @@ -38,8 +38,8 @@ static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr, return note_kasan_page_table(walk, addr); #endif - if (st->effective_prot) - st->effective_prot(st, 0, pgd_val(val)); + if (st->effective_prot_pgd) + st->effective_prot_pgd(st, val); if (pgd_leaf(val)) { st->note_page_pgd(st, addr, val); @@ -61,8 +61,8 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr, return note_kasan_page_table(walk, addr); #endif - if (st->effective_prot) - 
st->effective_prot(st, 1, p4d_val(val)); + if (st->effective_prot_p4d) + st->effective_prot_p4d(st, val); if (p4d_leaf(val)) { st->note_page_p4d(st, addr, val); @@ -84,8 +84,8 @@ static int ptdump_pud_entry(pud_t *pud, unsigned long addr, return note_kasan_page_table(walk, addr); #endif - if (st->effective_prot) - st->effective_prot(st, 2, pud_val(val)); + if (st->effective_prot_pud) + st->effective_prot_pud(st, val); if (pud_leaf(val)) { st->note_page_pud(st, addr, val); @@ -106,8 +106,8 @@ static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr, return note_kasan_page_table(walk, addr); #endif - if (st->effective_prot) - st->effective_prot(st, 3, pmd_val(val)); + if (st->effective_prot_pmd) + st->effective_prot_pmd(st, val); if (pmd_leaf(val)) { st->note_page_pmd(st, addr, val); walk->action = ACTION_CONTINUE; @@ -122,8 +122,8 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long addr, struct ptdump_state *st = walk->private; pte_t val = ptep_get_lockless(pte); - if (st->effective_prot) - st->effective_prot(st, 4, pte_val(val)); + if (st->effective_prot_pte) + st->effective_prot_pte(st, val); st->note_page_pte(st, addr, val); From dbb9c166a08c7a237c79b969a19ee02baa847af1 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 7 Apr 2025 11:01:13 +0530 Subject: [PATCH 073/285] arm64/mm: define ptdesc_t Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. Link: https://lkml.kernel.org/r/20250407053113.746295-4-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Suggested-by: Ryan Roberts Cc: Catalin Marinas Cc: Will Deacon Cc: Ard Biesheuvel Cc: Mark Rutland Cc: Dave Hansen Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Madhavan Srinivasan Cc: Nicholas Piggin Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/arm64/include/asm/pgtable-types.h | 20 ++++++++++++++------ arch/arm64/include/asm/ptdump.h | 8 ++++---- arch/arm64/kernel/efi.c | 4 ++-- arch/arm64/kernel/pi/map_kernel.c | 2 +- arch/arm64/kernel/pi/map_range.c | 4 ++-- arch/arm64/kernel/pi/pi.h | 2 +- arch/arm64/mm/mmap.c | 2 +- arch/arm64/mm/ptdump.c | 2 +- 8 files changed, 26 insertions(+), 18 deletions(-) diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h index 6d6d4065b0cb..265e8301d7ba 100644 --- a/arch/arm64/include/asm/pgtable-types.h +++ b/arch/arm64/include/asm/pgtable-types.h @@ -11,11 +11,19 @@ #include -typedef u64 pteval_t; -typedef u64 pmdval_t; -typedef u64 pudval_t; -typedef u64 p4dval_t; -typedef u64 pgdval_t; +/* + * Page Table Descriptor + * + * Generic page table descriptor format from which + * all level specific descriptors can be derived. + */ +typedef u64 ptdesc_t; + +typedef ptdesc_t pteval_t; +typedef ptdesc_t pmdval_t; +typedef ptdesc_t pudval_t; +typedef ptdesc_t p4dval_t; +typedef ptdesc_t pgdval_t; /* * These are used to make use of C type-checking.. 
@@ -46,7 +54,7 @@ typedef struct { pgdval_t pgd; } pgd_t; #define pgd_val(x) ((x).pgd) #define __pgd(x) ((pgd_t) { (x) } ) -typedef struct { pteval_t pgprot; } pgprot_t; +typedef struct { ptdesc_t pgprot; } pgprot_t; #define pgprot_val(x) ((x).pgprot) #define __pgprot(x) ((pgprot_t) { (x) } ) diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h index 01033c1d38dc..fded5358641f 100644 --- a/arch/arm64/include/asm/ptdump.h +++ b/arch/arm64/include/asm/ptdump.h @@ -24,8 +24,8 @@ struct ptdump_info { }; struct ptdump_prot_bits { - u64 mask; - u64 val; + ptdesc_t mask; + ptdesc_t val; const char *set; const char *clear; }; @@ -34,7 +34,7 @@ struct ptdump_pg_level { const struct ptdump_prot_bits *bits; char name[4]; int num; - u64 mask; + ptdesc_t mask; }; /* @@ -51,7 +51,7 @@ struct ptdump_pg_state { const struct mm_struct *mm; unsigned long start_address; int level; - u64 current_prot; + ptdesc_t current_prot; bool check_wx; unsigned long wx_pages; unsigned long uxn_pages; diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c index 1d25d8899dbf..42e281c07c2f 100644 --- a/arch/arm64/kernel/efi.c +++ b/arch/arm64/kernel/efi.c @@ -29,7 +29,7 @@ static bool region_is_misaligned(const efi_memory_desc_t *md) * executable, everything else can be mapped with the XN bits * set. Also take the new (optional) RO/XP bits into account. */ -static __init pteval_t create_mapping_protection(efi_memory_desc_t *md) +static __init ptdesc_t create_mapping_protection(efi_memory_desc_t *md) { u64 attr = md->attribute; u32 type = md->type; @@ -83,7 +83,7 @@ static __init pteval_t create_mapping_protection(efi_memory_desc_t *md) int __init efi_create_mapping(struct mm_struct *mm, efi_memory_desc_t *md) { - pteval_t prot_val = create_mapping_protection(md); + ptdesc_t prot_val = create_mapping_protection(md); bool page_mappings_only = (md->type == EFI_RUNTIME_SERVICES_CODE || md->type == EFI_RUNTIME_SERVICES_DATA); diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c index c6650cfe706c..0f4bd7771859 100644 --- a/arch/arm64/kernel/pi/map_kernel.c +++ b/arch/arm64/kernel/pi/map_kernel.c @@ -159,7 +159,7 @@ static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(u64 ttbr) static void __init remap_idmap_for_lpa2(void) { /* clear the bits that change meaning once LPA2 is turned on */ - pteval_t mask = PTE_SHARED; + ptdesc_t mask = PTE_SHARED; /* * We have to clear bits [9:8] in all block or page descriptors in the diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c index 81345f68f9fc..7982788e7b9a 100644 --- a/arch/arm64/kernel/pi/map_range.c +++ b/arch/arm64/kernel/pi/map_range.c @@ -30,7 +30,7 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot, int level, pte_t *tbl, bool may_use_cont, u64 va_offset) { u64 cmask = (level == 3) ? 
CONT_PTE_SIZE - 1 : U64_MAX; - pteval_t protval = pgprot_val(prot) & ~PTE_TYPE_MASK; + ptdesc_t protval = pgprot_val(prot) & ~PTE_TYPE_MASK; int lshift = (3 - level) * PTDESC_TABLE_SHIFT; u64 lmask = (PAGE_SIZE << lshift) - 1; @@ -87,7 +87,7 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot, } } -asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pteval_t clrmask) +asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, ptdesc_t clrmask) { u64 ptep = (u64)pg_dir + PAGE_SIZE; pgprot_t text_prot = PAGE_KERNEL_ROX; diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h index c91e5e965cd3..91dcb5b6bbd1 100644 --- a/arch/arm64/kernel/pi/pi.h +++ b/arch/arm64/kernel/pi/pi.h @@ -33,4 +33,4 @@ void map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot, asmlinkage void early_map_kernel(u64 boot_status, void *fdt); -asmlinkage u64 create_init_idmap(pgd_t *pgd, pteval_t clrmask); +asmlinkage u64 create_init_idmap(pgd_t *pgd, ptdesc_t clrmask); diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c index 07aeab8a7606..c86c348857c4 100644 --- a/arch/arm64/mm/mmap.c +++ b/arch/arm64/mm/mmap.c @@ -83,7 +83,7 @@ arch_initcall(adjust_protection_map); pgprot_t vm_get_page_prot(unsigned long vm_flags) { - pteval_t prot; + ptdesc_t prot; /* Short circuit GCS to avoid bloating the table. */ if (system_supports_gcs() && (vm_flags & VM_SHADOW_STACK)) { diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c index ac0c20ba0cd9..421a5de806c6 100644 --- a/arch/arm64/mm/ptdump.c +++ b/arch/arm64/mm/ptdump.c @@ -194,7 +194,7 @@ void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, struct ptdump_pg_state *st = container_of(pt_st, struct ptdump_pg_state, ptdump); struct ptdump_pg_level *pg_level = st->pg_level; static const char units[] = "KMGTPE"; - u64 prot = 0; + ptdesc_t prot = 0; /* check if the current level has been folded dynamically */ if (st->mm && ((level == 1 && mm_p4d_folded(st->mm)) || From 4c97a17a252bf8396f7bd65efced00bf401a8c25 Mon Sep 17 00:00:00 2001 From: Przemek Kitszel Date: Thu, 20 Mar 2025 11:22:19 +0100 Subject: [PATCH 074/285] xarray: make xa_alloc_cyclic() return 0 on all success cases Change xa_alloc_cyclic() to return 0 even on wrap-around. Do the same for xa_alloc_cyclic_irq() and xa_alloc_cyclic_bh(). This will prevent any future bug of treating a return of 1 as an error:

	int ret = xa_alloc_cyclic(...)
	if (ret) // currently mishandles ret==1
		goto failure;

If anyone is later interested in when wrap-around occurs, there is still __xa_alloc_cyclic(), which behaves as before. For now there is no such user. Link: https://lkml.kernel.org/r/20250320102219.8101-1-przemyslaw.kitszel@intel.com Signed-off-by: Przemek Kitszel Suggested-by: Matthew Wilcox Link: https://lore.kernel.org/netdev/Z9gUd-5t8b5NX2wE@casper.infradead.org Cc: Andriy Shevchenko Cc: Dave Hansen Cc: Michal Swiatkowski Cc: Przemek Kitszel Signed-off-by: Andrew Morton --- include/linux/xarray.h | 24 +++++++++++++++--------- lib/test_xarray.c | 17 +++++++++++++++-- 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 78eede109b1a..be850174e802 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -965,10 +965,12 @@ static inline int __must_check xa_alloc_irq(struct xarray *xa, u32 *id, * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags().
* + * Note that callers interested in whether wrapping has occurred should + * use __xa_alloc_cyclic() instead. + * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. - * Return: 0 if the allocation succeeded without wrapping. 1 if the - * allocation succeeded after wrapping, -ENOMEM if memory could not be + * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry, @@ -981,7 +983,7 @@ static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry, err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock(xa); - return err; + return err < 0 ? err : 0; } /** @@ -1002,10 +1004,12 @@ static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry, * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * + * Note that callers interested in whether wrapping has occurred should + * use __xa_alloc_cyclic() instead. + * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. - * Return: 0 if the allocation succeeded without wrapping. 1 if the - * allocation succeeded after wrapping, -ENOMEM if memory could not be + * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry, @@ -1018,7 +1022,7 @@ static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry, err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_bh(xa); - return err; + return err < 0 ? err : 0; } /** @@ -1039,10 +1043,12 @@ static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry, * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * + * Note that callers interested in whether wrapping has occurred should + * use __xa_alloc_cyclic() instead. + * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. - * Return: 0 if the allocation succeeded without wrapping. 1 if the - * allocation succeeded after wrapping, -ENOMEM if memory could not be + * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_irq(struct xarray *xa, u32 *id, void *entry, @@ -1055,7 +1061,7 @@ static inline int xa_alloc_cyclic_irq(struct xarray *xa, u32 *id, void *entry, err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_irq(xa); - return err; + return err < 0 ? 
err : 0; } /** diff --git a/lib/test_xarray.c b/lib/test_xarray.c index 080a39d22e73..5ca0aefee9aa 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -1040,6 +1040,7 @@ static noinline void check_xa_alloc_3(struct xarray *xa, unsigned int base) unsigned int i, id; unsigned long index; void *entry; + int ret; XA_BUG_ON(xa, xa_alloc_cyclic(xa, &id, xa_mk_index(1), limit, &next, GFP_KERNEL) != 0); @@ -1059,7 +1060,7 @@ static noinline void check_xa_alloc_3(struct xarray *xa, unsigned int base) else entry = xa_mk_index(i - 0x3fff); XA_BUG_ON(xa, xa_alloc_cyclic(xa, &id, entry, limit, - &next, GFP_KERNEL) != (id == 1)); + &next, GFP_KERNEL) != 0); XA_BUG_ON(xa, xa_mk_index(id) != entry); } @@ -1072,7 +1073,7 @@ static noinline void check_xa_alloc_3(struct xarray *xa, unsigned int base) xa_limit_32b, &next, GFP_KERNEL) != 0); XA_BUG_ON(xa, id != UINT_MAX); XA_BUG_ON(xa, xa_alloc_cyclic(xa, &id, xa_mk_index(base), - xa_limit_32b, &next, GFP_KERNEL) != 1); + xa_limit_32b, &next, GFP_KERNEL) != 0); XA_BUG_ON(xa, id != base); XA_BUG_ON(xa, xa_alloc_cyclic(xa, &id, xa_mk_index(base + 1), xa_limit_32b, &next, GFP_KERNEL) != 0); @@ -1080,7 +1081,19 @@ static noinline void check_xa_alloc_3(struct xarray *xa, unsigned int base) xa_for_each(xa, index, entry) xa_erase_index(xa, index); + XA_BUG_ON(xa, !xa_empty(xa)); + /* check wrap-around return of __xa_alloc_cyclic() */ + next = UINT_MAX; + XA_BUG_ON(xa, xa_alloc_cyclic(xa, &id, xa_mk_index(UINT_MAX), + xa_limit_32b, &next, GFP_KERNEL) != 0); + xa_lock(xa); + ret = __xa_alloc_cyclic(xa, &id, xa_mk_index(base), xa_limit_32b, + &next, GFP_KERNEL); + xa_unlock(xa); + XA_BUG_ON(xa, ret != 1); + xa_for_each(xa, index, entry) + xa_erase_index(xa, index); XA_BUG_ON(xa, !xa_empty(xa)); } From a40b3fa844b4c29276a228ce878bd427a3149d37 Mon Sep 17 00:00:00 2001 From: Liu Ye Date: Tue, 18 Mar 2025 14:32:26 +0800 Subject: [PATCH 075/285] fs/proc/page: refactor to reduce code duplication kpageflags_read() and kpagecgroup_read() are quite similar to kpagecount_read(). Refactor common code into a helper function to reduce code duplication. Link: https://lkml.kernel.org/r/20250318063226.223284-1-liuyerd@163.com Signed-off-by: Liu Ye Acked-by: David Hildenbrand Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Muchun Song Cc: Ran Xiaokai Cc: Roman Gushchin Cc: Shakeel Butt Cc: Svetly Todorov Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- fs/proc/page.c | 159 +++++++++++++------------------------ include/linux/memcontrol.h | 4 + 2 files changed, 57 insertions(+), 106 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 23fc771100ae..999af26c7298 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -22,6 +22,12 @@ #define KPMMASK (KPMSIZE - 1) #define KPMBITS (KPMSIZE * BITS_PER_BYTE) +enum kpage_operation { + KPAGE_FLAGS, + KPAGE_COUNT, + KPAGE_CGROUP, +}; + static inline unsigned long get_max_dump_pfn(void) { #ifdef CONFIG_SPARSEMEM @@ -37,19 +43,17 @@ static inline unsigned long get_max_dump_pfn(void) #endif } -/* /proc/kpagecount - an array exposing page mapcounts - * - * Each entry is a u64 representing the corresponding - * physical page mapcount. 
- */ -static ssize_t kpagecount_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) +static ssize_t kpage_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos, + enum kpage_operation op) { const unsigned long max_dump_pfn = get_max_dump_pfn(); u64 __user *out = (u64 __user *)buf; + struct page *page; unsigned long src = *ppos; unsigned long pfn; ssize_t ret = 0; + u64 info; pfn = src / KPMSIZE; if (src & KPMMASK || count & KPMMASK) @@ -59,24 +63,34 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src); while (count > 0) { - struct page *page; - u64 mapcount = 0; - /* * TODO: ZONE_DEVICE support requires to identify * memmaps that were actually initialized. */ page = pfn_to_online_page(pfn); + if (page) { - struct folio *folio = page_folio(page); + switch (op) { + case KPAGE_FLAGS: + info = stable_page_flags(page); + break; + case KPAGE_COUNT: + if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) + info = folio_precise_page_mapcount(page_folio(page), page); + else + info = folio_average_page_mapcount(page_folio(page)); + break; + case KPAGE_CGROUP: + info = page_cgroup_ino(page); + break; + default: + info = 0; + break; + } + } else + info = 0; - if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) - mapcount = folio_precise_page_mapcount(folio, page); - else - mapcount = folio_average_page_mapcount(folio); - } - - if (put_user(mapcount, out)) { + if (put_user(info, out)) { ret = -EFAULT; break; } @@ -94,17 +108,23 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, return ret; } +/* /proc/kpagecount - an array exposing page mapcounts + * + * Each entry is a u64 representing the corresponding + * physical page mapcount. + */ +static ssize_t kpagecount_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + return kpage_read(file, buf, count, ppos, KPAGE_COUNT); +} + static const struct proc_ops kpagecount_proc_ops = { .proc_flags = PROC_ENTRY_PERMANENT, .proc_lseek = mem_lseek, .proc_read = kpagecount_read, }; -/* /proc/kpageflags - an array exposing page flags - * - * Each entry is a u64 representing the corresponding - * physical page flags. - */ static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit) { @@ -225,47 +245,17 @@ u64 stable_page_flags(const struct page *page) #endif return u; -}; +} +/* /proc/kpageflags - an array exposing page flags + * + * Each entry is a u64 representing the corresponding + * physical page flags. + */ static ssize_t kpageflags_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) + size_t count, loff_t *ppos) { - const unsigned long max_dump_pfn = get_max_dump_pfn(); - u64 __user *out = (u64 __user *)buf; - unsigned long src = *ppos; - unsigned long pfn; - ssize_t ret = 0; - - pfn = src / KPMSIZE; - if (src & KPMMASK || count & KPMMASK) - return -EINVAL; - if (src >= max_dump_pfn * KPMSIZE) - return 0; - count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src); - - while (count > 0) { - /* - * TODO: ZONE_DEVICE support requires to identify - * memmaps that were actually initialized. 
- */ - struct page *page = pfn_to_online_page(pfn); - - if (put_user(stable_page_flags(page), out)) { - ret = -EFAULT; - break; - } - - pfn++; - out++; - count -= KPMSIZE; - - cond_resched(); - } - - *ppos += (char __user *)out - buf; - if (!ret) - ret = (char __user *)out - buf; - return ret; + return kpage_read(file, buf, count, ppos, KPAGE_FLAGS); } static const struct proc_ops kpageflags_proc_ops = { @@ -276,53 +266,10 @@ static const struct proc_ops kpageflags_proc_ops = { #ifdef CONFIG_MEMCG static ssize_t kpagecgroup_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) + size_t count, loff_t *ppos) { - const unsigned long max_dump_pfn = get_max_dump_pfn(); - u64 __user *out = (u64 __user *)buf; - struct page *ppage; - unsigned long src = *ppos; - unsigned long pfn; - ssize_t ret = 0; - u64 ino; - - pfn = src / KPMSIZE; - if (src & KPMMASK || count & KPMMASK) - return -EINVAL; - if (src >= max_dump_pfn * KPMSIZE) - return 0; - count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src); - - while (count > 0) { - /* - * TODO: ZONE_DEVICE support requires to identify - * memmaps that were actually initialized. - */ - ppage = pfn_to_online_page(pfn); - - if (ppage) - ino = page_cgroup_ino(ppage); - else - ino = 0; - - if (put_user(ino, out)) { - ret = -EFAULT; - break; - } - - pfn++; - out++; - count -= KPMSIZE; - - cond_resched(); - } - - *ppos += (char __user *)out - buf; - if (!ret) - ret = (char __user *)out - buf; - return ret; + return kpage_read(file, buf, count, ppos, KPAGE_CGROUP); } - static const struct proc_ops kpagecgroup_proc_ops = { .proc_flags = PROC_ENTRY_PERMANENT, .proc_lseek = mem_lseek, diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 53364526d877..5264d148bdd9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1793,6 +1793,10 @@ static inline void count_objcg_events(struct obj_cgroup *objcg, { } +static inline ino_t page_cgroup_ino(struct page *page) +{ + return 0; +} #endif /* CONFIG_MEMCG */ #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP) From 4318255091eab6722bea3816bdee09ac7db68035 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Tue, 8 Apr 2025 17:15:47 +0200 Subject: [PATCH 076/285] vmalloc: add for_each_vmap_node() helper To simplify iteration over vmap-nodes, add the for_each_vmap_node() macro that iterates over all nodes in the system. It tends to simplify the code.
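Callers can then drop their explicit index variables and iterate directly, as the follow-up conversion below does, e.g.:

	struct vmap_node *vn;

	for_each_vmap_node(vn) {
		spin_lock(&vn->busy.lock);
		/* ... walk vn->busy.head ... */
		spin_unlock(&vn->busy.lock);
	}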
Link: https://lkml.kernel.org/r/20250408151549.77937-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Cc: Christoph Hellwig Signed-off-by: Andrew Morton --- mm/vmalloc.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 44b735a4573a..55949360a5cb 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -900,6 +900,11 @@ static struct vmap_node *vmap_nodes = &single; static __read_mostly unsigned int nr_vmap_nodes = 1; static __read_mostly unsigned int vmap_zone_size = 1; +/* A simple iterator over all vmap-nodes. */ +#define for_each_vmap_node(vn) \ + for ((vn) = &vmap_nodes[0]; \ + (vn) < &vmap_nodes[nr_vmap_nodes]; (vn)++) + static inline unsigned int addr_to_node_id(unsigned long addr) { From ce906d7679e1f1feab54bdddac4d39df846d663a Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Tue, 8 Apr 2025 17:15:48 +0200 Subject: [PATCH 077/285] vmalloc: switch to for_each_vmap_node() helper There are places that can easily be updated to use the helper to iterate over all vmap-nodes. This is what this patch does. The aim is to improve readability and simplify the code. [akpm@linux-foundation.org: fix build warning] Link: https://lkml.kernel.org/r/20250408151549.77937-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Cc: Christoph Hellwig Signed-off-by: Andrew Morton --- mm/vmalloc.c | 40 +++++++++++++++------------------------- 1 file changed, 15 insertions(+), 25 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 55949360a5cb..e80783fa4d2c 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1061,12 +1061,11 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va) { unsigned long va_start_lowest; struct vmap_node *vn; - int i; repeat: - for (i = 0, va_start_lowest = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; + va_start_lowest = 0; + for_each_vmap_node(vn) { spin_lock(&vn->busy.lock); *va = __find_vmap_area_exceed_addr(addr, &vn->busy.root); @@ -4963,11 +4962,8 @@ static void show_purge_info(struct seq_file *m) { struct vmap_node *vn; struct vmap_area *va; - int i; - - for (i = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; + for_each_vmap_node(vn) { spin_lock(&vn->lazy.lock); list_for_each_entry(va, &vn->lazy.head, list) { seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", @@ -4983,11 +4979,8 @@ static int vmalloc_info_show(struct seq_file *m, void *p) struct vmap_node *vn; struct vmap_area *va; struct vm_struct *v; - int i; - - for (i = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; + for_each_vmap_node(vn) { spin_lock(&vn->busy.lock); list_for_each_entry(va, &vn->busy.head, list) { if (!va->vm) { @@ -5108,7 +5101,7 @@ static void __init vmap_init_free_space(void) static void vmap_init_nodes(void) { struct vmap_node *vn; - int i, n; + int i; #if BITS_PER_LONG == 64 /* @@ -5125,7 +5118,7 @@ static void vmap_init_nodes(void) * set of cores. Therefore a per-domain purging is supposed to * be added as well as a per-domain balancing. */ - n = clamp_t(unsigned int, num_possible_cpus(), 1, 128); + int n = clamp_t(unsigned int, num_possible_cpus(), 1, 128); if (n > 1) { vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN); @@ -5140,8 +5133,7 @@ static void vmap_init_nodes(void) } #endif - for (n = 0; n < nr_vmap_nodes; n++) { - vn = &vmap_nodes[n]; + for_each_vmap_node(vn) { vn->busy.root = RB_ROOT; INIT_LIST_HEAD(&vn->busy.head); spin_lock_init(&vn->busy.lock); @@ -5162,15 +5154,13 @@ static void vmap_init_nodes(void) static unsigned long vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { - unsigned long count; + unsigned long count = 0; struct vmap_node *vn; - int i, j; + int i; - for (count = 0, i = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; - - for (j = 0; j < MAX_VA_SIZE_PAGES; j++) - count += READ_ONCE(vn->pool[j].len); + for_each_vmap_node(vn) { + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) + count += READ_ONCE(vn->pool[i].len); } return count ?
count : SHRINK_EMPTY; @@ -5179,10 +5169,10 @@ vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc) static unsigned long vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) { - int i; + struct vmap_node *vn; - for (i = 0; i < nr_vmap_nodes; i++) - decay_va_pool_node(&vmap_nodes[i], true); + for_each_vmap_node(vn) + decay_va_pool_node(vn, true); return SHRINK_STOP; } From 24c76f37ab3fb2f3a202eaa678334b216f8bd161 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Tue, 8 Apr 2025 17:15:49 +0200 Subject: [PATCH 078/285] vmalloc: use for_each_vmap_node() in purge-vmap-area Update __purge_vmap_area_lazy() to use the introduced helper; this is the last such place in the vmalloc code. This patch also introduces an extra function, node_to_id(), that converts a vmap_node pointer to an index in the array; __purge_vmap_area_lazy() requires that extra function. Link: https://lkml.kernel.org/r/20250408151549.77937-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Cc: Christoph Hellwig Signed-off-by: Andrew Morton --- mm/vmalloc.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e80783fa4d2c..9eb3880887b6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -923,6 +923,19 @@ id_to_node(unsigned int id) return &vmap_nodes[id % nr_vmap_nodes]; } +static inline unsigned int +node_to_id(struct vmap_node *node) +{ + /* Pointer arithmetic. */ + unsigned int id = node - vmap_nodes; + + if (likely(id < nr_vmap_nodes)) + return id; + + WARN_ONCE(1, "An address 0x%p is out-of-bounds.\n", node); + return 0; +} + /* * We use the value 0 to represent "no node", that is why * an encoded value will be the node-id incremented by 1. @@ -2259,9 +2272,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, */ purge_nodes = CPU_MASK_NONE; - for (i = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; - + for_each_vmap_node(vn) { INIT_LIST_HEAD(&vn->purge_list); vn->skip_populate = full_pool_decay; decay_va_pool_node(vn, full_pool_decay); @@ -2280,7 +2291,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, end = max(end, list_last_entry(&vn->purge_list, struct vmap_area, list)->va_end); - cpumask_set_cpu(i, &purge_nodes); + cpumask_set_cpu(node_to_id(vn), &purge_nodes); } nr_purge_nodes = cpumask_weight(&purge_nodes); From d82d3bf4115217bb3f43f9320bad5d68a35c278f Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:11 +0100 Subject: [PATCH 079/285] mm: pass mm down to pagetable_{pte,pmd}_ctor Patch series "Always call constructor for kernel page tables", v2. There has been much confusion around exactly when page table constructors/destructors (pagetable_*_[cd]tor) are supposed to be called. They were initially introduced for user PTEs only (to support split page table locks), then at the PMD level for the same purpose. Accounting was added later on, starting at the PTE level and then moving to higher levels (PMD, PUD). Finally, with my earlier series "Account page tables at all levels" [1], the ctor/dtor is run for all levels, all the way to PGD. I thought this was the end of the story, and it hopefully is for user pgtables, but I was wrong as far as kernel pgtables are concerned. The current situation there makes very little sense: * At the PTE level, the ctor/dtor is not called (at least in the generic implementation).
Specific helpers are used for kernel pgtables at this level (pte_{alloc,free}_kernel()) and those have never called the ctor/dtor, most likely because they were initially irrelevant in the kernel case. * At all other levels, the ctor/dtor is normally called. This is potentially wasteful at the PMD level (more on that later). This series aims to ensure that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. Besides consistency, the main motivation is to guarantee that ctor/dtor hooks are systematically called; this makes it possible to insert hooks to protect page tables [2], for instance. There is however an extra challenge: split locks are not used for kernel pgtables, and it would therefore be wasteful to initialise them (ptlock_init()). It is worth clarifying exactly when split locks are used. They clearly are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM: 8591/1: mm: use fully constructed struct pages for EFI pgd allocations"), they also are for special page tables like efi_mm. The one case where split locks are definitely unused is pgtables owned by init_mm; this is consistent with the behaviour of apply_to_pte_range(). The approach chosen in this series is therefore to pass the mm associated with the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1), and skip ptlock_init() if mm == &init_mm (patches 3 and 7). This makes it possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without unintended consequences (patch 3). As a result the accounting functions are now called at all levels for kernel pgtables, and split locks are never initialised. In configurations where ptlocks are dynamically allocated (32-bit, PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this series results in the removal of a kmem_cache allocation for every kernel PMD. Additionally, for certain architectures that do not use <asm-generic/pgalloc.h>, such as s390, the same optimisation occurs at the PTE level. === Things get more complicated when it comes to special pgtable allocators (patches 8-12). All architectures need such allocators to create initial kernel pgtables; we are not concerned with those as the ctor cannot be called so early in the boot sequence. However, those allocators may also be used later in the boot sequence or during normal operations. There are two main use-cases: 1. Mapping EFI memory: efi_mm (arm, arm64, riscv) 2. arch_add_memory(): init_mm The ctor is already explicitly run (at the PTE/PMD level) in the first case, as required for pgtables that are not associated with init_mm. However the same allocators may also be used for the second use-case (or others), and this is where it gets messy. Patch 1 calls the ctor with NULL as mm in those situations, as the actual mm isn't available. Practically this means that ptlocks will be unconditionally initialised. This is fine on arm - create_mapping_late() is only used for the EFI mapping. On arm64, __create_pgd_mapping() is also used by arch_add_memory(); patches 8/9/11 ensure that ctors are called at all levels with the appropriate mm. The situation is similar on riscv, but propagating the mm down to the ctor would require significant refactoring. Since the ctors are already called unconditionally, this series leaves riscv no worse off - patch 10 adds comments to clarify the situation. From a cursory look at other architectures implementing arch_add_memory(), s390 and x86 may also need a similar treatment to add constructor calls. This is to be taken care of in a future version or as a follow-up.
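To make the intended ctor shape concrete, here is a minimal sketch of where the series is heading; the actual logic lands in patches 3 and 7, and only the init_mm check is new relative to the current helper visible in the diff below:

	static inline bool pagetable_pte_ctor(struct mm_struct *mm,
					      struct ptdesc *ptdesc)
	{
		/* Kernel (init_mm) pgtables never take split ptlocks. */
		if (mm != &init_mm && !ptlock_init(ptdesc))
			return false;
		/* ... accounting proceeds as before ... */
		return true;
	}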
=== The complications in those special pgtable allocators beg the question: does it really make sense to treat efi_mm and init_mm differently in e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if an mm corresponds to user memory or not, and never use split locks for non-user mm's. Feedback and suggestions welcome! This patch (of 12): In preparation for calling constructors for all kernel page tables while eliding unnecessary ptlock initialisation, let's pass down the associated mm to the PTE/PMD level ctors. (These are the two levels where ptlocks are used.) In most cases the mm is already around at the point of calling the ctor so we simply pass it down. This is however not the case for special page table allocators: * arch/arm/mm/mmu.c * arch/arm64/mm/mmu.c * arch/riscv/mm/init.c In those cases, the page tables being allocated are either for standard kernel memory (init_mm) or special page directories, which may not be associated with any mm. For now let's pass NULL as mm; this will be refined where possible in future patches. No functional change in this patch. Link: https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [1] Link: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ [2] Link: https://lkml.kernel.org/r/20250408095222.860601-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20250408095222.860601-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Reviewed-by: Alexander Gordeev [s390] Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Kevin Brodsky Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Yang Shi Cc: Dave Hansen Signed-off-by: Andrew Morton --- arch/arm/mm/mmu.c | 2 +- arch/arm64/mm/mmu.c | 4 ++-- arch/loongarch/include/asm/pgalloc.h | 2 +- arch/m68k/include/asm/mcf_pgalloc.h | 2 +- arch/m68k/include/asm/motorola_pgalloc.h | 10 +++++----- arch/m68k/mm/motorola.c | 6 +++--- arch/mips/include/asm/pgalloc.h | 2 +- arch/parisc/include/asm/pgalloc.h | 2 +- arch/powerpc/mm/book3s64/pgtable.c | 2 +- arch/powerpc/mm/pgtable-frag.c | 2 +- arch/riscv/mm/init.c | 4 ++-- arch/s390/include/asm/pgalloc.h | 2 +- arch/s390/mm/pgalloc.c | 2 +- arch/sparc/mm/init_64.c | 2 +- arch/sparc/mm/srmmu.c | 2 +- arch/x86/mm/pgtable.c | 2 +- include/asm-generic/pgalloc.h | 4 ++-- include/linux/mm.h | 6 ++++-- 18 files changed, 30 insertions(+), 28 deletions(-) diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index f02f872ea8a9..edb7f56b7c91 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -735,7 +735,7 @@ static void *__init late_alloc(unsigned long sz) void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, get_order(sz)); - if (!ptdesc || !pagetable_pte_ctor(ptdesc)) + if (!ptdesc || !pagetable_pte_ctor(NULL, ptdesc)) BUG(); return ptdesc_to_virt(ptdesc); } diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index ea6695d53fb9..8c5c471cfb06 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -494,9 +494,9 @@ static phys_addr_t pgd_pgtable_alloc(int shift) * folded, and if so pagetable_pte_ctor() becomes nop.
*/ if (shift == PAGE_SHIFT) - BUG_ON(!pagetable_pte_ctor(ptdesc)); + BUG_ON(!pagetable_pte_ctor(NULL, ptdesc)); else if (shift == PMD_SHIFT) - BUG_ON(!pagetable_pmd_ctor(ptdesc)); + BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc)); return pa; } diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h index b58f587f0f0a..1c63a9d9a6d3 100644 --- a/arch/loongarch/include/asm/pgalloc.h +++ b/arch/loongarch/include/asm/pgalloc.h @@ -69,7 +69,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) if (!ptdesc) return NULL; - if (!pagetable_pmd_ctor(ptdesc)) { + if (!pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h index 4c648b51e7fd..465a71101b7d 100644 --- a/arch/m68k/include/asm/mcf_pgalloc.h +++ b/arch/m68k/include/asm/mcf_pgalloc.h @@ -48,7 +48,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm) if (!ptdesc) return NULL; - if (!pagetable_pte_ctor(ptdesc)) { + if (!pagetable_pte_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h index 5abe7da8ac5a..1091fb0affbe 100644 --- a/arch/m68k/include/asm/motorola_pgalloc.h +++ b/arch/m68k/include/asm/motorola_pgalloc.h @@ -15,7 +15,7 @@ enum m68k_table_types { }; extern void init_pointer_table(void *table, int type); -extern void *get_pointer_table(int type); +extern void *get_pointer_table(struct mm_struct *mm, int type); extern int free_pointer_table(void *table, int type); /* @@ -26,7 +26,7 @@ extern int free_pointer_table(void *table, int type); static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) { - return get_pointer_table(TABLE_PTE); + return get_pointer_table(mm, TABLE_PTE); } static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) @@ -36,7 +36,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) static inline pgtable_t pte_alloc_one(struct mm_struct *mm) { - return get_pointer_table(TABLE_PTE); + return get_pointer_table(mm, TABLE_PTE); } static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable) @@ -53,7 +53,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable, static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) { - return get_pointer_table(TABLE_PMD); + return get_pointer_table(mm, TABLE_PMD); } static inline int pmd_free(struct mm_struct *mm, pmd_t *pmd) @@ -75,7 +75,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) static inline pgd_t *pgd_alloc(struct mm_struct *mm) { - return get_pointer_table(TABLE_PGD); + return get_pointer_table(mm, TABLE_PGD); } diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c index 73651e093c4d..6ab3ef39ba7a 100644 --- a/arch/m68k/mm/motorola.c +++ b/arch/m68k/mm/motorola.c @@ -139,7 +139,7 @@ void __init init_pointer_table(void *table, int type) return; } -void *get_pointer_table(int type) +void *get_pointer_table(struct mm_struct *mm, int type) { ptable_desc *dp = ptable_list[type].next; unsigned int mask = list_empty(&ptable_list[type]) ? 0 : PD_MARKBITS(dp); @@ -164,10 +164,10 @@ void *get_pointer_table(int type) * m68k doesn't have SPLIT_PTE_PTLOCKS for not having * SMP. 
*/ - pagetable_pte_ctor(virt_to_ptdesc(page)); + pagetable_pte_ctor(mm, virt_to_ptdesc(page)); break; case TABLE_PMD: - pagetable_pmd_ctor(virt_to_ptdesc(page)); + pagetable_pmd_ctor(mm, virt_to_ptdesc(page)); break; case TABLE_PGD: pagetable_pgd_ctor(virt_to_ptdesc(page)); diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h index bbca420c96d3..942af87f1cdd 100644 --- a/arch/mips/include/asm/pgalloc.h +++ b/arch/mips/include/asm/pgalloc.h @@ -62,7 +62,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) if (!ptdesc) return NULL; - if (!pagetable_pmd_ctor(ptdesc)) { + if (!pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h index 2ca74a56415c..3b84ee93edaa 100644 --- a/arch/parisc/include/asm/pgalloc.h +++ b/arch/parisc/include/asm/pgalloc.h @@ -39,7 +39,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) ptdesc = pagetable_alloc(gfp, PMD_TABLE_ORDER); if (!ptdesc) return NULL; - if (!pagetable_pmd_ctor(ptdesc)) { + if (!pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 0e62d25062f8..0db01e10a3f8 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -417,7 +417,7 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm) ptdesc = pagetable_alloc(gfp, 0); if (!ptdesc) return NULL; - if (!pagetable_pmd_ctor(ptdesc)) { + if (!pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c index 713268ccb1a0..387e9b1fe12c 100644 --- a/arch/powerpc/mm/pgtable-frag.c +++ b/arch/powerpc/mm/pgtable-frag.c @@ -61,7 +61,7 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel) ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0); if (!ptdesc) return NULL; - if (!pagetable_pte_ctor(ptdesc)) { + if (!pagetable_pte_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index ab475ec6ca42..e5ef693fc778 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -442,7 +442,7 @@ static phys_addr_t __meminit alloc_pte_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); - BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc)); + BUG_ON(!ptdesc || !pagetable_pte_ctor(NULL, ptdesc)); return __pa((pte_t *)ptdesc_address(ptdesc)); } @@ -522,7 +522,7 @@ static phys_addr_t __meminit alloc_pmd_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); - BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc)); + BUG_ON(!ptdesc || !pagetable_pmd_ctor(NULL, ptdesc)); return __pa((pmd_t *)ptdesc_address(ptdesc)); } diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h index 005497ffebda..5345398df653 100644 --- a/arch/s390/include/asm/pgalloc.h +++ b/arch/s390/include/asm/pgalloc.h @@ -97,7 +97,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr) if (!table) return NULL; crst_table_init(table, _SEGMENT_ENTRY_EMPTY); - if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) { + if (!pagetable_pmd_ctor(mm, virt_to_ptdesc(table))) { crst_table_free(mm, table); return NULL; } diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c index e3a6f8ae156c..619d6917e3b7 100644 --- a/arch/s390/mm/pgalloc.c 
+++ b/arch/s390/mm/pgalloc.c @@ -145,7 +145,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm) ptdesc = pagetable_alloc(GFP_KERNEL, 0); if (!ptdesc) return NULL; - if (!pagetable_pte_ctor(ptdesc)) { + if (!pagetable_pte_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 760818950464..5c8eabda1d17 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2895,7 +2895,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm) if (!ptdesc) return NULL; - if (!pagetable_pte_ctor(ptdesc)) { + if (!pagetable_pte_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c index dd32711022f5..f8fb4911d360 100644 --- a/arch/sparc/mm/srmmu.c +++ b/arch/sparc/mm/srmmu.c @@ -350,7 +350,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm) page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT); spin_lock(&mm->page_table_lock); if (page_ref_inc_return(page) == 2 && - !pagetable_pte_ctor(page_ptdesc(page))) { + !pagetable_pte_ctor(mm, page_ptdesc(page))) { page_ref_dec(page); ptep = NULL; } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index f7ae44d3dd9e..9dbd25e52f10 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -205,7 +205,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count) if (!ptdesc) failed = true; - if (ptdesc && !pagetable_pmd_ctor(ptdesc)) { + if (ptdesc && !pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); ptdesc = NULL; failed = true; diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 892ece4558a2..e164ca66f0f6 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -70,7 +70,7 @@ static inline pgtable_t __pte_alloc_one_noprof(struct mm_struct *mm, gfp_t gfp) ptdesc = pagetable_alloc_noprof(gfp, 0); if (!ptdesc) return NULL; - if (!pagetable_pte_ctor(ptdesc)) { + if (!pagetable_pte_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } @@ -137,7 +137,7 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad ptdesc = pagetable_alloc_noprof(gfp, 0); if (!ptdesc) return NULL; - if (!pagetable_pmd_ctor(ptdesc)) { + if (!pagetable_pmd_ctor(mm, ptdesc)) { pagetable_free(ptdesc); return NULL; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 1690f21e7808..fa22b17e337e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3147,7 +3147,8 @@ static inline void pagetable_dtor_free(struct ptdesc *ptdesc) pagetable_free(ptdesc); } -static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc) +static inline bool pagetable_pte_ctor(struct mm_struct *mm, + struct ptdesc *ptdesc) { if (!ptlock_init(ptdesc)) return false; @@ -3253,7 +3254,8 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd) return ptl; } -static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc) +static inline bool pagetable_pmd_ctor(struct mm_struct *mm, + struct ptdesc *ptdesc) { if (!pmd_ptlock_init(ptdesc)) return false; From 65ccffcee891930864456fa47447d96a630faf8a Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:12 +0100 Subject: [PATCH 080/285] x86: pgtable: always use pte_free_kernel() Page table pages are normally freed using the appropriate helper for the given page table level. On x86, pud_free_pmd_page() and pmd_free_pte_page() are an exception to the rule: they call free_page() directly. 
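For illustration, a minimal sketch (not part of the patch, and assuming the generic <asm-generic/pgalloc.h> helpers) contrasting the two free paths for a kernel PTE page:

#include <linux/mm.h>
#include <asm/pgalloc.h>

/* Bypasses the page table destructor entirely: nothing ever undoes the ctor. */
static void pte_page_free_raw(pte_t *pte)
{
	free_page((unsigned long)pte);
}

/*
 * Level-appropriate helper: today this boils down to pagetable_free(), and it
 * will also run the dtor once pte_free_kernel() gains one (next patch).
 */
static void pte_page_free_proper(pte_t *pte)
{
	pte_free_kernel(&init_mm, pte);
}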
Constructor/destructor calls are about to be introduced for kernel PTEs. To avoid missing dtor calls in those helpers, free the PTE pages using pte_free_kernel() instead of free_page(). While at it also use pmd_free() instead of calling pagetable_dtor() explicitly at the PMD level. Link: https://lkml.kernel.org/r/20250408095222.860601-3-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Acked-by: Dave Hansen Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/x86/mm/pgtable.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 9dbd25e52f10..767fd634af4f 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -809,14 +809,13 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr) for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); - free_page((unsigned long)pte); + pte_free_kernel(&init_mm, pte); } } free_page((unsigned long)pmd_sv); - pagetable_dtor(virt_to_ptdesc(pmd)); - free_page((unsigned long)pmd); + pmd_free(&init_mm, pmd); return 1; } @@ -839,7 +838,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) /* INVLPG to clear all paging-structure caches */ flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1); - free_page((unsigned long)pte); + pte_free_kernel(&init_mm, pte); return 1; } From 49f5996664201a3581ee5ea5949ee7d5fafb9d86 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:13 +0100 Subject: [PATCH 081/285] mm: call ctor/dtor for kernel PTEs Since [1], constructors/destructors are expected to be called for all page table pages, at all levels and for both user and kernel pgtables. There is however one glaring exception: kernel PTEs are managed via separate helpers (pte_alloc_kernel/pte_free_kernel), which do not call the [cd]tor, at least not in the generic implementation. The most obvious reason for this anomaly is that init_mm is special-cased not to use split page table locks. As a result calling ptlock_init() for PTEs associated with init_mm would be wasteful, potentially resulting in dynamic memory allocation. However, pgtable [cd]tors perform other actions - currently related to accounting/statistics, and potentially more functionally significant in the future. Now that pagetable_pte_ctor() is passed the associated mm, we can make it skip the call to ptlock_init() for init_mm; this allows us to call the ctor from pte_alloc_one_kernel() too. This is matched by a call to the pgtable destructor in pte_free_kernel(); no special-casing is needed on that path, as ptlock_free() is already called unconditionally. (ptlock_free() is a no-op unless a ptlock was allocated for the given PTP.) This patch ensures that all architectures that rely on <asm-generic/pgalloc.h> call the [cd]tor for kernel PTEs. pte_free_kernel() cannot be overridden so changing the generic implementation is sufficient. pte_alloc_one_kernel() can be overridden using __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL, and a few architectures implement it by calling the page allocator directly. We amend those so that they call the generic __pte_alloc_one_kernel() instead, if possible, ensuring that the ctor is called.
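To illustrate the resulting pattern, here is a sketch (loosely modelled on the microblaze and openrisc changes below, not lifted verbatim from any single architecture) of an override that simply delegates to the generic helper so the ctor runs:

/*
 * Sketch: lives in the architecture's pgalloc code, with
 * __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL defined before <asm-generic/pgalloc.h>
 * is included.
 */
pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
{
	/*
	 * __pte_alloc_one_kernel() allocates the ptdesc and calls
	 * pagetable_pte_ctor(mm, ptdesc); for init_mm the ctor skips
	 * ptlock_init() but still runs __pagetable_ctor() for accounting.
	 */
	return __pte_alloc_one_kernel(mm);
}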
A few architectures do not use <asm-generic/pgalloc.h>; those will be taken care of separately. [1] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ Link: https://lkml.kernel.org/r/20250408095222.860601-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Reviewed-by: Alexander Gordeev # s390 Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Signed-off-by: Andrew Morton --- arch/csky/include/asm/pgalloc.h | 2 +- arch/microblaze/mm/pgtable.c | 2 +- arch/openrisc/mm/ioremap.c | 2 +- include/asm-generic/pgalloc.h | 7 ++++++- include/linux/mm.h | 2 +- 5 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h index 11055c574968..9ed2b15ffd94 100644 --- a/arch/csky/include/asm/pgalloc.h +++ b/arch/csky/include/asm/pgalloc.h @@ -29,7 +29,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) pte_t *pte; unsigned long i; - pte = (pte_t *) __get_free_page(GFP_KERNEL); + pte = __pte_alloc_one_kernel(mm); if (!pte) return NULL; diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c index 9f73265aad4e..e96dd1b7aba4 100644 --- a/arch/microblaze/mm/pgtable.c +++ b/arch/microblaze/mm/pgtable.c @@ -245,7 +245,7 @@ unsigned long iopa(unsigned long addr) __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm) { if (mem_init_done) - return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + return __pte_alloc_one_kernel(mm); else return memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE, MEMBLOCK_LOW_LIMIT, diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c index 8e63e86251ca..3b352f97fecb 100644 --- a/arch/openrisc/mm/ioremap.c +++ b/arch/openrisc/mm/ioremap.c @@ -36,7 +36,7 @@ pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm) pte_t *pte; if (likely(mem_init_done)) { - pte = (pte_t *)get_zeroed_page(GFP_KERNEL); + pte = __pte_alloc_one_kernel(mm); } else { pte = memblock_alloc_or_panic(PAGE_SIZE, PAGE_SIZE); } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index e164ca66f0f6..3c8ec3bfea44 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -23,6 +23,11 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm) if (!ptdesc) return NULL; + if (!pagetable_pte_ctor(mm, ptdesc)) { + pagetable_free(ptdesc); + return NULL; + } + return ptdesc_address(ptdesc); } #define __pte_alloc_one_kernel(...)
alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__)) @@ -48,7 +53,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_free(virt_to_ptdesc(pte)); + pagetable_dtor_free(virt_to_ptdesc(pte)); } /** diff --git a/include/linux/mm.h b/include/linux/mm.h index fa22b17e337e..ce6832787f6d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3150,7 +3150,7 @@ static inline void pagetable_dtor_free(struct ptdesc *ptdesc) static inline bool pagetable_pte_ctor(struct mm_struct *mm, struct ptdesc *ptdesc) { - if (!ptlock_init(ptdesc)) + if (mm != &init_mm && !ptlock_init(ptdesc)) return false; __pagetable_ctor(ptdesc); return true; From 5a392e991dedebee51cb917b2f4fe600068fda49 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:14 +0100 Subject: [PATCH 082/285] m68k: mm: call ctor/dtor for kernel PTEs The generic implementation of pte_{alloc_one,free}_kernel now calls the [cd]tor. Align the m68k/ColdFire implementation of those functions by calling the [cd]tor explicitly. Link: https://lkml.kernel.org/r/20250408095222.860601-5-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/m68k/include/asm/mcf_pgalloc.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h index 465a71101b7d..fc5454d37da3 100644 --- a/arch/m68k/include/asm/mcf_pgalloc.h +++ b/arch/m68k/include/asm/mcf_pgalloc.h @@ -7,7 +7,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_free(virt_to_ptdesc(pte)); + pagetable_dtor_free(virt_to_ptdesc(pte)); } extern const char bad_pmd_string[]; @@ -19,6 +19,10 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) if (!ptdesc) return NULL; + if (!pagetable_pte_ctor(mm, ptdesc)) { + pagetable_free(ptdesc); + return NULL; + } return ptdesc_address(ptdesc); } From 8e8299bf386a0e41b1fd2106e81060d56d1b7b31 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:15 +0100 Subject: [PATCH 083/285] powerpc: mm: call ctor/dtor for kernel PTEs The generic implementation of pte_{alloc_one,free}_kernel now calls the [cd]tor, without initialising the ptlock needlessly as pagetable_pte_ctor() skips it for init_mm. On powerpc, all functions related to PTE allocation are implemented by common helpers, which are passed a boolean to differentiate user from kernel pgtables. This patch aligns the powerpc implementation with the generic one by calling pagetable_pte_[cd]tor() unconditionally in those helpers. Link: https://lkml.kernel.org/r/20250408095222.860601-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. 
Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/powerpc/mm/pgtable-frag.c | 30 +++++++++++++----------------- 1 file changed, 13 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c index 387e9b1fe12c..77e55eac16e4 100644 --- a/arch/powerpc/mm/pgtable-frag.c +++ b/arch/powerpc/mm/pgtable-frag.c @@ -56,19 +56,17 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel) { void *ret = NULL; struct ptdesc *ptdesc; + gfp_t gfp = PGALLOC_GFP; - if (!kernel) { - ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0); - if (!ptdesc) - return NULL; - if (!pagetable_pte_ctor(mm, ptdesc)) { - pagetable_free(ptdesc); - return NULL; - } - } else { - ptdesc = pagetable_alloc(PGALLOC_GFP, 0); - if (!ptdesc) - return NULL; + if (!kernel) + gfp |= __GFP_ACCOUNT; + + ptdesc = pagetable_alloc(gfp, 0); + if (!ptdesc) + return NULL; + if (!pagetable_pte_ctor(mm, ptdesc)) { + pagetable_free(ptdesc); + return NULL; } atomic_set(&ptdesc->pt_frag_refcount, 1); @@ -124,12 +122,10 @@ void pte_fragment_free(unsigned long *table, int kernel) BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0); if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) { - if (kernel) - pagetable_free(ptdesc); - else if (folio_test_clear_active(ptdesc_folio(ptdesc))) - call_rcu(&ptdesc->pt_rcu_head, pte_free_now); - else + if (kernel || !folio_test_clear_active(ptdesc_folio(ptdesc))) pte_free_now(&ptdesc->pt_rcu_head); + else + call_rcu(&ptdesc->pt_rcu_head, pte_free_now); } } From 10a2e444e4ade64a1b12ccc698a77bd0ff5a57db Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:16 +0100 Subject: [PATCH 084/285] sparc64: mm: call ctor/dtor for kernel PTEs The generic implementation of pte_{alloc_one,free}_kernel now calls the [cd]tor, without initialising the ptlock needlessly as pagetable_pte_ctor() skips it for init_mm. Align sparc64 with the generic implementation by ensuring pagetable_pte_[cd]tor() are called for kernel PTEs. As a result the kernel and user alloc/free functions have the same implementation, and since pgtable_t is defined as pte_t *, we can have both call a common helper. Link: https://lkml.kernel.org/r/20250408095222.860601-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. 
Miller Cc: Geert Uytterhoeven Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/sparc/mm/init_64.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 5c8eabda1d17..25ae4c897aae 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2878,18 +2878,7 @@ void __flush_tlb_all(void) : : "r" (pstate)); } -pte_t *pte_alloc_one_kernel(struct mm_struct *mm) -{ - struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO); - pte_t *pte = NULL; - - if (page) - pte = (pte_t *) page_address(page); - - return pte; -} - -pgtable_t pte_alloc_one(struct mm_struct *mm) +static pte_t *__pte_alloc_one(struct mm_struct *mm) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0); @@ -2902,9 +2891,14 @@ pgtable_t pte_alloc_one(struct mm_struct *mm) return ptdesc_address(ptdesc); } -void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +pte_t *pte_alloc_one_kernel(struct mm_struct *mm) { - free_page((unsigned long)pte); + return __pte_alloc_one(mm); +} + +pgtable_t pte_alloc_one(struct mm_struct *mm) +{ + return __pte_alloc_one(mm); } static void __pte_free(pgtable_t pte) @@ -2915,6 +2909,11 @@ static void __pte_free(pgtable_t pte) pagetable_free(ptdesc); } +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + __pte_free(pte); +} + void pte_free(struct mm_struct *mm, pgtable_t pte) { __pte_free(pte); From 8240d8d3c5fb65fbdace7ae4db909b51f2253da7 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:17 +0100 Subject: [PATCH 085/285] mm: skip ptlock_init() for kernel PMDs Split page table locks are not used for pgtables associated to init_mm, at any level. pte_alloc_kernel() does not call ptlock_init() as a result. There are however no separate alloc/free functions for kernel PMDs, and pmd_ptlock_init() is called unconditionally. When ALLOC_SPLIT_PTLOCKS is true (e.g. 32-bit architectures or if CONFIG_PREEMPT_RT is selected), this results in unnecessary dynamic memory allocation every time a kernel PMD is allocated. Now that pagetable_pmd_ctor() is passed the associated mm, we can easily remove this overhead by skipping pmd_ptlock_init() if the pgtable is associated to init_mm. No special-casing is needed on the dtor path, as ptlock_free() is already called unconditionally for all levels. (ptlock_free() is a no-op unless a ptlock was allocated for the given PTP.) Link: https://lkml.kernel.org/r/20250408095222.860601-8-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S.
Miller Cc: Geert Uytterhoeven Cc: Linus Walleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ce6832787f6d..5eb0d77c4438 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3257,7 +3257,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd) static inline bool pagetable_pmd_ctor(struct mm_struct *mm, struct ptdesc *ptdesc) { - if (!pmd_ptlock_init(ptdesc)) + if (mm != &init_mm && !pmd_ptlock_init(ptdesc)) return false; ptdesc_pmd_pts_init(ptdesc); __pagetable_ctor(ptdesc); From c64f46ee1377961602832fc4e6768bb2f76642c2 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:18 +0100 Subject: [PATCH 086/285] arm64: mm: use enum to identify pgtable level instead of *_SHIFT Commit 90292aca9854 ("arm64: mm: use appropriate ctors for page tables") introduced pgtable ctor calls in pgd_pgtable_alloc(). To identify the pgtable level and call the appropriate ctor, the *_SHIFT value associated with the pgtable level is used. However, those values do not unambiguously identify a level, because if a given level is folded, the *_SHIFT value will be equal to that of the upper level (e.g. PMD_SHIFT == PUD_SHIFT if PMD is folded). As things stand, there is probably not much damage done by calling the ctor for a different level, and ARCH_ENABLE_SPLIT_PMD_PTLOCK is only selected if PMD isn't folded (so we don't needlessly initialise pmd_ptlock). Still, this is pretty confusing, and it would get even more confusing when adding ctor calls for the remaining levels. Let's simplify all this by using an enum to identify the pgtable level instead; this way folding becomes irrelevant. This is inspired by one of the m68k pgtable allocators (arch/m68k/include/asm/motorola_pgalloc.h). Link: https://lkml.kernel.org/r/20250408095222.860601-9-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S.
Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/arm64/mm/mmu.c | 54 +++++++++++++++++++++++++++------------------ 1 file changed, 33 insertions(+), 21 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 8c5c471cfb06..eca324b3a6fc 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -46,6 +46,13 @@ #define NO_CONT_MAPPINGS BIT(1) #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ +enum pgtable_type { + TABLE_PTE, + TABLE_PMD, + TABLE_PUD, + TABLE_P4D, +}; + u64 kimage_voffset __ro_after_init; EXPORT_SYMBOL(kimage_voffset); @@ -107,7 +114,7 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, } EXPORT_SYMBOL(phys_mem_access_prot); -static phys_addr_t __init early_pgtable_alloc(int shift) +static phys_addr_t __init early_pgtable_alloc(enum pgtable_type pgtable_type) { phys_addr_t phys; @@ -192,7 +199,7 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end, static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { unsigned long next; @@ -207,7 +214,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, if (flags & NO_EXEC_MAPPINGS) pmdval |= PMD_TABLE_PXN; BUG_ON(!pgtable_alloc); - pte_phys = pgtable_alloc(PAGE_SHIFT); + pte_phys = pgtable_alloc(TABLE_PTE); ptep = pte_set_fixmap(pte_phys); init_clear_pgtable(ptep); ptep += pte_index(addr); @@ -243,7 +250,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), int flags) + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { unsigned long next; @@ -277,7 +284,8 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end, static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), int flags) + phys_addr_t (*pgtable_alloc)(enum pgtable_type), + int flags) { unsigned long next; pud_t pud = READ_ONCE(*pudp); @@ -294,7 +302,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, if (flags & NO_EXEC_MAPPINGS) pudval |= PUD_TABLE_PXN; BUG_ON(!pgtable_alloc); - pmd_phys = pgtable_alloc(PMD_SHIFT); + pmd_phys = pgtable_alloc(TABLE_PMD); pmdp = pmd_set_fixmap(pmd_phys); init_clear_pgtable(pmdp); pmdp += pmd_index(addr); @@ -325,7 +333,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { unsigned long next; @@ -339,7 +347,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end, if (flags & NO_EXEC_MAPPINGS) p4dval |= P4D_TABLE_PXN; BUG_ON(!pgtable_alloc); - pud_phys = pgtable_alloc(PUD_SHIFT); + pud_phys = pgtable_alloc(TABLE_PUD); pudp = pud_set_fixmap(pud_phys); init_clear_pgtable(pudp); pudp += pud_index(addr); @@ -383,7 +391,7 @@ static void alloc_init_pud(p4d_t 
*p4dp, unsigned long addr, unsigned long end, static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { unsigned long next; @@ -397,7 +405,7 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end, if (flags & NO_EXEC_MAPPINGS) pgdval |= PGD_TABLE_PXN; BUG_ON(!pgtable_alloc); - p4d_phys = pgtable_alloc(P4D_SHIFT); + p4d_phys = pgtable_alloc(TABLE_P4D); p4dp = p4d_set_fixmap(p4d_phys); init_clear_pgtable(p4dp); p4dp += p4d_index(addr); @@ -427,7 +435,7 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end, static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys, unsigned long virt, phys_addr_t size, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { unsigned long addr, end, next; @@ -455,7 +463,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys, static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, unsigned long virt, phys_addr_t size, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), + phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) { mutex_lock(&fixmap_lock); @@ -468,10 +476,11 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, extern __alias(__create_pgd_mapping_locked) void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt, phys_addr_t size, pgprot_t prot, - phys_addr_t (*pgtable_alloc)(int), int flags); + phys_addr_t (*pgtable_alloc)(enum pgtable_type), + int flags); #endif -static phys_addr_t __pgd_pgtable_alloc(int shift) +static phys_addr_t __pgd_pgtable_alloc(enum pgtable_type pgtable_type) { /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */ void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO); @@ -480,23 +489,26 @@ static phys_addr_t __pgd_pgtable_alloc(int shift) return __pa(ptr); } -static phys_addr_t pgd_pgtable_alloc(int shift) +static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type) { - phys_addr_t pa = __pgd_pgtable_alloc(shift); + phys_addr_t pa = __pgd_pgtable_alloc(pgtable_type); struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa)); /* * Call proper page table ctor in case later we need to * call core mm functions like apply_to_page_range() on * this pre-allocated page table. - * - * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is - * folded, and if so pagetable_pte_ctor() becomes nop. */ - if (shift == PAGE_SHIFT) + switch (pgtable_type) { + case TABLE_PTE: BUG_ON(!pagetable_pte_ctor(NULL, ptdesc)); - else if (shift == PMD_SHIFT) + break; + case TABLE_PMD: BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc)); + break; + default: + break; + } return pa; } From 5e8eb9aeeda3a7aaf48efa1d34ae804e894e307f Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:19 +0100 Subject: [PATCH 087/285] arm64: mm: always call PTE/PMD ctor in __create_pgd_mapping() TL;DR: always call the PTE/PMD ctor, passing the appropriate mm to skip ptlock_init() if unneeded. __create_pgd_mapping() is used for creating different kinds of mappings, and may allocate page table pages if passed an allocator callback. There are currently three such cases: 1. create_pgd_mapping(), which is used to create the EFI mapping 2. arch_add_memory() 3. map_entry_trampoline() 1. uses pgd_pgtable_alloc() as allocator callback, which calls the PTE/PMD ctor, while 2. and 3. 
use __pgd_pgtable_alloc(), which does not. The rationale is most likely that pgtables associated with init_mm do not make use of split page table locks, and it is therefore unnecessary to initialise them by calling the ctor. 2. operates on swapper_pg_dir so the allocated pgtables are clearly associated with init_mm, this is arguably the case for 3. too (the trampoline mapping is never modified so ptlocks are anyway irrelevant). 1. corresponds to efi_mm so ptlocks need to be initialised in that case. We are now moving towards calling the ctor for all page tables, even those associated with init_mm. pagetable_{pte,pmd}_ctor() have become aware of the associated mm so that the ptlock initialisation can be skipped for init_mm. This patch therefore amends the allocator callbacks so that the PTE/PMD ctor are always called, with an appropriate mm pointer to avoid unnecessary ptlock overhead. Modifying the prototype of the allocator callbacks to take the mm and propagating that pointer all the way down would be pretty invasive. Instead: * __pgd_pgtable_alloc() (cases 2. and 3. above) is replaced with pgd_pgtable_alloc_init_mm(), resulting in the ctors being called with &init_mm. This is the main functional change in this patch; the ptlock still isn't initialised, but other ctor actions (e.g. accounting-related) are now carried out for those allocated pgtables. * pgd_pgtable_alloc() (case 1. above) is replaced with pgd_pgtable_alloc_special_mm(), resulting in the ctors being called with NULL as mm. No functional change here; NULL essentially means "not init_mm", and the ptlock is still initialised. __pgd_pgtable_alloc() is now the common implementation of those two helpers. While at it we switch it to using pagetable_alloc() like standard pgtable allocator functions and remove the comment regarding ctor calls (ctors are now always expected to be called). Link: https://lkml.kernel.org/r/20250408095222.860601-10-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/arm64/mm/mmu.c | 43 +++++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 20 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index eca324b3a6fc..51cfc891f6a1 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -480,31 +480,22 @@ void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt, int flags); #endif -static phys_addr_t __pgd_pgtable_alloc(enum pgtable_type pgtable_type) +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, + enum pgtable_type pgtable_type) { /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. 
*/ - void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO); + struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_ZERO, 0); + phys_addr_t pa; - BUG_ON(!ptr); - return __pa(ptr); -} + BUG_ON(!ptdesc); + pa = page_to_phys(ptdesc_page(ptdesc)); -static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type) -{ - phys_addr_t pa = __pgd_pgtable_alloc(pgtable_type); - struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa)); - - /* - * Call proper page table ctor in case later we need to - * call core mm functions like apply_to_page_range() on - * this pre-allocated page table. - */ switch (pgtable_type) { case TABLE_PTE: - BUG_ON(!pagetable_pte_ctor(NULL, ptdesc)); + BUG_ON(!pagetable_pte_ctor(mm, ptdesc)); break; case TABLE_PMD: - BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc)); + BUG_ON(!pagetable_pmd_ctor(mm, ptdesc)); break; default: break; @@ -513,6 +504,18 @@ static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type) return pa; } +static phys_addr_t __maybe_unused +pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type) +{ + return __pgd_pgtable_alloc(&init_mm, pgtable_type); +} + +static phys_addr_t +pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type) +{ + return __pgd_pgtable_alloc(NULL, pgtable_type); +} + /* * This function can only be used to modify existing table entries, * without allocating new levels of table. Note that this permits the @@ -542,7 +545,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; __create_pgd_mapping(mm->pgd, phys, virt, size, prot, - pgd_pgtable_alloc, flags); + pgd_pgtable_alloc_special_mm, flags); } static void update_mapping_prot(phys_addr_t phys, unsigned long virt, @@ -756,7 +759,7 @@ static int __init map_entry_trampoline(void) memset(tramp_pg_dir, 0, PGD_SIZE); __create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS, entry_tramp_text_size(), prot, - __pgd_pgtable_alloc, NO_BLOCK_MAPPINGS); + pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS); /* Map both the text and data into the kernel page table */ for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++) @@ -1362,7 +1365,7 @@ int arch_add_memory(int nid, u64 start, u64 size, flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), - size, params->pgprot, __pgd_pgtable_alloc, + size, params->pgprot, pgd_pgtable_alloc_init_mm, flags); memblock_clear_nomap(start, size); From 0e3a16a760c6aececc7f8211f6a6807d1719c7bd Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:20 +0100 Subject: [PATCH 088/285] riscv: mm: clarify ctor mm argument in alloc_{pte,pmd}_late pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation if mm is &init_mm. To avoid unnecessary overhead, it is therefore preferable to pass the actual mm associated to the PTE/PMD. Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as the associated mm is not available at the point where they are called - in fact not even top-level functions like create_pgd_mapping() are passed the mm. As a result they both call the ctor with NULL as mm; this is safe but potentially wasteful. This is not a new situation, but let's add a couple of comments to clarify it. Link: https://lkml.kernel.org/r/20250408095222.860601-11-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. 
Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/riscv/mm/init.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index e5ef693fc778..59a982f88908 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -442,6 +442,11 @@ static phys_addr_t __meminit alloc_pte_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); + /* + * We do not know which mm the PTE page is associated to at this point. + * Passing NULL to the ctor is the safe option, though it may result + * in unnecessary work (e.g. initialising the ptlock for init_mm). + */ BUG_ON(!ptdesc || !pagetable_pte_ctor(NULL, ptdesc)); return __pa((pte_t *)ptdesc_address(ptdesc)); } @@ -522,6 +527,7 @@ static phys_addr_t __meminit alloc_pmd_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); + /* See comment in alloc_pte_late() regarding NULL passed the ctor */ BUG_ON(!ptdesc || !pagetable_pmd_ctor(NULL, ptdesc)); return __pa((pmd_t *)ptdesc_address(ptdesc)); } From cb5d2be83862d67eea4d08493a3ad5e4e4cb2f89 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:21 +0100 Subject: [PATCH 089/285] arm64: mm: call PUD/P4D ctor in __create_pgd_mapping() Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by __create_pgd_mapping() too. Link: https://lkml.kernel.org/r/20250408095222.860601-12-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/arm64/mm/mmu.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 51cfc891f6a1..8fcf59ba39db 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -497,7 +497,11 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, case TABLE_PMD: BUG_ON(!pagetable_pmd_ctor(mm, ptdesc)); break; - default: + case TABLE_PUD: + pagetable_pud_ctor(ptdesc); + break; + case TABLE_P4D: + pagetable_p4d_ctor(ptdesc); break; } From 8472cc4503eb8efac9a9a264814bab3b89ddbd87 Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 8 Apr 2025 10:52:22 +0100 Subject: [PATCH 090/285] riscv: mm: call PUD/P4D ctor in special kernel pgtable alloc Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by create_pgd_mapping() too. While at it also switch to using pagetable_alloc() like in alloc_{pte,pmd}_late(). Link: https://lkml.kernel.org/r/20250408095222.860601-13-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky Cc: Albert Ou Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. 
Miller Cc: Geert Uytterhoeven Cc: Linus Waleij Cc: Madhavan Srinivasan Cc: Mark Rutland Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Mike Rapoport Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Qi Zheng Cc: Ryan Roberts Cc: Will Deacon Cc: Cc: Yang Shi Cc: Dave Hansen Cc: Alexander Gordeev Signed-off-by: Andrew Morton --- arch/riscv/mm/init.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 59a982f88908..8d0374d7ce8e 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -590,11 +590,11 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va) static phys_addr_t __meminit alloc_pud_late(uintptr_t va) { - unsigned long vaddr; + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0); - vaddr = __get_free_page(GFP_KERNEL); - BUG_ON(!vaddr); - return __pa(vaddr); + BUG_ON(!ptdesc); + pagetable_pud_ctor(ptdesc); + return __pa((pud_t *)ptdesc_address(ptdesc)); } static p4d_t *__init get_p4d_virt_early(phys_addr_t pa) @@ -628,11 +628,11 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va) static phys_addr_t __meminit alloc_p4d_late(uintptr_t va) { - unsigned long vaddr; + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0); - vaddr = __get_free_page(GFP_KERNEL); - BUG_ON(!vaddr); - return __pa(vaddr); + BUG_ON(!ptdesc); + pagetable_p4d_ctor(ptdesc); + return __pa((p4d_t *)ptdesc_address(ptdesc)); } static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz, From 5bb9ed6cdfeb75883652fd0ed3e3885083a92b4b Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 8 Apr 2025 09:22:38 +0000 Subject: [PATCH 091/285] mm: rust: add abstraction for struct mm_struct MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Rust support for mm_struct, vm_area_struct, and mmap", v16. This updates the vm_area_struct support to use the approach we discussed at LPC where there are several different Rust wrappers for vm_area_struct depending on the kind of access you have to the vma. Each case allows a different set of operations on the vma. This includes an MM MAINTAINERS entry as proposed by Lorenzo: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/ This patch (of 9): These abstractions allow you to reference a `struct mm_struct` using both mmgrab and mmget refcounts. This is done using two Rust types: * Mm - represents an mm_struct where you don't know anything about the value of mm_users. * MmWithUser - represents an mm_struct where you know at compile time that mm_users is non-zero. This allows us to encode in the type system whether a method requires that mm_users is non-zero or not. For instance, you can always call `mmget_not_zero` but you can only call `mmap_read_lock` when mm_users is non-zero. The struct is called Mm to keep consistency with the C side. The ability to obtain `current->mm` is added later in this series. The mm module is defined to only exist when CONFIG_MMU is set. This avoids various errors due to missing types and functions when CONFIG_MMU is disabled. More fine-grained cfgs can be considered in the future. See the thread at [1] for more info. Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com Link: https://lkml.kernel.org/r/20250408-vma-v16-1-d8b446e885d9@google.com Link: https://lore.kernel.org/all/202503091916.QousmtcY-lkp@intel.com/ Signed-off-by: Alice Ryhl Acked-by: Lorenzo Stoakes Acked-by: Liam R. 
Howlett Acked-by: Balbir Singh Reviewed-by: Andreas Hindborg Reviewed-by: Gary Guo Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Benno Lossin Cc: Björn Roy Baron Cc: Boqun Feng Cc: Greg Kroah-Hartman Cc: Jann Horn Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/helpers/helpers.c | 1 + rust/helpers/mm.c | 39 ++++++++ rust/kernel/lib.rs | 1 + rust/kernel/mm.rs | 210 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 251 insertions(+) create mode 100644 rust/helpers/mm.c create mode 100644 rust/kernel/mm.rs diff --git a/rust/helpers/helpers.c b/rust/helpers/helpers.c index 1e7c84df7252..0aea103c16be 100644 --- a/rust/helpers/helpers.c +++ b/rust/helpers/helpers.c @@ -20,6 +20,7 @@ #include "io.c" #include "jump_label.c" #include "kunit.c" +#include "mm.c" #include "mutex.c" #include "page.c" #include "platform.c" diff --git a/rust/helpers/mm.c b/rust/helpers/mm.c new file mode 100644 index 000000000000..7201747a5d31 --- /dev/null +++ b/rust/helpers/mm.c @@ -0,0 +1,39 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/mm.h> +#include <linux/sched/mm.h> + +void rust_helper_mmgrab(struct mm_struct *mm) +{ + mmgrab(mm); +} + +void rust_helper_mmdrop(struct mm_struct *mm) +{ + mmdrop(mm); +} + +void rust_helper_mmget(struct mm_struct *mm) +{ + mmget(mm); +} + +bool rust_helper_mmget_not_zero(struct mm_struct *mm) +{ + return mmget_not_zero(mm); +} + +void rust_helper_mmap_read_lock(struct mm_struct *mm) +{ + mmap_read_lock(mm); +} + +bool rust_helper_mmap_read_trylock(struct mm_struct *mm) +{ + return mmap_read_trylock(mm); +} + +void rust_helper_mmap_read_unlock(struct mm_struct *mm) +{ + mmap_read_unlock(mm); +} diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index de07aadd1ff5..42ab6cf4053f 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -61,6 +61,7 @@ pub mod jump_label; pub mod kunit; pub mod list; pub mod miscdevice; +pub mod mm; #[cfg(CONFIG_NET)] pub mod net; pub mod of; diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs new file mode 100644 index 000000000000..eda7a479cff7 --- /dev/null +++ b/rust/kernel/mm.rs @@ -0,0 +1,210 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Copyright (C) 2024 Google LLC. + +//! Memory management. +//! +//! This module deals with managing the address space of userspace processes. Each process has an +//! instance of [`Mm`], which keeps track of multiple VMAs (virtual memory areas). Each VMA +//! corresponds to a region of memory that the userspace process can access, and the VMA lets you +//! control what happens when userspace reads or writes to that region of memory. +//! +//! C header: [`include/linux/mm.h`](srctree/include/linux/mm.h) +#![cfg(CONFIG_MMU)] + +use crate::{ + bindings, + types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque}, +}; +use core::{ops::Deref, ptr::NonNull}; + +/// A wrapper for the kernel's `struct mm_struct`. +/// +/// This represents the address space of a userspace process, so each process has one `Mm` +/// instance. It may hold many VMAs internally. +/// +/// There is a counter called `mm_users` that counts the users of the address space; this includes +/// the userspace process itself, but can also include kernel threads accessing the address space. +/// Once `mm_users` reaches zero, this indicates that the address space can be destroyed. To access +/// the address space, you must prevent `mm_users` from reaching zero while you are accessing it.
+/// The [`MmWithUser`] type represents an address space where this is guaranteed, and you can +/// create one using [`mmget_not_zero`]. +/// +/// The `ARef<Mm>` smart pointer holds an `mmgrab` refcount. Its destructor may sleep. +/// +/// # Invariants +/// +/// Values of this type are always refcounted using `mmgrab`. +/// +/// [`mmget_not_zero`]: Mm::mmget_not_zero +#[repr(transparent)] +pub struct Mm { + mm: Opaque<bindings::mm_struct>, +} + +// SAFETY: It is safe to call `mmdrop` on another thread than where `mmgrab` was called. +unsafe impl Send for Mm {} +// SAFETY: All methods on `Mm` can be called in parallel from several threads. +unsafe impl Sync for Mm {} + +// SAFETY: By the type invariants, this type is always refcounted. +unsafe impl AlwaysRefCounted for Mm { + #[inline] + fn inc_ref(&self) { + // SAFETY: The pointer is valid since self is a reference. + unsafe { bindings::mmgrab(self.as_raw()) }; + } + + #[inline] + unsafe fn dec_ref(obj: NonNull<Self>) { + // SAFETY: The caller is giving up their refcount. + unsafe { bindings::mmdrop(obj.cast().as_ptr()) }; + } +} + +/// A wrapper for the kernel's `struct mm_struct`. +/// +/// This type is like [`Mm`], but with non-zero `mm_users`. It can only be used when `mm_users` can +/// be proven to be non-zero at compile-time, usually because the relevant code holds an `mmget` +/// refcount. It can be used to access the associated address space. +/// +/// The `ARef<MmWithUser>` smart pointer holds an `mmget` refcount. Its destructor may sleep. +/// +/// # Invariants +/// +/// Values of this type are always refcounted using `mmget`. The value of `mm_users` is non-zero. +#[repr(transparent)] +pub struct MmWithUser { + mm: Mm, +} + +// SAFETY: It is safe to call `mmput` on another thread than where `mmget` was called. +unsafe impl Send for MmWithUser {} +// SAFETY: All methods on `MmWithUser` can be called in parallel from several threads. +unsafe impl Sync for MmWithUser {} + +// SAFETY: By the type invariants, this type is always refcounted. +unsafe impl AlwaysRefCounted for MmWithUser { + #[inline] + fn inc_ref(&self) { + // SAFETY: The pointer is valid since self is a reference. + unsafe { bindings::mmget(self.as_raw()) }; + } + + #[inline] + unsafe fn dec_ref(obj: NonNull<Self>) { + // SAFETY: The caller is giving up their refcount. + unsafe { bindings::mmput(obj.cast().as_ptr()) }; + } +} + +// Make all `Mm` methods available on `MmWithUser`. +impl Deref for MmWithUser { + type Target = Mm; + + #[inline] + fn deref(&self) -> &Mm { + &self.mm + } +} + +// These methods are safe to call even if `mm_users` is zero. +impl Mm { + /// Returns a raw pointer to the inner `mm_struct`. + #[inline] + pub fn as_raw(&self) -> *mut bindings::mm_struct { + self.mm.get() + } + + /// Obtain a reference from a raw pointer. + /// + /// # Safety + /// + /// The caller must ensure that `ptr` points at an `mm_struct`, and that it is not deallocated + /// during the lifetime 'a. + #[inline] + pub unsafe fn from_raw<'a>(ptr: *const bindings::mm_struct) -> &'a Mm { + // SAFETY: Caller promises that the pointer is valid for 'a. Layouts are compatible due to + // repr(transparent). + unsafe { &*ptr.cast() } + } + + /// Calls `mmget_not_zero` and returns a handle if it succeeds. + #[inline] + pub fn mmget_not_zero(&self) -> Option<ARef<MmWithUser>> { + // SAFETY: The pointer is valid since self is a reference. + let success = unsafe { bindings::mmget_not_zero(self.as_raw()) }; + + if success { + // SAFETY: We just created an `mmget` refcount.
+ Some(unsafe { ARef::from_raw(NonNull::new_unchecked(self.as_raw().cast())) }) + } else { + None + } + } +} + +// These methods require `mm_users` to be non-zero. +impl MmWithUser { + /// Obtain a reference from a raw pointer. + /// + /// # Safety + /// + /// The caller must ensure that `ptr` points at an `mm_struct`, and that `mm_users` remains + /// non-zero for the duration of the lifetime 'a. + #[inline] + pub unsafe fn from_raw<'a>(ptr: *const bindings::mm_struct) -> &'a MmWithUser { + // SAFETY: Caller promises that the pointer is valid for 'a. The layout is compatible due + // to repr(transparent). + unsafe { &*ptr.cast() } + } + + /// Lock the mmap read lock. + #[inline] + pub fn mmap_read_lock(&self) -> MmapReadGuard<'_> { + // SAFETY: The pointer is valid since self is a reference. + unsafe { bindings::mmap_read_lock(self.as_raw()) }; + + // INVARIANT: We just acquired the read lock. + MmapReadGuard { + mm: self, + _nts: NotThreadSafe, + } + } + + /// Try to lock the mmap read lock. + #[inline] + pub fn mmap_read_trylock(&self) -> Option<MmapReadGuard<'_>> { + // SAFETY: The pointer is valid since self is a reference. + let success = unsafe { bindings::mmap_read_trylock(self.as_raw()) }; + + if success { + // INVARIANT: We just acquired the read lock. + Some(MmapReadGuard { + mm: self, + _nts: NotThreadSafe, + }) + } else { + None + } + } +} + +/// A guard for the mmap read lock. +/// +/// # Invariants +/// +/// This `MmapReadGuard` guard owns the mmap read lock. +pub struct MmapReadGuard<'a> { + mm: &'a MmWithUser, + // `mmap_read_lock` and `mmap_read_unlock` must be called on the same thread + _nts: NotThreadSafe, +} + +impl Drop for MmapReadGuard<'_> { + #[inline] + fn drop(&mut self) { + // SAFETY: We hold the read lock by the type invariants. + unsafe { bindings::mmap_read_unlock(self.mm.as_raw()) }; + } +} From 040f404b731207935ed644b14bcc2bb8b8488d00 Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 8 Apr 2025 09:22:39 +0000 Subject: [PATCH 092/285] mm: rust: add vm_area_struct methods that require read access MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds a type called VmaRef which is used when referencing a vma that you have read access to. Here, read access means that you hold either the mmap read lock or the vma read lock (or stronger). Additionally, a vma_lookup method is added to the mmap read guard, which enables you to obtain a &VmaRef in safe Rust code. This patch only provides a way to lock the mmap read lock, but a follow-up patch also provides a way to just lock the vma read lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-2-d8b446e885d9@google.com Signed-off-by: Alice Ryhl Acked-by: Lorenzo Stoakes Acked-by: Liam R.
Howlett Reviewed-by: Jann Horn Reviewed-by: Andreas Hindborg Reviewed-by: Gary Guo Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Balbir Singh Cc: Benno Lossin Cc: Björn Roy Baron Cc: Boqun Feng Cc: Greg Kroah-Hartman Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/helpers/mm.c | 6 ++ rust/kernel/mm.rs | 23 +++++ rust/kernel/mm/virt.rs | 210 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 239 insertions(+) create mode 100644 rust/kernel/mm/virt.rs diff --git a/rust/helpers/mm.c b/rust/helpers/mm.c index 7201747a5d31..7b72eb065a3e 100644 --- a/rust/helpers/mm.c +++ b/rust/helpers/mm.c @@ -37,3 +37,9 @@ void rust_helper_mmap_read_unlock(struct mm_struct *mm) { mmap_read_unlock(mm); } + +struct vm_area_struct *rust_helper_vma_lookup(struct mm_struct *mm, + unsigned long addr) +{ + return vma_lookup(mm, addr); +} diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs index eda7a479cff7..f1689ccb3740 100644 --- a/rust/kernel/mm.rs +++ b/rust/kernel/mm.rs @@ -18,6 +18,8 @@ use crate::{ }; use core::{ops::Deref, ptr::NonNull}; +pub mod virt; + /// A wrapper for the kernel's `struct mm_struct`. /// /// This represents the address space of a userspace process, so each process has one `Mm` @@ -201,6 +203,27 @@ pub struct MmapReadGuard<'a> { _nts: NotThreadSafe, } +impl<'a> MmapReadGuard<'a> { + /// Look up a vma at the given address. + #[inline] + pub fn vma_lookup(&self, vma_addr: usize) -> Option<&virt::VmaRef> { + // SAFETY: By the type invariants we hold the mmap read guard, so we can safely call this + // method. Any value is okay for `vma_addr`. + let vma = unsafe { bindings::vma_lookup(self.mm.as_raw(), vma_addr) }; + + if vma.is_null() { + None + } else { + // SAFETY: We just checked that a vma was found, so the pointer references a valid vma. + // + // Furthermore, the returned vma is still under the protection of the read lock guard + // and can be used while the mmap read lock is still held. That the vma is not used + // after the MmapReadGuard gets dropped is enforced by the borrow-checker. + unsafe { Some(virt::VmaRef::from_raw(vma)) } + } + } +} + impl Drop for MmapReadGuard<'_> { #[inline] fn drop(&mut self) { diff --git a/rust/kernel/mm/virt.rs b/rust/kernel/mm/virt.rs new file mode 100644 index 000000000000..a66be649f0b8 --- /dev/null +++ b/rust/kernel/mm/virt.rs @@ -0,0 +1,210 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Copyright (C) 2024 Google LLC. + +//! Virtual memory. +//! +//! This module deals with managing a single VMA in the address space of a userspace process. Each +//! VMA corresponds to a region of memory that the userspace process can access, and the VMA lets +//! you control what happens when userspace reads or writes to that region of memory. +//! +//! The module has several different Rust types that all correspond to the C type called +//! `vm_area_struct`. The different structs represent what kind of access you have to the VMA, e.g. +//! [`VmaRef`] is used when you hold the mmap or vma read lock. Using the appropriate struct +//! ensures that you can't, for example, accidentally call a function that requires holding the +//! write lock when you only hold the read lock. + +use crate::{bindings, mm::MmWithUser, types::Opaque}; + +/// A wrapper for the kernel's `struct vm_area_struct` with read access. +/// +/// It represents an area of virtual memory. +/// +/// # Invariants +/// +/// The caller must hold the mmap read lock or the vma read lock. 
+#[repr(transparent)]
+pub struct VmaRef {
+ vma: Opaque<bindings::vm_area_struct>,
+}
+
+// Methods you can call when holding the mmap or vma read lock (or stronger). They must be usable
+// no matter what the vma flags are.
+impl VmaRef {
+ /// Access a virtual memory area given a raw pointer.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap or vma
+ /// read lock (or stronger) is held for at least the duration of 'a.
+ #[inline]
+ pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
+ // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
+ unsafe { &*vma.cast() }
+ }
+
+ /// Returns a raw pointer to this area.
+ #[inline]
+ pub fn as_ptr(&self) -> *mut bindings::vm_area_struct {
+ self.vma.get()
+ }
+
+ /// Access the underlying `mm_struct`.
+ #[inline]
+ pub fn mm(&self) -> &MmWithUser {
+ // SAFETY: By the type invariants, this `vm_area_struct` is valid and we hold the mmap/vma
+ // read lock or stronger. This implies that the underlying mm has a non-zero value of
+ // `mm_users`.
+ unsafe { MmWithUser::from_raw((*self.as_ptr()).vm_mm) }
+ }
+
+ /// Returns the flags associated with the virtual memory area.
+ ///
+ /// The possible flags are a combination of the constants in [`flags`].
+ #[inline]
+ pub fn flags(&self) -> vm_flags_t {
+ // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+ // access is not a data race.
+ unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags }
+ }
+
+ /// Returns the (inclusive) start address of the virtual memory area.
+ #[inline]
+ pub fn start(&self) -> usize {
+ // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+ // access is not a data race.
+ unsafe { (*self.as_ptr()).__bindgen_anon_1.__bindgen_anon_1.vm_start }
+ }
+
+ /// Returns the (exclusive) end address of the virtual memory area.
+ #[inline]
+ pub fn end(&self) -> usize {
+ // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+ // access is not a data race.
+ unsafe { (*self.as_ptr()).__bindgen_anon_1.__bindgen_anon_1.vm_end }
+ }
+
+ /// Zap pages in the given page range.
+ ///
+ /// This clears page table mappings for the range at the leaf level, leaving all other page
+ /// tables intact, and freeing any memory referenced by the VMA in this range. That is,
+ /// anonymous memory is completely freed, file-backed memory has its reference count on page
+ /// cache folios dropped, and any dirty data will still be written back to disk as usual.
+ ///
+ /// It may seem odd that we clear at the leaf level; this is, however, a product of the page
+ /// table structure used to map physical memory into a virtual address space - each virtual
+ /// address actually consists of a bitmap of array indices into page tables, which form a
+ /// hierarchical page table level structure.
+ ///
+ /// As a result, each page table level maps a multiple of page table levels below, and thus
+ /// spans ever-increasing ranges of pages. At the leaf or PTE level, we map the actual physical
+ /// memory.
+ ///
+ /// It is here where a zap operates, as it is the only place we can be certain of clearing
+ /// without impacting any other virtual mappings. It is an implementation detail as to whether
+ /// the kernel goes further in freeing unused page tables, but for the purposes of this
+ /// operation we must only assume that the leaf level is cleared.
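+ ///
+ /// A minimal usage sketch (illustrative only; it assumes a `&VmaRef` named `vma` already
+ /// obtained under one of the required locks, and uses `PAGE_SIZE` from the kernel crate's
+ /// `page` module):
+ ///
+ /// ```ignore
+ /// // Zap the first page of the VMA; the method itself rejects ranges that
+ /// // fall outside the VMA.
+ /// vma.zap_page_range_single(vma.start(), kernel::page::PAGE_SIZE);
+ /// ```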
+ #[inline] + pub fn zap_page_range_single(&self, address: usize, size: usize) { + let (end, did_overflow) = address.overflowing_add(size); + if did_overflow || address < self.start() || self.end() < end { + // TODO: call WARN_ONCE once Rust version of it is added + return; + } + + // SAFETY: By the type invariants, the caller has read access to this VMA, which is + // sufficient for this method call. This method has no requirements on the vma flags. The + // address range is checked to be within the vma. + unsafe { + bindings::zap_page_range_single(self.as_ptr(), address, size, core::ptr::null_mut()) + }; + } +} + +/// The integer type used for vma flags. +#[doc(inline)] +pub use bindings::vm_flags_t; + +/// All possible flags for [`VmaRef`]. +pub mod flags { + use super::vm_flags_t; + use crate::bindings; + + /// No flags are set. + pub const NONE: vm_flags_t = bindings::VM_NONE as _; + + /// Mapping allows reads. + pub const READ: vm_flags_t = bindings::VM_READ as _; + + /// Mapping allows writes. + pub const WRITE: vm_flags_t = bindings::VM_WRITE as _; + + /// Mapping allows execution. + pub const EXEC: vm_flags_t = bindings::VM_EXEC as _; + + /// Mapping is shared. + pub const SHARED: vm_flags_t = bindings::VM_SHARED as _; + + /// Mapping may be updated to allow reads. + pub const MAYREAD: vm_flags_t = bindings::VM_MAYREAD as _; + + /// Mapping may be updated to allow writes. + pub const MAYWRITE: vm_flags_t = bindings::VM_MAYWRITE as _; + + /// Mapping may be updated to allow execution. + pub const MAYEXEC: vm_flags_t = bindings::VM_MAYEXEC as _; + + /// Mapping may be updated to be shared. + pub const MAYSHARE: vm_flags_t = bindings::VM_MAYSHARE as _; + + /// Page-ranges managed without `struct page`, just pure PFN. + pub const PFNMAP: vm_flags_t = bindings::VM_PFNMAP as _; + + /// Memory mapped I/O or similar. + pub const IO: vm_flags_t = bindings::VM_IO as _; + + /// Do not copy this vma on fork. + pub const DONTCOPY: vm_flags_t = bindings::VM_DONTCOPY as _; + + /// Cannot expand with mremap(). + pub const DONTEXPAND: vm_flags_t = bindings::VM_DONTEXPAND as _; + + /// Lock the pages covered when they are faulted in. + pub const LOCKONFAULT: vm_flags_t = bindings::VM_LOCKONFAULT as _; + + /// Is a VM accounted object. + pub const ACCOUNT: vm_flags_t = bindings::VM_ACCOUNT as _; + + /// Should the VM suppress accounting. + pub const NORESERVE: vm_flags_t = bindings::VM_NORESERVE as _; + + /// Huge TLB Page VM. + pub const HUGETLB: vm_flags_t = bindings::VM_HUGETLB as _; + + /// Synchronous page faults. (DAX-specific) + pub const SYNC: vm_flags_t = bindings::VM_SYNC as _; + + /// Architecture-specific flag. + pub const ARCH_1: vm_flags_t = bindings::VM_ARCH_1 as _; + + /// Wipe VMA contents in child on fork. + pub const WIPEONFORK: vm_flags_t = bindings::VM_WIPEONFORK as _; + + /// Do not include in the core dump. + pub const DONTDUMP: vm_flags_t = bindings::VM_DONTDUMP as _; + + /// Not soft dirty clean area. + pub const SOFTDIRTY: vm_flags_t = bindings::VM_SOFTDIRTY as _; + + /// Can contain `struct page` and pure PFN pages. + pub const MIXEDMAP: vm_flags_t = bindings::VM_MIXEDMAP as _; + + /// MADV_HUGEPAGE marked this vma. + pub const HUGEPAGE: vm_flags_t = bindings::VM_HUGEPAGE as _; + + /// MADV_NOHUGEPAGE marked this vma. + pub const NOHUGEPAGE: vm_flags_t = bindings::VM_NOHUGEPAGE as _; + + /// KSM may merge identical pages. 
+ pub const MERGEABLE: vm_flags_t = bindings::VM_MERGEABLE as _; +} From bf3d331bb80749542cf299f94c471f611fb113b1 Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 8 Apr 2025 09:22:40 +0000 Subject: [PATCH 093/285] mm: rust: add vm_insert_page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The vm_insert_page method is only usable on vmas with the VM_MIXEDMAP flag, so we introduce a new type to keep track of such vmas. The approach used in this patch assumes that we will not need to encode many flag combinations in the type. I don't think we need to encode more than VM_MIXEDMAP and VM_PFNMAP as things are now. However, if that becomes necessary, using generic parameters in a single type would scale better as the number of flags increases. Link: https://lkml.kernel.org/r/20250408-vma-v16-3-d8b446e885d9@google.com Signed-off-by: Alice Ryhl Acked-by: Lorenzo Stoakes Acked-by: Liam R. Howlett Reviewed-by: Andreas Hindborg Reviewed-by: Gary Guo Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Balbir Singh Cc: Benno Lossin Cc: Björn Roy Baron Cc: Boqun Feng Cc: Greg Kroah-Hartman Cc: Jann Horn Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/kernel/mm/virt.rs | 79 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 78 insertions(+), 1 deletion(-) diff --git a/rust/kernel/mm/virt.rs b/rust/kernel/mm/virt.rs index a66be649f0b8..3e2eabcc2145 100644 --- a/rust/kernel/mm/virt.rs +++ b/rust/kernel/mm/virt.rs @@ -14,7 +14,15 @@ //! ensures that you can't, for example, accidentally call a function that requires holding the //! write lock when you only hold the read lock. -use crate::{bindings, mm::MmWithUser, types::Opaque}; +use crate::{ + bindings, + error::{to_result, Result}, + mm::MmWithUser, + page::Page, + types::Opaque, +}; + +use core::ops::Deref; /// A wrapper for the kernel's `struct vm_area_struct` with read access. /// @@ -119,6 +127,75 @@ impl VmaRef { bindings::zap_page_range_single(self.as_ptr(), address, size, core::ptr::null_mut()) }; } + + /// If the [`VM_MIXEDMAP`] flag is set, returns a [`VmaMixedMap`] to this VMA, otherwise + /// returns `None`. + /// + /// This can be used to access methods that require [`VM_MIXEDMAP`] to be set. + /// + /// [`VM_MIXEDMAP`]: flags::MIXEDMAP + #[inline] + pub fn as_mixedmap_vma(&self) -> Option<&VmaMixedMap> { + if self.flags() & flags::MIXEDMAP != 0 { + // SAFETY: We just checked that `VM_MIXEDMAP` is set. All other requirements are + // satisfied by the type invariants of `VmaRef`. + Some(unsafe { VmaMixedMap::from_raw(self.as_ptr()) }) + } else { + None + } + } +} + +/// A wrapper for the kernel's `struct vm_area_struct` with read access and [`VM_MIXEDMAP`] set. +/// +/// It represents an area of virtual memory. +/// +/// This struct is identical to [`VmaRef`] except that it must only be used when the +/// [`VM_MIXEDMAP`] flag is set on the vma. +/// +/// # Invariants +/// +/// The caller must hold the mmap read lock or the vma read lock. The `VM_MIXEDMAP` flag must be +/// set. +/// +/// [`VM_MIXEDMAP`]: flags::MIXEDMAP +#[repr(transparent)] +pub struct VmaMixedMap { + vma: VmaRef, +} + +// Make all `VmaRef` methods available on `VmaMixedMap`. +impl Deref for VmaMixedMap { + type Target = VmaRef; + + #[inline] + fn deref(&self) -> &VmaRef { + &self.vma + } +} + +impl VmaMixedMap { + /// Access a virtual memory area given a raw pointer. 
+ /// + /// # Safety + /// + /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap read lock + /// (or stronger) is held for at least the duration of 'a. The `VM_MIXEDMAP` flag must be set. + #[inline] + pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self { + // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a. + unsafe { &*vma.cast() } + } + + /// Maps a single page at the given address within the virtual memory area. + /// + /// This operation does not take ownership of the page. + #[inline] + pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result { + // SAFETY: By the type invariant of `Self` caller has read access and has verified that + // `VM_MIXEDMAP` is set. By invariant on `Page` the page has order 0. + to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address, page.as_ptr()) }) + } } /// The integer type used for vma flags. From 3105f8f391ce00153c553bfe89efbb30b120d858 Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 8 Apr 2025 09:22:41 +0000 Subject: [PATCH 094/285] mm: rust: add lock_vma_under_rcu MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, the binder driver always uses the mmap lock to make changes to its vma. Because the mmap lock is global to the process, this can involve significant contention. However, the kernel has a feature called per-vma locks, which can significantly reduce contention. For example, you can take a vma lock in parallel with an mmap write lock. This is important because contention on the mmap lock has been a long-term recurring challenge for the Binder driver. This patch introduces support for using `lock_vma_under_rcu` from Rust. The Rust Binder driver will be able to use this to reduce contention on the mmap lock. Link: https://lkml.kernel.org/r/20250408-vma-v16-4-d8b446e885d9@google.com Signed-off-by: Alice Ryhl Acked-by: Lorenzo Stoakes Acked-by: Liam R. Howlett Reviewed-by: Jann Horn Reviewed-by: Andreas Hindborg Reviewed-by: Gary Guo Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Balbir Singh Cc: Benno Lossin Cc: Björn Roy Baron Cc: Boqun Feng Cc: Greg Kroah-Hartman Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/helpers/mm.c | 5 ++++ rust/kernel/mm.rs | 60 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/rust/helpers/mm.c b/rust/helpers/mm.c index 7b72eb065a3e..81b510c96fd2 100644 --- a/rust/helpers/mm.c +++ b/rust/helpers/mm.c @@ -43,3 +43,8 @@ struct vm_area_struct *rust_helper_vma_lookup(struct mm_struct *mm, { return vma_lookup(mm, addr); } + +void rust_helper_vma_end_read(struct vm_area_struct *vma) +{ + vma_end_read(vma); +} diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs index f1689ccb3740..c160fb52603f 100644 --- a/rust/kernel/mm.rs +++ b/rust/kernel/mm.rs @@ -19,6 +19,7 @@ use crate::{ use core::{ops::Deref, ptr::NonNull}; pub mod virt; +use virt::VmaRef; /// A wrapper for the kernel's `struct mm_struct`. /// @@ -161,6 +162,36 @@ impl MmWithUser { unsafe { &*ptr.cast() } } + /// Attempt to access a vma using the vma read lock. + /// + /// This is an optimistic trylock operation, so it may fail if there is contention. In that + /// case, you should fall back to taking the mmap read lock. + /// + /// When per-vma locks are disabled, this always returns `None`. 
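+ ///
+ /// A minimal sketch of the trylock-then-fallback pattern described above (illustrative
+ /// only; it assumes a `&MmWithUser` named `mm` and a hypothetical address `addr`):
+ ///
+ /// ```ignore
+ /// if let Some(vma) = mm.lock_vma_under_rcu(addr) {
+ ///     // Fast path: only the per-vma read lock is taken.
+ ///     let _flags = vma.flags();
+ /// } else {
+ ///     // Slow path: fall back to the mmap read lock.
+ ///     let guard = mm.mmap_read_lock();
+ ///     if let Some(vma) = guard.vma_lookup(addr) {
+ ///         let _flags = vma.flags();
+ ///     }
+ /// }
+ /// ```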
+ #[inline]
+ pub fn lock_vma_under_rcu(&self, vma_addr: usize) -> Option<VmaReadGuard<'_>> {
+ #[cfg(CONFIG_PER_VMA_LOCK)]
+ {
+ // SAFETY: Calling `bindings::lock_vma_under_rcu` is always okay given an mm where
+ // `mm_users` is non-zero.
+ let vma = unsafe { bindings::lock_vma_under_rcu(self.as_raw(), vma_addr) };
+ if !vma.is_null() {
+ return Some(VmaReadGuard {
+ // SAFETY: If `lock_vma_under_rcu` returns a non-null ptr, then it points at a
+ // valid vma. The vma is stable for as long as the vma read lock is held.
+ vma: unsafe { VmaRef::from_raw(vma) },
+ _nts: NotThreadSafe,
+ });
+ }
+ }
+
+ // Silence warnings about unused variables.
+ #[cfg(not(CONFIG_PER_VMA_LOCK))]
+ let _ = vma_addr;
+
+ None
+ }
+
 /// Lock the mmap read lock.
 #[inline]
 pub fn mmap_read_lock(&self) -> MmapReadGuard<'_> {
@@ -231,3 +262,32 @@ impl Drop for MmapReadGuard<'_> {
 unsafe { bindings::mmap_read_unlock(self.mm.as_raw()) };
 }
 }
+
+/// A guard for the vma read lock.
+///
+/// # Invariants
+///
+/// This `VmaReadGuard` guard owns the vma read lock.
+pub struct VmaReadGuard<'a> {
+ vma: &'a VmaRef,
+ // `vma_end_read` must be called on the same thread as where the lock was taken
+ _nts: NotThreadSafe,
+}
+
+// Make all `VmaRef` methods available on `VmaReadGuard`.
+impl Deref for VmaReadGuard<'_> {
+ type Target = VmaRef;
+
+ #[inline]
+ fn deref(&self) -> &VmaRef {
+ self.vma
+ }
+}
+
+impl Drop for VmaReadGuard<'_> {
+ #[inline]
+ fn drop(&mut self) {
+ // SAFETY: We hold the read lock by the type invariants.
+ unsafe { bindings::vma_end_read(self.vma.as_ptr()) };
+ }
+}
From 114ba9b9e8191925438365089c0d300f4f1b6fb1 Mon Sep 17 00:00:00 2001
From: Alice Ryhl
Date: Tue, 8 Apr 2025 09:22:45 +0000
Subject: [PATCH 095/285] mm: rust: add mmput_async support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds an MmWithUserAsync type that uses mmput_async when dropped but is otherwise identical to MmWithUser. This has to be done using a separate type because the thing we are changing is the destructor.

Rust Binder needs this to avoid a certain deadlock. See commit 9a9ab0d96362 ("binder: fix race between mmput() and do_exit()") for details. It's also needed in the shrinker to avoid cleaning up the mm in the shrinker's context.

Link: https://lkml.kernel.org/r/20250408-vma-v16-5-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl
Acked-by: Lorenzo Stoakes
Acked-by: Liam R. Howlett
Reviewed-by: Andreas Hindborg
Reviewed-by: Gary Guo
Cc: Alex Gaynor
Cc: Arnd Bergmann
Cc: Balbir Singh
Cc: Benno Lossin
Cc: Björn Roy Baron
Cc: Boqun Feng
Cc: Greg Kroah-Hartman
Cc: Jann Horn
Cc: John Hubbard
Cc: Matthew Wilcox (Oracle)
Cc: Miguel Ojeda
Cc: Suren Baghdasaryan
Cc: Trevor Gross
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 rust/kernel/mm.rs | 51 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs
index c160fb52603f..615907a0f3b4 100644
--- a/rust/kernel/mm.rs
+++ b/rust/kernel/mm.rs
@@ -111,6 +111,50 @@ impl Deref for MmWithUser {
 }
 }

+/// A wrapper for the kernel's `struct mm_struct`.
+///
+/// This type is identical to `MmWithUser` except that it uses `mmput_async` when dropping a
+/// refcount. This means that the destructor of `ARef<MmWithUserAsync>` is safe to call in atomic
+/// context.
+///
+/// # Invariants
+///
+/// Values of this type are always refcounted using `mmget`. The value of `mm_users` is non-zero.
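+///
+/// A minimal sketch of the intended use (illustrative only; it assumes an existing
+/// `ARef<MmWithUser>` named `mm` and uses the `into_mmput_async` conversion added below):
+///
+/// ```ignore
+/// let mm_async: ARef<MmWithUserAsync> = MmWithUser::into_mmput_async(mm);
+/// // Dropping the reference now calls `mmput_async()` rather than `mmput()`,
+/// // so the drop is safe even in atomic context.
+/// drop(mm_async);
+/// ```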
+#[repr(transparent)]
+pub struct MmWithUserAsync {
+ mm: MmWithUser,
+}
+
+// SAFETY: It is safe to call `mmput_async` on another thread than where `mmget` was called.
+unsafe impl Send for MmWithUserAsync {}
+// SAFETY: All methods on `MmWithUserAsync` can be called in parallel from several threads.
+unsafe impl Sync for MmWithUserAsync {}
+
+// SAFETY: By the type invariants, this type is always refcounted.
+unsafe impl AlwaysRefCounted for MmWithUserAsync {
+ #[inline]
+ fn inc_ref(&self) {
+ // SAFETY: The pointer is valid since self is a reference.
+ unsafe { bindings::mmget(self.as_raw()) };
+ }
+
+ #[inline]
+ unsafe fn dec_ref(obj: NonNull<Self>) {
+ // SAFETY: The caller is giving up their refcount.
+ unsafe { bindings::mmput_async(obj.cast().as_ptr()) };
+ }
+}
+
+// Make all `MmWithUser` methods available on `MmWithUserAsync`.
+impl Deref for MmWithUserAsync {
+ type Target = MmWithUser;
+
+ #[inline]
+ fn deref(&self) -> &MmWithUser {
+ &self.mm
+ }
+}
+
 // These methods are safe to call even if `mm_users` is zero.
 impl Mm {
 /// Returns a raw pointer to the inner `mm_struct`.
@@ -162,6 +206,13 @@ impl MmWithUser {
 unsafe { &*ptr.cast() }
 }

+ /// Use `mmput_async` when dropping this refcount.
+ #[inline]
+ pub fn into_mmput_async(me: ARef<MmWithUser>) -> ARef<MmWithUserAsync> {
+ // SAFETY: The layouts and invariants are compatible.
+ unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
+ }
+
 /// Attempt to access a vma using the vma read lock.
 ///
 /// This is an optimistic trylock operation, so it may fail if there is contention. In that
From dcb81aeab406e417bc0b4cf68de6eb07a1d2e6ce Mon Sep 17 00:00:00 2001
From: Alice Ryhl
Date: Tue, 8 Apr 2025 09:22:43 +0000
Subject: [PATCH 096/285] mm: rust: add VmaNew for f_ops->mmap()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This type will be used when setting up a new vma in an f_ops->mmap() hook. Using a separate type from VmaRef allows us to have a separate set of operations that you are only able to use during the mmap() hook. For example, the VM_MIXEDMAP flag must not be changed after the initial setup that happens during the f_ops->mmap() hook.

To avoid setting invalid flag values, the methods for clearing VM_MAYWRITE and similar involve a check of VM_WRITE, and return an error if VM_WRITE is set. Trying to use `try_clear_maywrite` without checking the return value results in a compilation error because the `Result` type is marked #[must_use].

For now, there's only a method for VM_MIXEDMAP and not VM_PFNMAP. When we add a VM_PFNMAP method, we will need some way to prevent you from setting both VM_MIXEDMAP and VM_PFNMAP on the same vma.

Link: https://lkml.kernel.org/r/20250408-vma-v16-6-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl
Acked-by: Lorenzo Stoakes
Acked-by: Liam R.
Howlett Reviewed-by: Jann Horn Reviewed-by: Andreas Hindborg Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Balbir Singh Cc: Benno Lossin Cc: Björn Roy Baron Cc: Boqun Feng Cc: Gary Guo Cc: Greg Kroah-Hartman Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/kernel/mm/virt.rs | 186 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 185 insertions(+), 1 deletion(-) diff --git a/rust/kernel/mm/virt.rs b/rust/kernel/mm/virt.rs index 3e2eabcc2145..31803674aecc 100644 --- a/rust/kernel/mm/virt.rs +++ b/rust/kernel/mm/virt.rs @@ -16,7 +16,7 @@ use crate::{ bindings, - error::{to_result, Result}, + error::{code::EINVAL, to_result, Result}, mm::MmWithUser, page::Page, types::Opaque, @@ -198,6 +198,190 @@ impl VmaMixedMap { } } +/// A configuration object for setting up a VMA in an `f_ops->mmap()` hook. +/// +/// The `f_ops->mmap()` hook is called when a new VMA is being created, and the hook is able to +/// configure the VMA in various ways to fit the driver that owns it. Using `VmaNew` indicates that +/// you are allowed to perform operations on the VMA that can only be performed before the VMA is +/// fully initialized. +/// +/// # Invariants +/// +/// For the duration of 'a, the referenced vma must be undergoing initialization in an +/// `f_ops->mmap()` hook. +pub struct VmaNew { + vma: VmaRef, +} + +// Make all `VmaRef` methods available on `VmaNew`. +impl Deref for VmaNew { + type Target = VmaRef; + + #[inline] + fn deref(&self) -> &VmaRef { + &self.vma + } +} + +impl VmaNew { + /// Access a virtual memory area given a raw pointer. + /// + /// # Safety + /// + /// Callers must ensure that `vma` is undergoing initial vma setup for the duration of 'a. + #[inline] + pub unsafe fn from_raw<'a>(vma: *mut bindings::vm_area_struct) -> &'a Self { + // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a. + unsafe { &*vma.cast() } + } + + /// Internal method for updating the vma flags. + /// + /// # Safety + /// + /// This must not be used to set the flags to an invalid value. + #[inline] + unsafe fn update_flags(&self, set: vm_flags_t, unset: vm_flags_t) { + let mut flags = self.flags(); + flags |= set; + flags &= !unset; + + // SAFETY: This is not a data race: the vma is undergoing initial setup, so it's not yet + // shared. Additionally, `VmaNew` is `!Sync`, so it cannot be used to write in parallel. + // The caller promises that this does not set the flags to an invalid value. + unsafe { (*self.as_ptr()).__bindgen_anon_2.__vm_flags = flags }; + } + + /// Set the `VM_MIXEDMAP` flag on this vma. + /// + /// This enables the vma to contain both `struct page` and pure PFN pages. Returns a reference + /// that can be used to call `vm_insert_page` on the vma. + #[inline] + pub fn set_mixedmap(&self) -> &VmaMixedMap { + // SAFETY: We don't yet provide a way to set VM_PFNMAP, so this cannot put the flags in an + // invalid state. + unsafe { self.update_flags(flags::MIXEDMAP, 0) }; + + // SAFETY: We just set `VM_MIXEDMAP` on the vma. + unsafe { VmaMixedMap::from_raw(self.vma.as_ptr()) } + } + + /// Set the `VM_IO` flag on this vma. + /// + /// This is used for memory mapped IO and similar. The flag tells other parts of the kernel to + /// avoid looking at the pages. For memory mapped IO this is useful as accesses to the pages + /// could have side effects. + #[inline] + pub fn set_io(&self) { + // SAFETY: Setting the VM_IO flag is always okay. 
+ unsafe { self.update_flags(flags::IO, 0) }; + } + + /// Set the `VM_DONTEXPAND` flag on this vma. + /// + /// This prevents the vma from being expanded with `mremap()`. + #[inline] + pub fn set_dontexpand(&self) { + // SAFETY: Setting the VM_DONTEXPAND flag is always okay. + unsafe { self.update_flags(flags::DONTEXPAND, 0) }; + } + + /// Set the `VM_DONTCOPY` flag on this vma. + /// + /// This prevents the vma from being copied on fork. This option is only permanent if `VM_IO` + /// is set. + #[inline] + pub fn set_dontcopy(&self) { + // SAFETY: Setting the VM_DONTCOPY flag is always okay. + unsafe { self.update_flags(flags::DONTCOPY, 0) }; + } + + /// Set the `VM_DONTDUMP` flag on this vma. + /// + /// This prevents the vma from being included in core dumps. This option is only permanent if + /// `VM_IO` is set. + #[inline] + pub fn set_dontdump(&self) { + // SAFETY: Setting the VM_DONTDUMP flag is always okay. + unsafe { self.update_flags(flags::DONTDUMP, 0) }; + } + + /// Returns whether `VM_READ` is set. + /// + /// This flag indicates whether userspace is mapping this vma as readable. + #[inline] + pub fn readable(&self) -> bool { + (self.flags() & flags::READ) != 0 + } + + /// Try to clear the `VM_MAYREAD` flag, failing if `VM_READ` is set. + /// + /// This flag indicates whether userspace is allowed to make this vma readable with + /// `mprotect()`. + /// + /// Note that this operation is irreversible. Once `VM_MAYREAD` has been cleared, it can never + /// be set again. + #[inline] + pub fn try_clear_mayread(&self) -> Result { + if self.readable() { + return Err(EINVAL); + } + // SAFETY: Clearing `VM_MAYREAD` is okay when `VM_READ` is not set. + unsafe { self.update_flags(0, flags::MAYREAD) }; + Ok(()) + } + + /// Returns whether `VM_WRITE` is set. + /// + /// This flag indicates whether userspace is mapping this vma as writable. + #[inline] + pub fn writable(&self) -> bool { + (self.flags() & flags::WRITE) != 0 + } + + /// Try to clear the `VM_MAYWRITE` flag, failing if `VM_WRITE` is set. + /// + /// This flag indicates whether userspace is allowed to make this vma writable with + /// `mprotect()`. + /// + /// Note that this operation is irreversible. Once `VM_MAYWRITE` has been cleared, it can never + /// be set again. + #[inline] + pub fn try_clear_maywrite(&self) -> Result { + if self.writable() { + return Err(EINVAL); + } + // SAFETY: Clearing `VM_MAYWRITE` is okay when `VM_WRITE` is not set. + unsafe { self.update_flags(0, flags::MAYWRITE) }; + Ok(()) + } + + /// Returns whether `VM_EXEC` is set. + /// + /// This flag indicates whether userspace is mapping this vma as executable. + #[inline] + pub fn executable(&self) -> bool { + (self.flags() & flags::EXEC) != 0 + } + + /// Try to clear the `VM_MAYEXEC` flag, failing if `VM_EXEC` is set. + /// + /// This flag indicates whether userspace is allowed to make this vma executable with + /// `mprotect()`. + /// + /// Note that this operation is irreversible. Once `VM_MAYEXEC` has been cleared, it can never + /// be set again. + #[inline] + pub fn try_clear_mayexec(&self) -> Result { + if self.executable() { + return Err(EINVAL); + } + // SAFETY: Clearing `VM_MAYEXEC` is okay when `VM_EXEC` is not set. + unsafe { self.update_flags(0, flags::MAYEXEC) }; + Ok(()) + } +} + /// The integer type used for vma flags. 
#[doc(inline)]
pub use bindings::vm_flags_t;
From f8c78198816f04619da6eebfc5b67741bff56325 Mon Sep 17 00:00:00 2001
From: Alice Ryhl
Date: Tue, 8 Apr 2025 09:22:44 +0000
Subject: [PATCH 097/285] rust: miscdevice: add mmap support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add the ability to write a file_operations->mmap hook in Rust when using the miscdevice abstraction. The `vma` argument to the `mmap` hook uses the `VmaNew` type from the previous commit; this type provides the correct set of operations for a file_operations->mmap hook.

Link: https://lkml.kernel.org/r/20250408-vma-v16-7-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl
Acked-by: Greg Kroah-Hartman
Acked-by: Lorenzo Stoakes
Acked-by: Liam R. Howlett
Reviewed-by: Andreas Hindborg
Reviewed-by: Gary Guo
Cc: Alex Gaynor
Cc: Arnd Bergmann
Cc: Balbir Singh
Cc: Benno Lossin
Cc: Björn Roy Baron
Cc: Boqun Feng
Cc: Jann Horn
Cc: John Hubbard
Cc: Matthew Wilcox (Oracle)
Cc: Miguel Ojeda
Cc: Suren Baghdasaryan
Cc: Trevor Gross
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 rust/kernel/miscdevice.rs | 45 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs
index fa9ecc42602a..9d9771247c38 100644
--- a/rust/kernel/miscdevice.rs
+++ b/rust/kernel/miscdevice.rs
@@ -14,6 +14,7 @@ use crate::{
 error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR},
 ffi::{c_int, c_long, c_uint, c_ulong},
 fs::File,
+ mm::virt::VmaNew,
 prelude::*,
 seq_file::SeqFile,
 str::CStr,
@@ -119,6 +120,22 @@ pub trait MiscDevice: Sized {
 drop(device);
 }

+ /// Handle for mmap.
+ ///
+ /// This function is invoked when a user space process invokes the `mmap` system call on
+ /// `file`. The function is a callback that is part of the VMA initializer. The kernel will do
+ /// initial setup of the VMA before calling this function. The function can then interact with
+ /// the VMA initialization by calling methods of `vma`. If the function does not return an
+ /// error, the kernel will complete initialization of the VMA according to the properties of
+ /// `vma`.
+ fn mmap(
+ _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>,
+ _file: &File,
+ _vma: &VmaNew,
+ ) -> Result {
+ build_error!(VTABLE_DEFAULT_ERROR)
+ }
+
 /// Handler for ioctls.
 ///
 /// The `cmd` argument is usually manipulated using the utilities in [`kernel::ioctl`].
@@ -223,6 +240,33 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
 0
 }

+ /// # Safety
+ ///
+ /// `file` must be a valid file that is associated with a `MiscDeviceRegistration<T>`.
+ /// `vma` must be a vma that is currently being mmap'ed with this file.
+ unsafe extern "C" fn mmap(
+ file: *mut bindings::file,
+ vma: *mut bindings::vm_area_struct,
+ ) -> c_int {
+ // SAFETY: The mmap call of a file can access the private data.
+ let private = unsafe { (*file).private_data };
+ // SAFETY: This is a Rust Miscdevice, so we call `into_foreign` in `open` and
+ // `from_foreign` in `release`, and `fops_mmap` is guaranteed to be called between those
+ // two operations.
+ let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) };
+ // SAFETY: The caller provides a vma that is undergoing initial VMA setup.
+ let area = unsafe { VmaNew::from_raw(vma) };
+ // SAFETY:
+ // * The file is valid for the duration of this call.
+ // * There is no active fdget_pos region on the file on this thread.
+ let file = unsafe { File::from_raw_file(file) };
+
+ match T::mmap(device, file, area) {
+ Ok(()) => 0,
+ Err(err) => err.to_errno(),
+ }
+ }
+
 /// # Safety
 ///
 /// `file` must be a valid file that is associated with a `MiscDeviceRegistration<T>`.
@@ -291,6 +335,7 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
 const VTABLE: bindings::file_operations = bindings::file_operations {
 open: Some(Self::open),
 release: Some(Self::release),
+ mmap: if T::HAS_MMAP { Some(Self::mmap) } else { None },
 unlocked_ioctl: if T::HAS_IOCTL {
 Some(Self::ioctl)
 } else {
From 6acb75ad7b9e1548ae7f7532312295f74e48c973 Mon Sep 17 00:00:00 2001
From: Alice Ryhl
Date: Tue, 8 Apr 2025 09:22:45 +0000
Subject: [PATCH 098/285] task: rust: rework how current is accessed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a new type called `CurrentTask` that lets you perform various operations that are only safe on the `current` task. Use the new type to provide a way to access the current mm without incrementing its refcount. With this change, you can write stuff such as

    let vma = current!().mm().lock_vma_under_rcu(addr);

without incrementing any refcounts.

This replaces the existing abstractions for accessing the current pid namespace. With the old approach, every field access to current involves both a macro and an unsafe helper function. The new approach simplifies that to a single safe function on the `CurrentTask` type. This makes it less heavy-weight to add additional current accessors in the future.

That said, creating a `CurrentTask` type like the one in this patch requires that we are careful to ensure that it cannot escape the current task or otherwise access things after they are freed. To do this, I declared that it cannot escape the current "task context", where I defined a "task context" as essentially the region in which `current` remains unchanged. So e.g., release_task() or begin_new_exec() would leave the task context. If a userspace thread returns to userspace and later makes another syscall, then I consider the two syscalls to be different task contexts. This allows values stored in that task to be modified between syscalls, even if they're guaranteed to be immutable during a syscall.

Ensuring correctness of `CurrentTask` is slightly tricky if we also want the ability to have a safe `kthread_use_mm()` implementation in Rust. To support that safely, there are two patterns we need to ensure are safe:

    // Case 1: current!() called inside the scope.
    let mm;
    kthread_use_mm(some_mm, || {
        mm = current!().mm();
    });
    drop(some_mm);
    mm.do_something(); // UAF

and:

    // Case 2: current!() called before the scope.
    let mm;
    let task = current!();
    kthread_use_mm(some_mm, || {
        mm = task.mm();
    });
    drop(some_mm);
    mm.do_something(); // UAF

The existing `current!()` abstraction already natively prevents the first case: The `&CurrentTask` would be tied to the inner scope, so the borrow-checker ensures that no reference derived from it can escape the scope.

Fixing the second case is a bit more tricky. The solution is to essentially pretend that the contents of the scope execute on a different thread, which means that only thread-safe types can cross the boundary. Since `CurrentTask` is marked `NotThreadSafe`, attempts to move it to another thread will fail, and this includes our fake pretend thread boundary.

This has the disadvantage that other types that aren't thread-safe for reasons unrelated to `current` also cannot be moved across the `kthread_use_mm()` boundary.
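To make this concrete, a Rust `kthread_use_mm()` abstraction with roughly the following shape would enforce that boundary (a sketch only, with a hypothetical signature; no such abstraction is added in this patch):

    // The `Send` bound is what rejects both cases above: `&CurrentTask` is
    // `NotThreadSafe`, so neither it nor references derived from it can cross
    // into or out of the closure.
    fn kthread_use_mm<R>(mm: &MmWithUser, f: impl FnOnce() -> R + Send) -> R {
        // kthread_use_mm(), run `f`, then kthread_unuse_mm()
        ...
    }

Any type that is not `Send` would be rejected at the same boundary for the same reason.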
I consider this an acceptable tradeoff. Link: https://lkml.kernel.org/r/20250408-vma-v16-8-d8b446e885d9@google.com Signed-off-by: Alice Ryhl Acked-by: Lorenzo Stoakes Acked-by: Liam R. Howlett Reviewed-by: Boqun Feng Reviewed-by: Andreas Hindborg Reviewed-by: Gary Guo Cc: Alex Gaynor Cc: Arnd Bergmann Cc: Balbir Singh Cc: Benno Lossin Cc: Björn Roy Baron Cc: Greg Kroah-Hartman Cc: Jann Horn Cc: John Hubbard Cc: Matthew Wilcox (Oracle) Cc: Miguel Ojeda Cc: Suren Baghdasaryan Cc: Trevor Gross Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- rust/kernel/task.rs | 247 +++++++++++++++++++++++--------------------- 1 file changed, 129 insertions(+), 118 deletions(-) diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs index 9e6f6854948d..927413d85484 100644 --- a/rust/kernel/task.rs +++ b/rust/kernel/task.rs @@ -7,6 +7,7 @@ use crate::{ bindings, ffi::{c_int, c_long, c_uint}, + mm::MmWithUser, pid_namespace::PidNamespace, types::{ARef, NotThreadSafe, Opaque}, }; @@ -33,22 +34,20 @@ pub const TASK_NORMAL: c_uint = bindings::TASK_NORMAL as c_uint; #[macro_export] macro_rules! current { () => { - // SAFETY: Deref + addr-of below create a temporary `TaskRef` that cannot outlive the - // caller. + // SAFETY: This expression creates a temporary value that is dropped at the end of the + // caller's scope. The following mechanisms ensure that the resulting `&CurrentTask` cannot + // leave current task context: + // + // * To return to userspace, the caller must leave the current scope. + // * Operations such as `begin_new_exec()` are necessarily unsafe and the caller of + // `begin_new_exec()` is responsible for safety. + // * Rust abstractions for things such as a `kthread_use_mm()` scope must require the + // closure to be `Send`, so the `NotThreadSafe` field of `CurrentTask` ensures that the + // `&CurrentTask` cannot cross the scope in either direction. unsafe { &*$crate::task::Task::current() } }; } -/// Returns the currently running task's pid namespace. -#[macro_export] -macro_rules! current_pid_ns { - () => { - // SAFETY: Deref + addr-of below create a temporary `PidNamespaceRef` that cannot outlive - // the caller. - unsafe { &*$crate::task::Task::current_pid_ns() } - }; -} - /// Wraps the kernel's `struct task_struct`. /// /// # Invariants @@ -87,7 +86,7 @@ macro_rules! current_pid_ns { /// impl State { /// fn new() -> Self { /// Self { -/// creator: current!().into(), +/// creator: ARef::from(&**current!()), /// index: 0, /// } /// } @@ -107,6 +106,44 @@ unsafe impl Send for Task {} // synchronised by C code (e.g., `signal_pending`). unsafe impl Sync for Task {} +/// Represents the [`Task`] in the `current` global. +/// +/// This type exists to provide more efficient operations that are only valid on the current task. +/// For example, to retrieve the pid-namespace of a task, you must use rcu protection unless it is +/// the current task. +/// +/// # Invariants +/// +/// Each value of this type must only be accessed from the task context it was created within. +/// +/// Of course, every thread is in a different task context, but for the purposes of this invariant, +/// these operations also permanently leave the task context: +/// +/// * Returning to userspace from system call context. +/// * Calling `release_task()`. +/// * Calling `begin_new_exec()` in a binary format loader. +/// +/// Other operations temporarily create a new sub-context: +/// +/// * Calling `kthread_use_mm()` creates a new context, and `kthread_unuse_mm()` returns to the +/// old context. 
+///
+/// This means that a `CurrentTask` obtained before a `kthread_use_mm()` call may be used again
+/// once `kthread_unuse_mm()` is called, but it must not be used between these two calls.
+/// Conversely, a `CurrentTask` obtained between a `kthread_use_mm()`/`kthread_unuse_mm()` pair
+/// must not be used after `kthread_unuse_mm()`.
+#[repr(transparent)]
+pub struct CurrentTask(Task, NotThreadSafe);
+
+// Make all `Task` methods available on `CurrentTask`.
+impl Deref for CurrentTask {
+ type Target = Task;
+ #[inline]
+ fn deref(&self) -> &Task {
+ &self.0
+ }
+}
+
 /// The type of process identifiers (PIDs).
 pub type Pid = bindings::pid_t;
@@ -133,119 +170,29 @@ impl Task {
 ///
 /// # Safety
 ///
- /// Callers must ensure that the returned object doesn't outlive the current task/thread.
- pub unsafe fn current() -> impl Deref<Target = Task> {
- struct TaskRef<'a> {
- task: &'a Task,
- _not_send: NotThreadSafe,
+ /// Callers must ensure that the returned object is only used to access a [`CurrentTask`]
+ /// within the task context that was active when this function was called. For more details,
+ /// see the invariants section for [`CurrentTask`].
+ pub unsafe fn current() -> impl Deref<Target = CurrentTask> {
+ struct TaskRef {
+ task: *const CurrentTask,
 }

- impl Deref for TaskRef<'_> {
- type Target = Task;
+ impl Deref for TaskRef {
+ type Target = CurrentTask;

 fn deref(&self) -> &Self::Target {
- self.task
+ // SAFETY: The returned reference borrows from this `TaskRef`, so it cannot outlive
+ // the `TaskRef`, which the caller of `Task::current()` has promised will not
+ // outlive the task/thread for which `self.task` is the `current` pointer. Thus, it
+ // is okay to return a `CurrentTask` reference here.
+ unsafe { &*self.task }
 }
 }

- let current = Task::current_raw();
 TaskRef {
- // SAFETY: If the current thread is still running, the current task is valid. Given
- // that `TaskRef` is not `Send`, we know it cannot be transferred to another thread
- // (where it could potentially outlive the caller).
- task: unsafe { &*current.cast() },
- _not_send: NotThreadSafe,
- }
- }
-
- /// Returns a PidNamespace reference for the currently executing task's/thread's pid namespace.
- ///
- /// This function can be used to create an unbounded lifetime by e.g., storing the returned
- /// PidNamespace in a global variable which would be a bug. So the recommended way to get the
- /// current task's/thread's pid namespace is to use the [`current_pid_ns`] macro because it is
- /// safe.
- ///
- /// # Safety
- ///
- /// Callers must ensure that the returned object doesn't outlive the current task/thread.
- pub unsafe fn current_pid_ns() -> impl Deref<Target = PidNamespace> {
- struct PidNamespaceRef<'a> {
- task: &'a PidNamespace,
- _not_send: NotThreadSafe,
- }
-
- impl Deref for PidNamespaceRef<'_> {
- type Target = PidNamespace;
-
- fn deref(&self) -> &Self::Target {
- self.task
- }
- }
-
- // The lifetime of `PidNamespace` is bound to `Task` and `struct pid`.
- //
- // The `PidNamespace` of a `Task` doesn't ever change once the `Task` is alive. A
- // `unshare(CLONE_NEWPID)` or `setns(fd_pidns/pidfd, CLONE_NEWPID)` will not have an effect
- // on the calling `Task`'s pid namespace. It will only effect the pid namespace of children
- // created by the calling `Task`. This invariant guarantees that after having acquired a
- // reference to a `Task`'s pid namespace it will remain unchanged.
- //
- // When a task has exited and been reaped `release_task()` will be called. This will set
- // the `PidNamespace` of the task to `NULL`.
So retrieving the `PidNamespace` of a task - // that is dead will return `NULL`. Note, that neither holding the RCU lock nor holding a - // referencing count to - // the `Task` will prevent `release_task()` being called. - // - // In order to retrieve the `PidNamespace` of a `Task` the `task_active_pid_ns()` function - // can be used. There are two cases to consider: - // - // (1) retrieving the `PidNamespace` of the `current` task - // (2) retrieving the `PidNamespace` of a non-`current` task - // - // From system call context retrieving the `PidNamespace` for case (1) is always safe and - // requires neither RCU locking nor a reference count to be held. Retrieving the - // `PidNamespace` after `release_task()` for current will return `NULL` but no codepath - // like that is exposed to Rust. - // - // Retrieving the `PidNamespace` from system call context for (2) requires RCU protection. - // Accessing `PidNamespace` outside of RCU protection requires a reference count that - // must've been acquired while holding the RCU lock. Note that accessing a non-`current` - // task means `NULL` can be returned as the non-`current` task could have already passed - // through `release_task()`. - // - // To retrieve (1) the `current_pid_ns!()` macro should be used which ensure that the - // returned `PidNamespace` cannot outlive the calling scope. The associated - // `current_pid_ns()` function should not be called directly as it could be abused to - // created an unbounded lifetime for `PidNamespace`. The `current_pid_ns!()` macro allows - // Rust to handle the common case of accessing `current`'s `PidNamespace` without RCU - // protection and without having to acquire a reference count. - // - // For (2) the `task_get_pid_ns()` method must be used. This will always acquire a - // reference on `PidNamespace` and will return an `Option` to force the caller to - // explicitly handle the case where `PidNamespace` is `None`, something that tends to be - // forgotten when doing the equivalent operation in `C`. Missing RCU primitives make it - // difficult to perform operations that are otherwise safe without holding a reference - // count as long as RCU protection is guaranteed. But it is not important currently. But we - // do want it in the future. - // - // Note for (2) the required RCU protection around calling `task_active_pid_ns()` - // synchronizes against putting the last reference of the associated `struct pid` of - // `task->thread_pid`. The `struct pid` stored in that field is used to retrieve the - // `PidNamespace` of the caller. When `release_task()` is called `task->thread_pid` will be - // `NULL`ed and `put_pid()` on said `struct pid` will be delayed in `free_pid()` via - // `call_rcu()` allowing everyone with an RCU protected access to the `struct pid` acquired - // from `task->thread_pid` to finish. - // - // SAFETY: The current task's pid namespace is valid as long as the current task is running. - let pidns = unsafe { bindings::task_active_pid_ns(Task::current_raw()) }; - PidNamespaceRef { - // SAFETY: If the current thread is still running, the current task and its associated - // pid namespace are valid. `PidNamespaceRef` is not `Send`, so we know it cannot be - // transferred to another thread (where it could potentially outlive the current - // `Task`). The caller needs to ensure that the PidNamespaceRef doesn't outlive the - // current task/thread. 
- task: unsafe { PidNamespace::from_ptr(pidns) }, - _not_send: NotThreadSafe, + // CAST: The layout of `struct task_struct` and `CurrentTask` is identical. + task: Task::current_raw().cast(), } } @@ -328,6 +275,70 @@ impl Task { } } +impl CurrentTask { + /// Access the address space of the current task. + /// + /// This function does not touch the refcount of the mm. + #[inline] + pub fn mm(&self) -> Option<&MmWithUser> { + // SAFETY: The `mm` field of `current` is not modified from other threads, so reading it is + // not a data race. + let mm = unsafe { (*self.as_ptr()).mm }; + + if mm.is_null() { + return None; + } + + // SAFETY: If `current->mm` is non-null, then it references a valid mm with a non-zero + // value of `mm_users`. Furthermore, the returned `&MmWithUser` borrows from this + // `CurrentTask`, so it cannot escape the scope in which the current pointer was obtained. + // + // This is safe even if `kthread_use_mm()`/`kthread_unuse_mm()` are used. There are two + // relevant cases: + // * If the `&CurrentTask` was created before `kthread_use_mm()`, then it cannot be + // accessed during the `kthread_use_mm()`/`kthread_unuse_mm()` scope due to the + // `NotThreadSafe` field of `CurrentTask`. + // * If the `&CurrentTask` was created within a `kthread_use_mm()`/`kthread_unuse_mm()` + // scope, then the `&CurrentTask` cannot escape that scope, so the returned `&MmWithUser` + // also cannot escape that scope. + // In either case, it's not possible to read `current->mm` and keep using it after the + // scope is ended with `kthread_unuse_mm()`. + Some(unsafe { MmWithUser::from_raw(mm) }) + } + + /// Access the pid namespace of the current task. + /// + /// This function does not touch the refcount of the namespace or use RCU protection. + /// + /// To access the pid namespace of another task, see [`Task::get_pid_ns`]. + #[doc(alias = "task_active_pid_ns")] + #[inline] + pub fn active_pid_ns(&self) -> Option<&PidNamespace> { + // SAFETY: It is safe to call `task_active_pid_ns` without RCU protection when calling it + // on the current task. + let active_ns = unsafe { bindings::task_active_pid_ns(self.as_ptr()) }; + + if active_ns.is_null() { + return None; + } + + // The lifetime of `PidNamespace` is bound to `Task` and `struct pid`. + // + // The `PidNamespace` of a `Task` doesn't ever change once the `Task` is alive. + // + // From system call context retrieving the `PidNamespace` for the current task is always + // safe and requires neither RCU locking nor a reference count to be held. Retrieving the + // `PidNamespace` after `release_task()` for current will return `NULL` but no codepath + // like that is exposed to Rust. + // + // SAFETY: If `current`'s pid ns is non-null, then it references a valid pid ns. + // Furthermore, the returned `&PidNamespace` borrows from this `CurrentTask`, so it cannot + // escape the scope in which the current pointer was obtained, e.g. it cannot live past a + // `release_task()` call. + Some(unsafe { PidNamespace::from_ptr(active_ns) }) + } +} + // SAFETY: The type invariants guarantee that `Task` is always refcounted. 
unsafe impl crate::types::AlwaysRefCounted for Task {
 fn inc_ref(&self) {
From af8251dd457dd5bff101b10f7e83c2cf74001736 Mon Sep 17 00:00:00 2001
From: Alice Ryhl
Date: Tue, 8 Apr 2025 09:22:46 +0000
Subject: [PATCH 099/285] mm: rust: add MEMORY MANAGEMENT [RUST]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We have introduced Rust bindings for core mm abstractions as part of this series, so add an entry in MAINTAINERS to be explicit about who maintains this. Patches are anticipated to be taken through the mm tree as usual with other mm code.

Link: https://rust-for-linux.com/rust-kernel-policy#how-is-rust-introduced-in-a-subsystem
Link: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/
Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl
Acked-by: Lorenzo Stoakes
Acked-by: Liam R. Howlett
Cc: Alex Gaynor
Cc: Andreas Hindborg
Cc: Arnd Bergmann
Cc: Balbir Singh
Cc: Benno Lossin
Cc: Björn Roy Baron
Cc: Boqun Feng
Cc: Gary Guo
Cc: Greg Kroah-Hartman
Cc: Jann Horn
Cc: John Hubbard
Cc: Matthew Wilcox (Oracle)
Cc: Miguel Ojeda
Cc: Suren Baghdasaryan
Cc: Trevor Gross
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 MAINTAINERS | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 88102bb64dfe..9eea95043c12 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15637,6 +15637,19 @@ F: include/uapi/linux/userfaultfd.h
 F: mm/userfaultfd.c
 F: tools/testing/selftests/mm/uffd-*.[ch]

+MEMORY MANAGEMENT - RUST
+M: Alice Ryhl
+R: Lorenzo Stoakes
+R: Liam R. Howlett
+L: linux-mm@kvack.org
+L: rust-for-linux@vger.kernel.org
+S: Maintained
+W: http://www.linux-mm.org
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
+F: rust/helpers/mm.c
+F: rust/kernel/mm.rs
+F: rust/kernel/mm/
+
 MEMORY MAPPING
M: Andrew Morton
M: Liam R. Howlett
From 879bca0a2c4f40b08d09a95a2a0c3c6513060b5c Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes
Date: Tue, 8 Apr 2025 10:29:31 +0100
Subject: [PATCH 100/285] mm/vma: fix incorrectly disallowed anonymous VMA merges

Patch series "fix incorrectly disallowed anonymous VMA merges", v2.

It appears that we have been incorrectly rejecting merge cases for 15 years, apparently by mistake.

Imagine a range of anonymous mapped memory divided into two VMAs like this, with incompatible protection bits:

     RW          RWX
  unfaulted    faulted
|-----------|-----------|
|   prev    |    vma    |
|-----------|-----------|
         mprotect(RW)

Now imagine mprotect()'ing vma so it is RW. This appears as if it should merge, but it does not. Neither does this case, again mprotect()'ing vma RW:

    RWX          RW
  faulted    unfaulted
|-----------|-----------|
|    vma    |   next    |
|-----------|-----------|
  mprotect(RW)

Nor:

     RW          RWX          RW
  unfaulted    faulted    unfaulted
|-----------|-----------|-----------|
|   prev    |    vma    |   next    |
|-----------|-----------|-----------|
         mprotect(RW)

What's going on here?

In commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"), from 2010, Rik van Riel took great care to account for these cases - commenting that '[this is] easily overlooked: when mprotect shifts the boundary, make sure the expanding vma has anon_vma set if the shrinking vma had, to cover any anon pages imported.'

However, commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), introduced a little over a year later, appears to have accidentally disallowed this.
By adjusting the is_mergeable_anon_vma() function to avoid lock contention across large trees of forked anon_vma's, this commit wrongly assumed the VMA being checked (the ostensible merge 'target') should be faulted, that is, have an anon_vma, and thus an anon_vma_chain list established, but only of length 1.

This appears to have been unintentional, as disallowing empty target VMAs like this across the board makes no sense.

We already have logic that accounts for this case, the same logic Rik introduced in 2010, now via dup_anon_vma() (and ultimately anon_vma_clone()), so there is no problem permitting this.

This series fixes this mistake and also ensures that scalability concerns remain addressed by explicitly checking that whatever VMA is being merged has not been forked.

A full set of self tests which reproduce the issue is provided, as well as updates to the userland VMA tests so they correctly assert this behaviour. The self tests additionally assert that scalability concerns are addressed.

This patch (of 3):

anon_vma_chain's were introduced by Rik van Riel in commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"). This patch was introduced in March 2010.

As part of this change, careful attention was paid to the instance of mprotect() causing a VMA merge, with one faulted (i.e. having anon_vma set) and another not:

    /*
     * Easily overlooked: when mprotect shifts the boundary,
     * make sure the expanding vma has anon_vma set if the
     * shrinking vma had, to cover any anon pages imported.
     */

In the modern VMA code, this is handled in dup_anon_vma() (and ultimately anon_vma_clone()).

This case is one of the three configurations of adjacent VMA anon_vma state that we might encounter on merge (where dst is the VMA which will be merged into and src the one being merged into dst):

1. dst->anon_vma, src->anon_vma - These must be equal, no-op.
2. dst->anon_vma, !src->anon_vma - We simply use dst->anon_vma, no-op.
3. !dst->anon_vma, src->anon_vma - The case in question here.

In case 3, the instance addressed here - we duplicate the AVC connections from src and place into dst.

However, in practice, we very often do NOT do this.

This appears to be due to an inadvertent consequence of the change introduced by commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), introduced in May 2011.

This implies that this merge case was functional only for a little over a year, and has since been broken for ~15 years.

Here, lock scalability concerns led to us restricting anonymous merges only to those VMAs with 1 entry in their vma->anon_vma_chain, that is, a VMA that is not connected to any parent process's anon_vma.

The mergeability test looks like this:

    static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
                    struct anon_vma *anon_vma2, struct vm_area_struct *vma)
    {
            if ((!anon_vma1 || !anon_vma2) && (!vma ||
                    !vma->anon_vma || list_is_singular(&vma->anon_vma_chain)))
                    return true;
            return anon_vma1 == anon_vma2;
    }

However, we have a problem here - typically the vma passed here is the destination VMA.

For instance in vma_merge_existing_range() we invoke:

    can_vma_merge_left()
    -> [ check that there is an immediately adjacent prior VMA ]
    -> can_vma_merge_after()
        -> is_mergeable_vma() for general attribute check
        -> is_mergeable_anon_vma([ proposed anon_vma ], prev->anon_vma, prev)

So if we were considering a target unfaulted 'prev':

  unfaulted    faulted
|-----------|-----------|
|   prev    |    vma    |
|-----------|-----------|

This would call is_mergeable_anon_vma(NULL, vma->anon_vma, prev).
The list_is_singular() check for vma->anon_vma_chain, an empty list on fault, would cause this merge to _fail_ even though all else indicates a merge. Equally a simple merge into a next VMA would hit the same problem: faulted unfaulted |-----------|-----------| | vma | next | |-----------|-----------| can_vma_merge_right() -> [ check that there is an immediately adjacent succeeding VMA ] -> can_vma_merge_before() -> is_mergeable_vma() for general attribute check -> is_mergeable_anon_vma([ proposed anon_vma ], next->anon_vma, next) For a 3-way merge, we'd also hit the same problem if it was configured like this for instance: unfaulted faulted unfaulted |-----------|-----------|-----------| | prev | vma | next | |-----------|-----------|-----------| As we'd call can_vma_merge_left() for prev, and can_vma_merge_right() for next, both of which would fail. vma_merge_new_range() (and relatedly, vma_expand()) are not impacted, as the new VMA would never already be faulted (it is a proposed new range). Because we already handle each of the aforementioned merge cases, and can absolutely therefore deal with an existing VMA merge with !dst->anon_vma, src->anon_vma, there is absolutely no reason to disallow this kind of merge. It seems that the intention of this patch is to ensure that, in the instance of merging unfaulted VMAs with faulted ones, we never wish to do so with those with multiple AVCs due to the fact that anon_vma lock's are held across both parent and child anon_vma's (actually, the 'root' parent anon_vma's lock is used). In fact, the original commit alludes to this - "find_mergeable_anon_vma() already considers this case". In find_mergeable_anon_vma() however, we check the anon_vma which will be merged from, if it is set, then we check list_is_singular(vma->anon_vma_chain). So to match this logic, update is_mergeable_anon_vma() to perform this scalability check on the VMA whose anon_vma we ultimately merge into. This matches existing behaviour with forked VMAs, only we no longer wrongly disallow ALL empty target merges. So we both allow merge cases and ensure the scalability check is correctly applied. We may wish to revisit these lock scalability concerns at a later date and ensure they are still valid. Additionally, correct userland VMA tests which were mistakenly not asserting these cases correctly previously to now correctly assert this, and to ensure vmg->anon_vma state is always consistent to account for newly introduced asserts. Link: https://lkml.kernel.org/r/cover.1744104124.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/18c756fc9eaf7ad082a710c91133b8346f8cd9a8.1744104124.git.lorenzo.stoakes@oracle.com Fixes: 965f55dea0e3 ("mmap: avoid merging cloned VMAs") Signed-off-by: Lorenzo Stoakes Reviewed-by: Yeoreum Yun Cc: David Hildenbrand Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Wei Yang Signed-off-by: Andrew Morton --- mm/vma.c | 85 ++++++++++++++++++++++++---------- tools/testing/vma/vma.c | 100 +++++++++++++++++++++------------------- 2 files changed, 113 insertions(+), 72 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c88..8a6c5e835759 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -57,6 +57,22 @@ struct mmap_state { .state = VMA_MERGE_START, \ } +/* + * If, at any point, the VMA had unCoW'd mappings from parents, it will maintain + * more than one anon_vma_chain connecting it to more than one anon_vma. 
A merge + * would mean a wider range of folios sharing the root anon_vma lock, and thus + * potential lock contention; we do not wish to encourage merging such that this + * scales to a problem. + */ +static bool vma_had_uncowed_parents(struct vm_area_struct *vma) +{ + /* + * The list_is_singular() test is to avoid merging VMA cloned from + * parents. This can improve scalability caused by anon_vma lock. + */ + return vma && vma->anon_vma && !list_is_singular(&vma->anon_vma_chain); +} + static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next) { struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev; @@ -82,24 +98,28 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex return true; } -static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, - struct anon_vma *anon_vma2, struct vm_area_struct *vma) +static bool is_mergeable_anon_vma(struct vma_merge_struct *vmg, bool merge_next) { - /* - * The list_is_singular() test is to avoid merging VMA cloned from - * parents. This can improve scalability caused by anon_vma lock. - */ - if ((!anon_vma1 || !anon_vma2) && (!vma || - list_is_singular(&vma->anon_vma_chain))) - return true; - return anon_vma1 == anon_vma2; -} + struct vm_area_struct *tgt = merge_next ? vmg->next : vmg->prev; + struct vm_area_struct *src = vmg->middle; /* existing merge case. */ + struct anon_vma *tgt_anon = tgt->anon_vma; + struct anon_vma *src_anon = vmg->anon_vma; -/* Are the anon_vma's belonging to each VMA compatible with one another? */ -static inline bool are_anon_vmas_compatible(struct vm_area_struct *vma1, - struct vm_area_struct *vma2) -{ - return is_mergeable_anon_vma(vma1->anon_vma, vma2->anon_vma, NULL); + /* + * We _can_ have !src, vmg->anon_vma via copy_vma(). In this instance we + * will remove the existing VMA's anon_vma's so there are no scalability + * concerns. + */ + VM_WARN_ON(src && src_anon != src->anon_vma); + + /* Case 1 - we will dup_anon_vma() from src into tgt. */ + if (!tgt_anon && src_anon) + return !vma_had_uncowed_parents(src); + /* Case 2 - we will simply use tgt's anon_vma. */ + if (tgt_anon && !src_anon) + return !vma_had_uncowed_parents(tgt); + /* Case 3 - the anon_vma's are already shared.
*/ + return src_anon == tgt_anon; } /* @@ -164,7 +184,7 @@ static bool can_vma_merge_before(struct vma_merge_struct *vmg) pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start); if (is_mergeable_vma(vmg, /* merge_next = */ true) && - is_mergeable_anon_vma(vmg->anon_vma, vmg->next->anon_vma, vmg->next)) { + is_mergeable_anon_vma(vmg, /* merge_next = */ true)) { if (vmg->next->vm_pgoff == vmg->pgoff + pglen) return true; } @@ -184,7 +204,7 @@ static bool can_vma_merge_before(struct vma_merge_struct *vmg) static bool can_vma_merge_after(struct vma_merge_struct *vmg) { if (is_mergeable_vma(vmg, /* merge_next = */ false) && - is_mergeable_anon_vma(vmg->anon_vma, vmg->prev->anon_vma, vmg->prev)) { + is_mergeable_anon_vma(vmg, /* merge_next = */ false)) { if (vmg->prev->vm_pgoff + vma_pages(vmg->prev) == vmg->pgoff) return true; } @@ -400,8 +420,10 @@ static bool can_vma_merge_left(struct vma_merge_struct *vmg) static bool can_vma_merge_right(struct vma_merge_struct *vmg, bool can_merge_left) { - if (!vmg->next || vmg->end != vmg->next->vm_start || - !can_vma_merge_before(vmg)) + struct vm_area_struct *next = vmg->next; + struct vm_area_struct *prev; + + if (!next || vmg->end != next->vm_start || !can_vma_merge_before(vmg)) return false; if (!can_merge_left) @@ -414,7 +436,9 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg, * * We therefore check this in addition to mergeability to either side. */ - return are_anon_vmas_compatible(vmg->prev, vmg->next); + prev = vmg->prev; + return !prev->anon_vma || !next->anon_vma || + prev->anon_vma == next->anon_vma; } /* @@ -554,7 +578,9 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, } /* - * dup_anon_vma() - Helper function to duplicate anon_vma + * dup_anon_vma() - Helper function to duplicate anon_vma on VMA merge in the + * instance that the destination VMA has no anon_vma but the source does. + * * @dst: The destination VMA * @src: The source VMA * @dup: Pointer to the destination VMA when successful. @@ -565,9 +591,18 @@ static int dup_anon_vma(struct vm_area_struct *dst, struct vm_area_struct *src, struct vm_area_struct **dup) { /* - * Easily overlooked: when mprotect shifts the boundary, make sure the - * expanding vma has anon_vma set if the shrinking vma had, to cover any - * anon pages imported. + * There are three cases to consider for correctly propagating + * anon_vma's on merge. + * + * The first is trivial - neither VMA has anon_vma, we need not do + * anything. + * + * The second where both have anon_vma is also a no-op, as they must + * then be the same, so there is simply nothing to copy. + * + * Here we cover the third - if the destination VMA has no anon_vma, + * that is, it is unfaulted, we need to ensure that the newly merged + * range is referenced by the anon_vma's of the source. */ if (src->anon_vma && !dst->anon_vma) { int ret; diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 11f761769b5b..7cfd6e31db10 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -185,6 +185,15 @@ static void vmg_set_range(struct vma_merge_struct *vmg, unsigned long start, vmg->__adjust_next_start = false; } +/* Helper function to set both the VMG range and its anon_vma. */ +static void vmg_set_range_anon_vma(struct vma_merge_struct *vmg, unsigned long start, + unsigned long end, pgoff_t pgoff, vm_flags_t flags, + struct anon_vma *anon_vma) +{ + vmg_set_range(vmg, start, end, pgoff, flags); + vmg->anon_vma = anon_vma; +} + /* * Helper function to try to merge a new VMA.
* @@ -265,6 +274,22 @@ static void dummy_close(struct vm_area_struct *) { } +static void __vma_set_dummy_anon_vma(struct vm_area_struct *vma, + struct anon_vma_chain *avc, + struct anon_vma *anon_vma) +{ + vma->anon_vma = anon_vma; + INIT_LIST_HEAD(&vma->anon_vma_chain); + list_add(&avc->same_vma, &vma->anon_vma_chain); + avc->anon_vma = vma->anon_vma; +} + +static void vma_set_dummy_anon_vma(struct vm_area_struct *vma, + struct anon_vma_chain *avc) +{ + __vma_set_dummy_anon_vma(vma, avc, &dummy_anon_vma); +} + static bool test_simple_merge(void) { struct vm_area_struct *vma; @@ -953,6 +978,7 @@ static bool test_merge_existing(void) const struct vm_operations_struct vm_ops = { .close = dummy_close, }; + struct anon_vma_chain avc = {}; /* * Merge right case - partial span. @@ -968,10 +994,10 @@ static bool test_merge_existing(void) vma->vm_ops = &vm_ops; /* This should have no impact. */ vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, flags, &dummy_anon_vma); vmg.middle = vma; vmg.prev = vma; - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_next); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_next->vm_start, 0x3000); @@ -1001,9 +1027,9 @@ static bool test_merge_existing(void) vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range(&vmg, 0x2000, 0x6000, 2, flags); + vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, flags, &dummy_anon_vma); vmg.middle = vma; - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_next); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_next->vm_start, 0x2000); @@ -1030,11 +1056,10 @@ static bool test_merge_existing(void) vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; - vma->anon_vma = &dummy_anon_vma; - + vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); @@ -1064,10 +1089,10 @@ static bool test_merge_existing(void) vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); - vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); @@ -1094,10 +1119,10 @@ static bool test_merge_existing(void) vma_prev->vm_ops = &vm_ops; /* This should have no impact. 
*/ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); - vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &avc); ASSERT_EQ(merge_existing(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); @@ -1180,12 +1205,9 @@ static bool test_anon_vma_non_mergeable(void) .mm = &mm, .vmi = &vmi, }; - struct anon_vma_chain dummy_anon_vma_chain1 = { - .anon_vma = &dummy_anon_vma, - }; - struct anon_vma_chain dummy_anon_vma_chain2 = { - .anon_vma = &dummy_anon_vma, - }; + struct anon_vma_chain dummy_anon_vma_chain_1 = {}; + struct anon_vma_chain dummy_anon_vma_chain_2 = {}; + struct anon_vma dummy_anon_vma_2; /* * In the case of modified VMA merge, merging both left and right VMAs @@ -1209,24 +1231,11 @@ static bool test_anon_vma_non_mergeable(void) * * However, when prev is compared to next, the merge should fail. */ - - INIT_LIST_HEAD(&vma_prev->anon_vma_chain); - list_add(&dummy_anon_vma_chain1.same_vma, &vma_prev->anon_vma_chain); - ASSERT_TRUE(list_is_singular(&vma_prev->anon_vma_chain)); - vma_prev->anon_vma = &dummy_anon_vma; - ASSERT_TRUE(is_mergeable_anon_vma(NULL, vma_prev->anon_vma, vma_prev)); - - INIT_LIST_HEAD(&vma_next->anon_vma_chain); - list_add(&dummy_anon_vma_chain2.same_vma, &vma_next->anon_vma_chain); - ASSERT_TRUE(list_is_singular(&vma_next->anon_vma_chain)); - vma_next->anon_vma = (struct anon_vma *)2; - ASSERT_TRUE(is_mergeable_anon_vma(NULL, vma_next->anon_vma, vma_next)); - - ASSERT_FALSE(is_mergeable_anon_vma(vma_prev->anon_vma, vma_next->anon_vma, NULL)); - - vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, NULL); vmg.prev = vma_prev; vmg.middle = vma; + vma_set_dummy_anon_vma(vma_prev, &dummy_anon_vma_chain_1); + __vma_set_dummy_anon_vma(vma_next, &dummy_anon_vma_chain_2, &dummy_anon_vma_2); ASSERT_EQ(merge_existing(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); @@ -1253,17 +1262,12 @@ static bool test_anon_vma_non_mergeable(void) vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); - INIT_LIST_HEAD(&vma_prev->anon_vma_chain); - list_add(&dummy_anon_vma_chain1.same_vma, &vma_prev->anon_vma_chain); - vma_prev->anon_vma = (struct anon_vma *)1; - - INIT_LIST_HEAD(&vma_next->anon_vma_chain); - list_add(&dummy_anon_vma_chain2.same_vma, &vma_next->anon_vma_chain); - vma_next->anon_vma = (struct anon_vma *)2; - - vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, flags, NULL); vmg.prev = vma_prev; + vma_set_dummy_anon_vma(vma_prev, &dummy_anon_vma_chain_1); + __vma_set_dummy_anon_vma(vma_next, &dummy_anon_vma_chain_2, &dummy_anon_vma_2); + vmg.anon_vma = NULL; ASSERT_EQ(merge_new(&vmg), vma_prev); ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); @@ -1363,8 +1367,8 @@ static bool test_dup_anon_vma(void) vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); - - vma->anon_vma = &dummy_anon_vma; + vmg.anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -1392,7 +1396,7 @@ 
static bool test_dup_anon_vma(void) vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, flags); - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); vmg.prev = vma_prev; vmg.middle = vma; @@ -1420,7 +1424,7 @@ static bool test_dup_anon_vma(void) vma = alloc_and_link_vma(&mm, 0, 0x5000, 0, flags); vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); - vma->anon_vma = &dummy_anon_vma; + vma_set_dummy_anon_vma(vma, &dummy_anon_vma_chain); vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); vmg.prev = vma; vmg.middle = vma; @@ -1447,6 +1451,7 @@ static bool test_vmi_prealloc_fail(void) .mm = &mm, .vmi = &vmi, }; + struct anon_vma_chain avc = {}; struct vm_area_struct *vma_prev, *vma; /* @@ -1459,9 +1464,10 @@ static bool test_vmi_prealloc_fail(void) vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); vma->anon_vma = &dummy_anon_vma; - vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg_set_range_anon_vma(&vmg, 0x3000, 0x5000, 3, flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; + vma_set_dummy_anon_vma(vma, &avc); fail_prealloc = true; From bd23f293a0d56fd18d51e5364ecf8c277c6e9531 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 8 Apr 2025 10:29:32 +0100 Subject: [PATCH 101/285] tools/testing: add PROCMAP_QUERY helper functions in mm self tests The PROCMAP_QUERY ioctl() is very useful - it allows for binary access to /proc/$pid/[s]maps data and thus convenient lookup of data contained there. This patch exposes this for use by the mm self tests so the state of VMAs can easily be queried. Link: https://lkml.kernel.org/r/ce83d877093d1fc594762cf4b82f0c27963030ee.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Yeoreum Yun Reviewed-by: Wei Yang Cc: David Hildenbrand Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/vm_util.c | 62 ++++++++++++++++++++++ tools/testing/selftests/mm/vm_util.h | 21 ++++++++++ 2 files changed, 83 insertions(+) diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index a36734fb62f3..1357e2d6a7b6 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include #include #include @@ -424,3 +425,64 @@ bool check_vmflag_io(void *addr) flags += flaglen; } } + +/* + * Open an fd at /proc/$pid/maps and configure procmap_out ready for + * PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise. + */ +int open_procmap(pid_t pid, struct procmap_fd *procmap_out) +{ + char path[256]; + int ret = 0; + + memset(procmap_out, '\0', sizeof(*procmap_out)); + sprintf(path, "/proc/%d/maps", pid); + procmap_out->query.size = sizeof(procmap_out->query); + procmap_out->fd = open(path, O_RDONLY); + if (procmap_out->fd < 0) + ret = -errno; + + return ret; +} + +/* Perform PROCMAP_QUERY. Returns 0 on success, or an error code otherwise. */ +int query_procmap(struct procmap_fd *procmap) +{ + int ret = 0; + + if (ioctl(procmap->fd, PROCMAP_QUERY, &procmap->query) == -1) + ret = -errno; + + return ret; +} + +/* + * Try to find the VMA at the specified address; returns true if found, false if not found, and the test is failed if any other error occurs.
+ * + * On success, procmap->query is populated with the results. + */ +bool find_vma_procmap(struct procmap_fd *procmap, void *address) +{ + int err; + + procmap->query.query_flags = 0; + procmap->query.query_addr = (unsigned long)address; + err = query_procmap(procmap); + if (!err) + return true; + + if (err != -ENOENT) + ksft_exit_fail_msg("%s: Error %d on ioctl(PROCMAP_QUERY)\n", + __func__, err); + return false; +} + +/* + * Close fd used by PROCMAP_QUERY mechanism. Returns 0 on success, or an error + * code otherwise. + */ +int close_procmap(struct procmap_fd *procmap) +{ + return close(procmap->fd); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 6effafdc4d8a..9211ba640d9c 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -6,6 +6,7 @@ #include /* ffsl() */ #include /* _SC_PAGESIZE */ #include "../kselftest.h" +#include #define BIT_ULL(nr) (1ULL << (nr)) #define PM_SOFT_DIRTY BIT_ULL(55) @@ -19,6 +20,15 @@ extern unsigned int __page_size; extern unsigned int __page_shift; +/* + * Represents an open fd and PROCMAP_QUERY state for binary (via ioctl) + * /proc/$pid/[s]maps lookup. + */ +struct procmap_fd { + int fd; + struct procmap_query query; +}; + static inline unsigned int psize(void) { if (!__page_size) @@ -73,6 +83,17 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, bool miss, bool wp, bool minor, uint64_t *ioctls); unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); +int open_procmap(pid_t pid, struct procmap_fd *procmap_out); +int query_procmap(struct procmap_fd *procmap); +bool find_vma_procmap(struct procmap_fd *procmap, void *address); +int close_procmap(struct procmap_fd *procmap); + +static inline int open_self_procmap(struct procmap_fd *procmap_out) +{ + pid_t pid = getpid(); + + return open_procmap(pid, procmap_out); +} /* * On ppc64 this will only work with radix 2M hugepage size From 10d288964d48e52354f002f7e0f64d1df9496ab1 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 8 Apr 2025 10:29:33 +0100 Subject: [PATCH 102/285] tools/testing/selftests: assert that anon merge cases behave as expected Prior to the recently applied commit that permits this merge, mprotect()'ing a faulted VMA adjacent to an unfaulted VMA, such that the two share characteristics, would fail to produce a merge due to what appear to be unintended consequences of commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"). Now that we have fixed this bug, assert that we can indeed merge anonymous VMAs this way. Also assert that forked source/target VMAs are equally rejected. Previously, all empty target anon merges with one VMA faulted and the other unfaulted would be rejected incorrectly; now we ensure that unforked VMAs merge, but forked ones do not. Additionally, add the new test file to the MEMORY MAPPING section in MAINTAINERS, as these tests are explicitly memory mapping related.
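For illustration, the basic assertion pattern these tests build on the PROCMAP_QUERY helpers added by the previous patch is roughly sketched below; ptr and size stand in for whatever mapping a given test establishes, and the real tests keep the procmap_fd open in fixture state rather than opening it per assertion:

	struct procmap_fd procmap;

	/* Open /proc/self/maps for PROCMAP_QUERY lookups. */
	ASSERT_EQ(open_self_procmap(&procmap), 0);
	/* Look up the VMA containing ptr... */
	ASSERT_TRUE(find_vma_procmap(&procmap, ptr));
	/* ...and assert it spans exactly the expected (merged) range. */
	ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);
	ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + size);
	ASSERT_EQ(close_procmap(&procmap), 0);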
Link: https://lkml.kernel.org/r/2b69330274a3b71721f7042c5eabe91143934415.1744104124.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Yeoreum Yun Cc: David Hildenbrand Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Rik van Riel Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Wei Yang Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/merge.c | 455 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 4 + 5 files changed, 462 insertions(+) create mode 100644 tools/testing/selftests/mm/merge.c diff --git a/MAINTAINERS b/MAINTAINERS index 9eea95043c12..022a19cb9aa0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15670,6 +15670,7 @@ F: mm/mseal.c F: mm/vma.c F: mm/vma.h F: mm/vma_internal.h +F: tools/testing/selftests/mm/merge.c F: tools/testing/vma/ MEMORY MAPPING - LOCKING diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index c5241b193db8..91db34941a14 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -58,3 +58,4 @@ hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 guard-regions +merge diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 8270895039d1..ad4d6043a60f 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -98,6 +98,7 @@ TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable TEST_GEN_FILES += guard-regions +TEST_GEN_FILES += merge ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c new file mode 100644 index 000000000000..c76646cdf6e6 --- /dev/null +++ b/tools/testing/selftests/mm/merge.c @@ -0,0 +1,455 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include +#include +#include +#include +#include +#include "vm_util.h" + +FIXTURE(merge) +{ + unsigned int page_size; + char *carveout; + struct procmap_fd procmap; +}; + +FIXTURE_SETUP(merge) +{ + self->page_size = psize(); + /* Carve out PROT_NONE region to map over. */ + self->carveout = mmap(NULL, 12 * self->page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(self->carveout, MAP_FAILED); + /* Setup PROCMAP_QUERY interface. */ + ASSERT_EQ(open_self_procmap(&self->procmap), 0); +} + +FIXTURE_TEARDOWN(merge) +{ + ASSERT_EQ(munmap(self->carveout, 12 * self->page_size), 0); + ASSERT_EQ(close_procmap(&self->procmap), 0); +} + +TEST_F(merge, mprotect_unfaulted_left) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr; + + /* + * Map 10 pages of R/W memory within. MAP_NORESERVE so we don't hit + * merge failure due to lack of VM_ACCOUNT flag by mistake. 
+ * + * |-----------------------| + * | unfaulted | + * |-----------------------| + */ + ptr = mmap(&carveout[page_size], 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* + * Now make the first 5 pages read-only, splitting the VMA: + * + * RO RW + * |-----------|-----------| + * | unfaulted | unfaulted | + * |-----------|-----------| + */ + ASSERT_EQ(mprotect(ptr, 5 * page_size, PROT_READ), 0); + /* + * Fault in the first of the last 5 pages so it gets an anon_vma and + * thus the whole VMA becomes 'faulted': + * + * RO RW + * |-----------|-----------| + * | unfaulted | faulted | + * |-----------|-----------| + */ + ptr[5 * page_size] = 'x'; + /* + * Now mprotect() the RW region read-only, we should merge (though for + * ~15 years we did not! :): + * + * RO + * |-----------------------| + * | faulted | + * |-----------------------| + */ + ASSERT_EQ(mprotect(&ptr[5 * page_size], 5 * page_size, PROT_READ), 0); + + /* Assert that the merge succeeded using PROCMAP_QUERY. */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); +} + +TEST_F(merge, mprotect_unfaulted_right) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr; + + /* + * |-----------------------| + * | unfaulted | + * |-----------------------| + */ + ptr = mmap(&carveout[page_size], 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* + * Now make the last 5 pages read-only, splitting the VMA: + * + * RW RO + * |-----------|-----------| + * | unfaulted | unfaulted | + * |-----------|-----------| + */ + ASSERT_EQ(mprotect(&ptr[5 * page_size], 5 * page_size, PROT_READ), 0); + /* + * Fault in the first of the first 5 pages so it gets an anon_vma and + * thus the whole VMA becomes 'faulted': + * + * RW RO + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + */ + ptr[0] = 'x'; + /* + * Now mprotect() the RW region read-only, we should merge: + * + * RO + * |-----------------------| + * | faulted | + * |-----------------------| + */ + ASSERT_EQ(mprotect(ptr, 5 * page_size, PROT_READ), 0); + + /* Assert that the merge succeeded using PROCMAP_QUERY. 
*/ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); +} + +TEST_F(merge, mprotect_unfaulted_both) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr; + + /* + * |-----------------------| + * | unfaulted | + * |-----------------------| + */ + ptr = mmap(&carveout[2 * page_size], 9 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* + * Now make the first and last 3 pages read-only, splitting the VMA: + * + * RO RW RO + * |-----------|-----------|-----------| + * | unfaulted | unfaulted | unfaulted | + * |-----------|-----------|-----------| + */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + ASSERT_EQ(mprotect(&ptr[6 * page_size], 3 * page_size, PROT_READ), 0); + /* + * Fault in the first of the middle 3 pages so it gets an anon_vma and + * thus the whole VMA becomes 'faulted': + * + * RO RW RO + * |-----------|-----------|-----------| + * | unfaulted | faulted | unfaulted | + * |-----------|-----------|-----------| + */ + ptr[3 * page_size] = 'x'; + /* + * Now mprotect() the RW region read-only, we should merge: + * + * RO + * |-----------------------| + * | faulted | + * |-----------------------| + */ + ASSERT_EQ(mprotect(&ptr[3 * page_size], 3 * page_size, PROT_READ), 0); + + /* Assert that the merge succeeded using PROCMAP_QUERY. */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 9 * page_size); +} + +TEST_F(merge, mprotect_faulted_left_unfaulted_right) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr; + + /* + * |-----------------------| + * | unfaulted | + * |-----------------------| + */ + ptr = mmap(&carveout[2 * page_size], 9 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* + * Now make the last 3 pages read-only, splitting the VMA: + * + * RW RO + * |-----------------------|-----------| + * | unfaulted | unfaulted | + * |-----------------------|-----------| + */ + ASSERT_EQ(mprotect(&ptr[6 * page_size], 3 * page_size, PROT_READ), 0); + /* + * Fault in the first of the first 6 pages so it gets an anon_vma and + * thus the whole VMA becomes 'faulted': + * + * RW RO + * |-----------------------|-----------| + * | faulted | unfaulted | + * |-----------------------|-----------| + */ + ptr[0] = 'x'; + /* + * Now make the first 3 pages read-only, splitting the VMA: + * + * RO RW RO + * |-----------|-----------|-----------| + * | faulted | faulted | unfaulted | + * |-----------|-----------|-----------| + */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + /* + * Now mprotect() the RW region read-only, we should merge: + * + * RO + * |-----------------------| + * | faulted | + * |-----------------------| + */ + ASSERT_EQ(mprotect(&ptr[3 * page_size], 3 * page_size, PROT_READ), 0); + + /* Assert that the merge succeeded using PROCMAP_QUERY.
*/ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 9 * page_size); +} + +TEST_F(merge, mprotect_unfaulted_left_faulted_right) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr; + + /* + * |-----------------------| + * | unfaulted | + * |-----------------------| + */ + ptr = mmap(&carveout[2 * page_size], 9 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* + * Now make the first 3 pages read-only, splitting the VMA: + * + * RO RW + * |-----------|-----------------------| + * | unfaulted | unfaulted | + * |-----------|-----------------------| + */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + /* + * Fault in the first of the last 6 pages so it gets an anon_vma and + * thus the whole VMA becomes 'faulted': + * + * RO RW + * |-----------|-----------------------| + * | unfaulted | faulted | + * |-----------|-----------------------| + */ + ptr[3 * page_size] = 'x'; + /* + * Now make the last 3 pages read-only, splitting the VMA: + * + * RO RW RO + * |-----------|-----------|-----------| + * | unfaulted | faulted | faulted | + * |-----------|-----------|-----------| + */ + ASSERT_EQ(mprotect(&ptr[6 * page_size], 3 * page_size, PROT_READ), 0); + /* + * Now mprotect() the RW region read-only, we should merge: + * + * RO + * |-----------------------| + * | faulted | + * |-----------------------| + */ + ASSERT_EQ(mprotect(&ptr[3 * page_size], 3 * page_size, PROT_READ), 0); + + /* Assert that the merge succeeded using PROCMAP_QUERY. */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 9 * page_size); +} + +TEST_F(merge, forked_target_vma) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + pid_t pid; + char *ptr, *ptr2; + int i; + + /* + * |-----------| + * | unfaulted | + * |-----------| + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in process. + * + * |-----------| + * | faulted | + * |-----------| + */ + ptr[0] = 'x'; + + pid = fork(); + ASSERT_NE(pid, -1); + + if (pid != 0) { + wait(NULL); + return; + } + + /* Child process below: */ + + /* Reopen for child. */ + ASSERT_EQ(close_procmap(&self->procmap), 0); + ASSERT_EQ(open_self_procmap(&self->procmap), 0); + + /* unCOWing everything does not cause the AVC to go away. */ + for (i = 0; i < 5 * page_size; i += page_size) + ptr[i] = 'x'; + + /* + * Map in adjacent VMA in child. + * + * forked + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + * ptr ptr2 + */ + ptr2 = mmap(&ptr[5 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Make sure not merged. 
*/ + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 5 * page_size); +} + +TEST_F(merge, forked_source_vma) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + pid_t pid; + char *ptr, *ptr2; + int i; + + /* + * |-----------|------------| + * | unfaulted | | + * |-----------|------------| + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in process. + * + * |-----------|------------| + * | faulted | | + * |-----------|------------| + */ + ptr[0] = 'x'; + + pid = fork(); + ASSERT_NE(pid, -1); + + if (pid != 0) { + wait(NULL); + return; + } + + /* Child process below: */ + + /* Reopen for child. */ + ASSERT_EQ(close_procmap(&self->procmap), 0); + ASSERT_EQ(open_self_procmap(&self->procmap), 0); + + /* unCOWing everything does not cause the AVC to go away. */ + for (i = 0; i < 5 * page_size; i += page_size) + ptr[i] = 'x'; + + /* + * Map in adjacent VMA in child, ptr2 after ptr, but incompatible. + * + * forked RW RWX + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + * ptr ptr2 + */ + ptr2 = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_NORESERVE, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Make sure not merged. */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 5 * page_size); + + /* + * Now mprotect forked region to RWX so it becomes the source for the + * merge to unfaulted region: + * + * forked RWX RWX + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + * ptr ptr2 + * + * This should NOT result in a merge, as ptr was forked. + */ + ASSERT_EQ(mprotect(ptr, 5 * page_size, PROT_READ | PROT_WRITE | PROT_EXEC), 0); + /* Again, make sure not merged. */ + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 5 * page_size); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 9aff33b10999..188b125bf1f6 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -79,6 +79,8 @@ separated by spaces: test prctl(PR_SET_MDWE, ...) - page_frag test handling of page fragment allocation and freeing +- vma_merge + test VMA merge cases behave as expected example: ./run_vmtests.sh -t "hmm mmap ksm" EOF @@ -421,6 +423,8 @@ CATEGORY="madv_guard" run_test ./guard-regions # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate +CATEGORY="vma_merge" run_test ./merge + if [ -x ./memfd_secret ] then (echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix From 10d483f198cf35504fcffc677ac116190839a34f Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Wed, 9 Apr 2025 17:38:58 +0800 Subject: [PATCH 103/285] mm: huge_memory: add folio_mark_accessed() when zapping file THP When investigating performance issues during file folio unmap, I noticed some behavioral differences in handling non-PMD-sized folios and PMD-sized folios. 
For non-PMD-sized file folios, it will call folio_mark_accessed() to mark the folio as having seen activity, but this is not done for PMD-sized folios. This might not cause obvious issues, but a potential problem could be that it might lead to reclaim of hot file folios under memory pressure, as quoted from Johannes: : Sometimes file contents are only accessed through relatively short-lived : mappings. But they can nevertheless be accessed a lot and be hot. It's : important to not lose that information on unmap, and end up kicking out a : frequently used cache page. Therefore, we should also add folio_mark_accessed() for PMD-sized file folios when unmapping. [baolin.wang@linux.alibaba.com: add comment] Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Acked-by: Johannes Weiner Acked-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/huge_memory.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1cd975503131..5576a08a593d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2259,6 +2259,14 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PMD_NR); + + /* + * Use flush_needed to indicate whether the PMD entry + * is present, instead of checking pmd_present() again. + */ + if (flush_needed && pmd_young(orig_pmd) && + likely(vma_has_recency(vma))) + folio_mark_accessed(folio); } spin_unlock(ptl); From 066c770437835d2bd2072bd2c88a71fcbbd5ccb3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 9 Apr 2025 17:00:19 -0700 Subject: [PATCH 104/285] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE", v3. When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE with multiple address ranges, tlb flushes happen for each of the given address ranges. Because such tlb flushes are for the same process, doing those in a batch is more efficient while still being safe. Modify the process_madvise() entry level code path to do such batched tlb flushes, while the internal unmap logic does only the gathering of the tlb entries to flush. In more detail, modify the entry functions to initialize an mmu_gather object and pass it to the internal logic. And make the internal logic do only gathering of the tlb entries to flush into the received mmu_gather object. After all internal function calls are done, the entry functions flush the gathered tlb entries at once. Because process_madvise() and madvise() share the internal unmap logic, make the same change to the madvise() entry code as well, to make the code consistent and cleaner. It is only for keeping the code clean, and shouldn't degrade madvise(). It could rather provide a potential tlb flush reduction benefit for a case where there are multiple vmas for the given address range. It is only a side effect from an effort to keep the code clean, so we don't measure it separately. Similar optimizations might be applicable to other madvise behaviors such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this patch series, though.
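For reference, the call pattern this series optimizes looks roughly as below - one process_madvise() call covering many address ranges, which previously incurred one tlb flush per range and with this series incurs a single batched flush. This is a minimal userspace sketch, not part of the series: the pidfd, addresses, and lengths are hypothetical placeholders, and it assumes a libc that provides the process_madvise() wrapper (glibc >= 2.36; otherwise syscall(SYS_process_madvise, ...) would be needed):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/uio.h>

	/* pidfd, addr1/addr2 and len1/len2 are hypothetical placeholders. */
	static ssize_t drop_ranges(int pidfd, void *addr1, size_t len1,
				   void *addr2, size_t len2)
	{
		struct iovec ranges[2] = {
			{ .iov_base = addr1, .iov_len = len1 },
			{ .iov_base = addr2, .iov_len = len2 },
		};

		/* One syscall, many ranges: the kernel can batch the tlb flush. */
		return process_madvise(pidfd, ranges, 2, MADV_DONTNEED, 0);
	}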
Patches Sequence ================ The first patch defines a new data structure for managing information that is required for batched tlb flushes (mmu_gather and behavior), and updates the code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logic to receive it. The second patch batches tlb flushes for MADV_FREE handling for both madvise() and process_madvise(). The remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching. The third patch splits zap_page_range_single() for batching of MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for the hint using the sub-logic that the third patch split out, and the helpers for batched tlb flushes that were introduced for the MADV_FREE case by the second patch. Test Results ============ I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory using multiple process_madvise() calls. I apply the advice at 4 KiB region granularity, but with varying batch size per process_madvise() call (vlen) from 1 to 1024. The source code for the measurement is available at GitHub[1]. To reduce measurement errors, I did the measurement five times. The measurement results are as below. 'sz_batch' column shows the batch size of process_madvise() calls. 'Before' and 'After' columns show the average of latencies in nanoseconds, measured five times on kernels built without and with the tlb flushes batching of this series (patches 3 and 4), respectively. For the baseline, the mm-new tree of 2025-04-09[2] has been used, after reverting the second version of this patch series and adding a temporary fix for a !CONFIG_DEBUG_VM build failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency measurements standard deviation to average in percent for 'Before' and 'After', respectively. 'Latency_reduction' shows the reduction of the latency that the 'After' has achieved compared to 'Before', in percent. Higher 'Latency_reduction' values mean more efficiency improvements.

sz_batch  Before       B-stdev  After        A-stdev  Latency_reduction
1         146386348    2.78     111327360.6  3.13     23.95
2         108222130    1.54     72131173.6   2.39     33.35
4         93617846.8   2.76     51859294.4   2.50     44.61
8         80555150.4   2.38     44328790     1.58     44.97
16        77272777     1.62     37489433.2   1.16     51.48
32        76478465.2   2.75     33570506     3.48     56.10
64        75810266.6   1.15     27037652.6   1.61     64.34
128       73222748     3.86     25517629.4   3.30     65.15
256       72534970.8   2.31     25002180.4   0.94     65.53
512       71809392     5.12     24152285.4   2.41     66.37
1024      73281170.2   4.53     24183615     2.09     67.00

Unexpectedly the latency has reduced (improved) even with batch size one. I think some compiler optimizations have affected that, as was also observed with the first version of this patch series. So, please focus on the proportion between the improvement and the batch size. As expected, tlb flushes batching provides a latency reduction that is proportional to the batch size. The efficiency gain ranges from about 33 percent with batch size 2 up to 67 percent with batch size 1,024. Please note that this is a very simple microbenchmark, so the real efficiency gain on a real workload could be very different. This patch (of 4): To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE, an mmu_gather object, in addition to the behavior integer, needs to be passed to the internal logic. Using a struct can make this easy without increasing the number of parameters of all code paths towards the internal logic. Define a struct for the purpose and use it on the code path that starts from madvise_do_behavior() and ends on madvise_dontneed_free().
Note that this changes madvise_walk_vmas() visitor type signature, too. Specifically, it changes its 'arg' type from 'unsigned long' to the new struct pointer. Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Lorenzo Stoakes Cc: David Hildenbrand Cc: Liam R. Howlett Cc: Rik van Riel Cc: SeongJae Park Cc: Shakeel Butt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/madvise.c | 37 +++++++++++++++++++++++++------------ 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index b17f684322ad..26fa868b41af 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -48,6 +48,11 @@ struct madvise_walk_private { bool pageout; }; +struct madvise_behavior { + int behavior; + struct mmu_gather *tlb; +}; + /* * Any behaviour which results in changes to the vma->vm_flags needs to * take mmap_lock for writing. Others, which simply traverse vmas, need @@ -893,8 +898,9 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, static long madvise_dontneed_free(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - int behavior) + struct madvise_behavior *madv_behavior) { + int behavior = madv_behavior->behavior; struct mm_struct *mm = vma->vm_mm; *prev = vma; @@ -1249,8 +1255,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma, static int madvise_vma_behavior(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - unsigned long behavior) + void *behavior_arg) { + struct madvise_behavior *arg = behavior_arg; + int behavior = arg->behavior; int error; struct anon_vma_name *anon_name; unsigned long new_flags = vma->vm_flags; @@ -1270,7 +1278,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(vma, prev, start, end, arg); case MADV_NORMAL: new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; break; @@ -1487,10 +1495,10 @@ static bool process_madvise_remote_valid(int behavior) */ static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, unsigned long arg, + unsigned long end, void *arg, int (*visit)(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, - unsigned long end, unsigned long arg)) + unsigned long end, void *arg)) { struct vm_area_struct *vma; struct vm_area_struct *prev; @@ -1548,7 +1556,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, static int madvise_vma_anon_name(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - unsigned long anon_name) + void *anon_name) { int error; @@ -1557,7 +1565,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma, return -EBADF; error = madvise_update_vma(vma, prev, start, end, vma->vm_flags, - (struct anon_vma_name *)anon_name); + anon_name); /* * madvise() returns EAGAIN if kernel resources, such as @@ -1589,7 +1597,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, if (end == start) return 0; - return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name, + return madvise_walk_vmas(mm, start, end, anon_name, madvise_vma_anon_name); } #endif /* CONFIG_ANON_VMA_NAME */ @@ -1677,8 +1685,10 @@ static bool is_madvise_populate(int behavior) } static int 
madvise_do_behavior(struct mm_struct *mm, - unsigned long start, size_t len_in, int behavior) + unsigned long start, size_t len_in, + struct madvise_behavior *madv_behavior) { + int behavior = madv_behavior->behavior; struct blk_plug plug; unsigned long end; int error; @@ -1692,7 +1702,7 @@ static int madvise_do_behavior(struct mm_struct *mm, if (is_madvise_populate(behavior)) error = madvise_populate(mm, start, end, behavior); else - error = madvise_walk_vmas(mm, start, end, behavior, + error = madvise_walk_vmas(mm, start, end, madv_behavior, madvise_vma_behavior); blk_finish_plug(&plug); return error; @@ -1773,13 +1783,14 @@ static int madvise_do_behavior(struct mm_struct *mm, int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { int error; + struct madvise_behavior madv_behavior = {.behavior = behavior}; if (madvise_should_skip(start, len_in, behavior, &error)) return error; error = madvise_lock(mm, behavior); if (error) return error; - error = madvise_do_behavior(mm, start, len_in, behavior); + error = madvise_do_behavior(mm, start, len_in, &madv_behavior); madvise_unlock(mm, behavior); return error; @@ -1796,6 +1807,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, { ssize_t ret = 0; size_t total_len; + struct madvise_behavior madv_behavior = {.behavior = behavior}; total_len = iov_iter_count(iter); @@ -1811,7 +1823,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, if (madvise_should_skip(start, len_in, behavior, &error)) ret = error; else - ret = madvise_do_behavior(mm, start, len_in, behavior); + ret = madvise_do_behavior(mm, start, len_in, + &madv_behavior); /* * An madvise operation is attempting to restart the syscall, * but we cannot proceed as it would not be correct to repeat From 01bef02bf9301ea6c255b0daa38356e07719dd69 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 9 Apr 2025 17:00:20 -0700 Subject: [PATCH 105/285] mm/madvise: batch tlb flushes for MADV_FREE MADV_FREE handling for [process_]madvise() flushes the tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via the madvise_behavior struct. Make the per-vma logic not flush the tlb on its own but just save the tlb entries to the received mmu_gather object. Finally, the entry level functions flush the tlb entries gathered for the entire user request at once.
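Condensed from the do_madvise() changes below, the resulting entry-level flow is:

	struct mmu_gather tlb;
	struct madvise_behavior madv_behavior = {
		.behavior = behavior,
		.tlb = &tlb,
	};

	error = madvise_lock(mm, behavior);
	if (error)
		return error;
	/* Calls tlb_gather_mmu() only for behaviors that batch flushes. */
	madvise_init_tlb(&madv_behavior, mm);
	/* Per-vma work only gathers entries into madv_behavior.tlb. */
	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
	/* Calls tlb_finish_mmu(): one flush for the whole request. */
	madvise_finish_tlb(&madv_behavior);
	madvise_unlock(mm, behavior);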
Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/madvise.c | 57 ++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 46 insertions(+), 11 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 26fa868b41af..951038a9f36f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = { .walk_lock = PGWALK_RDLOCK, }; -static int madvise_free_single_vma(struct vm_area_struct *vma, +static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, + struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr) { struct mm_struct *mm = vma->vm_mm; struct mmu_notifier_range range; - struct mmu_gather tlb; + struct mmu_gather *tlb = madv_behavior->tlb; /* MADV_FREE works for only anon vma at the moment */ if (!vma_is_anonymous(vma)) @@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma, range.start, range.end); lru_add_drain(); - tlb_gather_mmu(&tlb, mm); update_hiwater_rss(mm); mmu_notifier_invalidate_range_start(&range); - tlb_start_vma(&tlb, vma); + tlb_start_vma(tlb, vma); walk_page_range(vma->vm_mm, range.start, range.end, - &madvise_free_walk_ops, &tlb); - tlb_end_vma(&tlb, vma); + &madvise_free_walk_ops, tlb); + tlb_end_vma(tlb, vma); mmu_notifier_invalidate_range_end(&range); - tlb_finish_mmu(&tlb); - return 0; } @@ -954,7 +952,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) return madvise_dontneed_single_vma(vma, start, end); else if (behavior == MADV_FREE) - return madvise_free_single_vma(vma, start, end); + return madvise_free_single_vma(madv_behavior, vma, start, end); else return -EINVAL; } @@ -1627,6 +1625,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior) mmap_read_unlock(mm); } +static bool madvise_batch_tlb_flush(int behavior) +{ + switch (behavior) { + case MADV_FREE: + return true; + default: + return false; + } +} + +static void madvise_init_tlb(struct madvise_behavior *madv_behavior, + struct mm_struct *mm) +{ + if (madvise_batch_tlb_flush(madv_behavior->behavior)) + tlb_gather_mmu(madv_behavior->tlb, mm); +} + +static void madvise_finish_tlb(struct madvise_behavior *madv_behavior) +{ + if (madvise_batch_tlb_flush(madv_behavior->behavior)) + tlb_finish_mmu(madv_behavior->tlb); +} + static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior) { size_t len; @@ -1783,14 +1804,20 @@ static int madvise_do_behavior(struct mm_struct *mm, int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { int error; - struct madvise_behavior madv_behavior = {.behavior = behavior}; + struct mmu_gather tlb; + struct madvise_behavior madv_behavior = { + .behavior = behavior, + .tlb = &tlb, + }; if (madvise_should_skip(start, len_in, behavior, &error)) return error; error = madvise_lock(mm, behavior); if (error) return error; + madvise_init_tlb(&madv_behavior, mm); error = madvise_do_behavior(mm, start, len_in, &madv_behavior); + madvise_finish_tlb(&madv_behavior); madvise_unlock(mm, behavior); return error; @@ -1807,13 +1834,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, { ssize_t ret = 0; size_t total_len; - struct madvise_behavior madv_behavior = {.behavior = behavior}; + struct mmu_gather tlb; + struct madvise_behavior madv_behavior = { + .behavior = behavior, + .tlb 
= &tlb, + }; total_len = iov_iter_count(iter); ret = madvise_lock(mm, behavior); if (ret) return ret; + madvise_init_tlb(&madv_behavior, mm); while (iov_iter_count(iter)) { unsigned long start = (unsigned long)iter_iov_addr(iter); @@ -1842,14 +1874,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, } /* Drop and reacquire lock to unwind race. */ + madvise_finish_tlb(&madv_behavior); madvise_unlock(mm, behavior); madvise_lock(mm, behavior); + madvise_init_tlb(&madv_behavior, mm); continue; } if (ret < 0) break; iov_iter_advance(iter, iter_iov_len(iter)); } + madvise_finish_tlb(&madv_behavior); madvise_unlock(mm, behavior); ret = (total_len - iov_iter_count(iter)) ? : ret; From de8efdf8cd273619121cbc2f0f70dddc74bc586e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 9 Apr 2025 17:00:21 -0700 Subject: [PATCH 106/285] mm/memory: split non-tlb flushing part from zap_page_range_single() Some zap_page_range_single() callers, such as [process_]madvise() with MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because zap_page_range_single() flushes the tlb for each invocation. Split out the body of zap_page_range_single(), excluding the mmu_gather object initialization and the flushing of gathered tlb entries, for such batched tlb flushing usage. To avoid hugetlb page allocation failures from concurrent page faults, though, the tlb flush should be done before hugetlb fault handling is unlocked. Do the flush and the unlock, in that order, inside the split-out function for the hugetlb vma case. Refer to commit 2820b0f09be9 ("hugetlbfs: close race between MADV_DONTNEED and page fault") for more details about the concurrent faults' page allocation failure problem. Link: https://lkml.kernel.org/r/20250410000022.1901-4-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/memory.c | 57 ++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 43 insertions(+), 14 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index f41fac7118ba..33d34722e339 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1998,6 +1998,48 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, mmu_notifier_invalidate_range_end(&range); } +/* + * zap_page_range_single_batched - remove user pages in a given range + * @tlb: pointer to the caller's struct mmu_gather + * @vma: vm_area_struct holding the applicable pages + * @address: starting address of pages to remove + * @size: number of bytes to remove + * @details: details of shared cache invalidation + * + * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for + * hugetlb, @tlb is flushed and re-initialized by this function. + */ +static void zap_page_range_single_batched(struct mmu_gather *tlb, + struct vm_area_struct *vma, unsigned long address, + unsigned long size, struct zap_details *details) +{ + const unsigned long end = address + size; + struct mmu_notifier_range range; + + VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, + address, end); + hugetlb_zap_begin(vma, &range.start, &range.end); + update_hiwater_rss(vma->vm_mm); + mmu_notifier_invalidate_range_start(&range); + /* + * unmap 'address-end' not 'range.start-range.end' as range + * could have been expanded for hugetlb pmd sharing.
+ */ + unmap_single_vma(tlb, vma, address, end, details, false); + mmu_notifier_invalidate_range_end(&range); + if (is_vm_hugetlb_page(vma)) { + /* + * flush tlb and free resources before hugetlb_zap_end(), to + * avoid concurrent page faults' allocation failure. + */ + tlb_finish_mmu(tlb); + hugetlb_zap_end(vma, details); + tlb_gather_mmu(tlb, vma->vm_mm); + } +} + /** * zap_page_range_single - remove user pages in a given range * @vma: vm_area_struct holding the applicable pages @@ -2010,24 +2052,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - const unsigned long end = address + size; - struct mmu_notifier_range range; struct mmu_gather tlb; - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, - address, end); - hugetlb_zap_begin(vma, &range.start, &range.end); tlb_gather_mmu(&tlb, vma->vm_mm); - update_hiwater_rss(vma->vm_mm); - mmu_notifier_invalidate_range_start(&range); - /* - * unmap 'address-end' not 'range.start-range.end' as range - * could have been expanded for hugetlb pmd sharing. - */ - unmap_single_vma(&tlb, vma, address, end, details, false); - mmu_notifier_invalidate_range_end(&range); + zap_page_range_single_batched(&tlb, vma, address, size, details); tlb_finish_mmu(&tlb); - hugetlb_zap_end(vma, details); } /** From 43c4cfde7e37460482e41dbb6837d4bff290d2cc Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 9 Apr 2025 17:00:22 -0700 Subject: [PATCH 107/285] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes the tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via the madvise_behavior struct. Make the per-vma logic not flush the tlb on its own but just save the tlb entries to the received mmu_gather object. For this internal logic change, make zap_page_range_single_batched() non-static and use it directly from madvise_dontneed_single_vma(). Finally, the entry level functions flush the tlb entries gathered for the entire user request at once. Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/internal.h | 3 +++ mm/madvise.c | 11 ++++++++--- mm/memory.c | 4 ++-- 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 0cf1f534ee1a..780481a8be0e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -430,6 +430,9 @@ void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, struct zap_details *details); +void zap_page_range_single_batched(struct mmu_gather *tlb, + struct vm_area_struct *vma, unsigned long addr, + unsigned long size, struct zap_details *details); int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, gfp_t gfp); diff --git a/mm/madvise.c b/mm/madvise.c index 951038a9f36f..8433ac9b27e0 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -851,7 +851,8 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior, * An interface that causes the system to free clean pages and flush * dirty pages is already available as msync(MS_INVALIDATE).
*/ -static long madvise_dontneed_single_vma(struct vm_area_struct *vma, +static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior, + struct vm_area_struct *vma, unsigned long start, unsigned long end) { struct zap_details details = { @@ -859,7 +860,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, .even_cows = true, }; - zap_page_range_single(vma, start, end - start, &details); + zap_page_range_single_batched( + madv_behavior->tlb, vma, start, end - start, &details); return 0; } @@ -950,7 +952,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, } if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) - return madvise_dontneed_single_vma(vma, start, end); + return madvise_dontneed_single_vma( + madv_behavior, vma, start, end); else if (behavior == MADV_FREE) return madvise_free_single_vma(madv_behavior, vma, start, end); else @@ -1628,6 +1631,8 @@ static void madvise_unlock(struct mm_struct *mm, int behavior) static bool madvise_batch_tlb_flush(int behavior) { switch (behavior) { + case MADV_DONTNEED: + case MADV_DONTNEED_LOCKED: case MADV_FREE: return true; default: diff --git a/mm/memory.c b/mm/memory.c index 33d34722e339..71c255f3fdcc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1998,7 +1998,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, mmu_notifier_invalidate_range_end(&range); } -/* +/** * zap_page_range_single_batched - remove user pages in a given range * @tlb: pointer to the caller's struct mmu_gather * @vma: vm_area_struct holding the applicable pages @@ -2009,7 +2009,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for * hugetlb, @tlb is flushed and re-initialized by this function. */ -static void zap_page_range_single_batched(struct mmu_gather *tlb, +void zap_page_range_single_batched(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { From 28092a652f9c17773aa6ca0f62f73738d756b914 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Thu, 10 Apr 2025 19:14:41 +0000 Subject: [PATCH 108/285] maple_tree: convert mas_prealloc_calc() to take in a maple write state Patch series "Track node vacancy to reduce worst case allocation counts", v5. ================ overview ======================== Currently, the maple tree preallocates the worst case number of nodes for given store type by taking into account the whole height of the tree. This comes from a worst case scenario of every node in the tree being full and having to propagate node allocation upwards until we reach the root of the tree. This can be optimized if there are vacancies in nodes that are at a lower depth than the root node. This series implements tracking the level at which there is a vacant node so we only need to allocate until this level is reached, rather than always using the full height of the tree. The ma_wr_state struct is modified to add a field which keeps track of the vacant height and is updated during walks of the tree. This value is then read in mas_prealloc_calc() when we decide how many nodes to allocate. For rebalancing and spanning stores, we also need to track the lowest height at which a node has 1 more entry than the minimum sufficient number of entries. This is because rebalancing can cause a parent node to become insufficient which results in further node allocations. 
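As a concrete feel for the savings (a hypothetical sketch, not code from this series; the helper name is made up), the worst case stops being a function of the full tree height and becomes a function of the distance down to the first known vacancy:

	/*
	 * Illustrative only: nodes to preallocate for a split store.
	 * A split can only propagate upward until it reaches a level
	 * that already has a free slot.
	 */
	static int split_prealloc_sketch(int height, int vacant_height)
	{
		return (height - vacant_height) * 2 + 1;
	}

For height 4 with a vacancy seen at height 2, that is 5 nodes instead of the former 4 * 2 + 1 = 9. A rebalance, however, can merge two children and push their parent below the minimum occupancy, so a vacancy seen on the way down is not by itself a safe stopping point.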
In this case, we need to use the sufficient height as the worst case rather than the vacant height.

patch 1-2: preparatory patches
patch 3: implement vacant height tracking + update the tests
patch 4: support vacant height tracking for rebalancing writes
patch 5: implement sufficient height tracking
patch 6: reorder switch case statements

================ results =========================

Bpftrace was used to profile the allocation path for requesting new maple nodes while running stress-ng mmap 120s. The histograms below represent requests to kmem_cache_alloc_bulk() and show the count argument. This represents how many maple nodes the caller is requesting in kmem_cache_alloc_bulk()

command: stress-ng --mmap 4 --timeout 120

mm-unstable

@bulk_alloc_req:
[3, 4)                 4 |                                                    |
[4, 5)             54170 |@                                                   |
[5, 6)                 0 |                                                    |
[6, 7)            893057 |@@@@@@@@@@@@@@@@@@@@                                |
[7, 8)                 4 |                                                    |
[8, 9)           2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10)            55811 |@                                                   |
[10, 11)           77834 |@                                                   |
[11, 12)               0 |                                                    |
[12, 13)         1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[13, 14)               0 |                                                    |
[14, 15)               0 |                                                    |
[15, 16)          367197 |@@@@@@@@                                            |

@maple_node_total: 46,630,160
@total_vmas: 46184591

mm-unstable + this series

@bulk_alloc_req:
[2, 3)               198 |                                                    |
[3, 4)                 4 |                                                    |
[4, 5)                43 |                                                    |
[5, 6)                 0 |                                                    |
[6, 7)           1069503 |@@@@@@@@@@@@@@@@@@@@@                               |
[7, 8)                 4 |                                                    |
[8, 9)           2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10)           472191 |@@@@@@@@@                                           |
[10, 11)          191904 |@@@                                                 |
[11, 12)               0 |                                                    |
[12, 13)          247316 |@@@@                                                |
[13, 14)               0 |                                                    |
[14, 15)               0 |                                                    |
[15, 16)           98769 |@                                                   |

@maple_node_total: 37,813,856
@total_vmas: 43493287

This represents a ~19% reduction in the number of bulk maple nodes allocated.

For more reproducible results, a histogram of the return value of mas_prealloc_calc() is displayed while running the maple_tree_tests, which have a deterministic store pattern

mas_prealloc_calc() return value  mm-unstable
1   : (12068)
3   : (11836)
5   : ***** (271192)
7   : ************************************************** (2329329)
9   : *********** (534186)
10  : (435)
11  : *************** (704306)
13  : ******** (409781)

mas_prealloc_calc() return value  mm-unstable + this series
1   : (12070)
3   : ************************************************** (3548777)
5   : ******** (633458)
7   : (65081)
9   : (11224)
10  : (341)
11  : (2973)
13  : (68)

do_mmap latency was also measured for regressions:
command: stress-ng --mmap 4 --timeout 120

mm-unstable:
avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034

mm-unstable + this series:
avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726

stress-ng --mmap 4 --timeout 120

with vacant_height:
stress-ng: info: [257] 21526312 Maple Tree Read 0.176 M/sec
stress-ng: info: [257] 339979348 Maple Tree Write 2.774 M/sec

without vacant_height:
stress-ng: info: [8228] 20968900 Maple Tree Read 0.171 M/sec
stress-ng: info: [8228] 312214648 Maple Tree Write 2.547 M/sec

This represents an increase of ~3% read throughput and a ~9% increase in write throughput.

This patch (of 6):

In a subsequent patch, mas_prealloc_calc() will need to access fields only in the ma_wr_state. Convert the function to take in a ma_wr_state and modify all callers. There is no functional change.

Link: https://lkml.kernel.org/r/20250410191446.2474640-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20250410191446.2474640-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar
Reviewed-by: Liam R.
Reviewed-by: Liam R. Howlett
Cc: Matthew Wilcox (Oracle)
Cc: Wei Yang
Signed-off-by: Andrew Morton
---
 lib/maple_tree.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d0bea23fa4bc..f25ee210d495 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4140,13 +4140,14 @@ set_content:
 /**
  * mas_prealloc_calc() - Calculate number of nodes needed for a
  * given store oepration
- * @mas: The maple state
+ * @wr_mas: The maple write state
  * @entry: The entry to store into the tree
  *
  * Return: Number of nodes required for preallocation.
  */
-static inline int mas_prealloc_calc(struct ma_state *mas, void *entry)
+static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 {
+	struct ma_state *mas = wr_mas->mas;
 	int ret = mas_mt_height(mas) * 3 + 1;
 
 	switch (mas->store_type) {
@@ -4243,16 +4244,15 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
  */
 static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry)
 {
-	struct ma_state *mas = wr_mas->mas;
 	int request;
 
 	mas_wr_prealloc_setup(wr_mas);
-	mas->store_type = mas_wr_store_type(wr_mas);
-	request = mas_prealloc_calc(mas, entry);
+	wr_mas->mas->store_type = mas_wr_store_type(wr_mas);
+	request = mas_prealloc_calc(wr_mas, entry);
 	if (!request)
 		return;
 
-	mas_node_count(mas, request);
+	mas_node_count(wr_mas->mas, request);
 }
 
 /**
@@ -5397,7 +5397,7 @@ void *mas_store(struct ma_state *mas, void *entry)
 		return wr_mas.content;
 	}
 
-	request = mas_prealloc_calc(mas, entry);
+	request = mas_prealloc_calc(&wr_mas, entry);
 	if (!request)
 		goto store;
 
@@ -5494,7 +5494,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 
 	mas_wr_prealloc_setup(&wr_mas);
 	mas->store_type = mas_wr_store_type(&wr_mas);
-	request = mas_prealloc_calc(mas, entry);
+	request = mas_prealloc_calc(&wr_mas, entry);
 	if (!request)
 		return ret;

From f9d3a963fef4d3377b7ee122408cf2cdf37b3181 Mon Sep 17 00:00:00 2001
From: Sidhartha Kumar
Date: Thu, 10 Apr 2025 19:14:42 +0000
Subject: [PATCH 109/285] maple_tree: use height and depth consistently

For the maple tree, the root node is defined to have a depth of 0 with a height of 1. Each level down from the root, these values are incremented by 1. Various code paths define a root with depth 1, which is inconsistent with the definition. Modify the code to be consistent with this definition.

In mas_spanning_rebalance(), l_mas.depth was being used to track the height based on the number of iterations done in the main loop. This information was then used in mas_put_in_tree() to set the height. Rather than overload the l_mas.depth field to track height, simply keep track of the height in the local variable new_height and directly pass this to mas_wmb_replace(), which will pass it into mas_put_in_tree(). This allows us to remove the writes to l_mas.depth.

Link: https://lkml.kernel.org/r/20250410191446.2474640-3-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar
Reviewed-by: Liam R.
Howlett Cc: Matthew Wilcox (Oracle) Cc: Wei Yang Signed-off-by: Andrew Morton --- lib/maple_tree.c | 84 +++++++++++++++++--------------- tools/testing/radix-tree/maple.c | 19 ++++++++ 2 files changed, 63 insertions(+), 40 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index f25ee210d495..195b19505b39 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -211,14 +211,14 @@ static void ma_free_rcu(struct maple_node *node) call_rcu(&node->rcu, mt_free_rcu); } -static void mas_set_height(struct ma_state *mas) +static void mt_set_height(struct maple_tree *mt, unsigned char height) { - unsigned int new_flags = mas->tree->ma_flags; + unsigned int new_flags = mt->ma_flags; new_flags &= ~MT_FLAGS_HEIGHT_MASK; - MAS_BUG_ON(mas, mas->depth > MAPLE_HEIGHT_MAX); - new_flags |= mas->depth << MT_FLAGS_HEIGHT_OFFSET; - mas->tree->ma_flags = new_flags; + MT_BUG_ON(mt, height > MAPLE_HEIGHT_MAX); + new_flags |= height << MT_FLAGS_HEIGHT_OFFSET; + mt->ma_flags = new_flags; } static unsigned int mas_mt_height(struct ma_state *mas) @@ -1371,7 +1371,7 @@ retry: root = mas_root(mas); /* Tree with nodes */ if (likely(xa_is_node(root))) { - mas->depth = 1; + mas->depth = 0; mas->status = ma_active; mas->node = mte_safe_root(root); mas->offset = 0; @@ -1712,9 +1712,10 @@ static inline void mas_adopt_children(struct ma_state *mas, * node as dead. * @mas: the maple state with the new node * @old_enode: The old maple encoded node to replace. + * @new_height: if we are inserting a root node, update the height of the tree */ static inline void mas_put_in_tree(struct ma_state *mas, - struct maple_enode *old_enode) + struct maple_enode *old_enode, char new_height) __must_hold(mas->tree->ma_lock) { unsigned char offset; @@ -1723,7 +1724,7 @@ static inline void mas_put_in_tree(struct ma_state *mas, if (mte_is_root(mas->node)) { mas_mn(mas)->parent = ma_parent_ptr(mas_tree_parent(mas)); rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node)); - mas_set_height(mas); + mt_set_height(mas->tree, new_height); } else { offset = mte_parent_slot(mas->node); @@ -1741,12 +1742,13 @@ static inline void mas_put_in_tree(struct ma_state *mas, * the parent encoding to locate the maple node in the tree. * @mas: the ma_state with @mas->node pointing to the new node. * @old_enode: The old maple encoded node. 
+ * @new_height: The new height of the tree as a result of the operation */ static inline void mas_replace_node(struct ma_state *mas, - struct maple_enode *old_enode) + struct maple_enode *old_enode, unsigned char new_height) __must_hold(mas->tree->ma_lock) { - mas_put_in_tree(mas, old_enode); + mas_put_in_tree(mas, old_enode, new_height); mas_free(mas, old_enode); } @@ -2536,10 +2538,11 @@ static inline void mas_topiary_node(struct ma_state *mas, * * @mas: The maple state pointing at the new data * @old_enode: The maple encoded node being replaced + * @new_height: The new height of the tree as a result of the operation * */ static inline void mas_topiary_replace(struct ma_state *mas, - struct maple_enode *old_enode) + struct maple_enode *old_enode, unsigned char new_height) { struct ma_state tmp[3], tmp_next[3]; MA_TOPIARY(subtrees, mas->tree); @@ -2547,7 +2550,7 @@ static inline void mas_topiary_replace(struct ma_state *mas, int i, n; /* Place data in tree & then mark node as old */ - mas_put_in_tree(mas, old_enode); + mas_put_in_tree(mas, old_enode, new_height); /* Update the parent pointers in the tree */ tmp[0] = *mas; @@ -2631,14 +2634,15 @@ static inline void mas_topiary_replace(struct ma_state *mas, * mas_wmb_replace() - Write memory barrier and replace * @mas: The maple state * @old_enode: The old maple encoded node that is being replaced. + * @new_height: The new height of the tree as a result of the operation * * Updates gap as necessary. */ static inline void mas_wmb_replace(struct ma_state *mas, - struct maple_enode *old_enode) + struct maple_enode *old_enode, unsigned char new_height) { /* Insert the new data in the tree */ - mas_topiary_replace(mas, old_enode); + mas_topiary_replace(mas, old_enode, new_height); if (mte_is_leaf(mas->node)) return; @@ -2824,6 +2828,7 @@ static void mas_spanning_rebalance(struct ma_state *mas, { unsigned char split, mid_split; unsigned char slot = 0; + unsigned char new_height = 0; /* used if node is a new root */ struct maple_enode *left = NULL, *middle = NULL, *right = NULL; struct maple_enode *old_enode; @@ -2845,8 +2850,6 @@ static void mas_spanning_rebalance(struct ma_state *mas, unlikely(mast->bn->b_end <= mt_min_slots[mast->bn->type])) mast_spanning_rebalance(mast); - l_mas.depth = 0; - /* * Each level of the tree is examined and balanced, pushing data to the left or * right, or rebalancing against left or right nodes is employed to avoid @@ -2866,6 +2869,7 @@ static void mas_spanning_rebalance(struct ma_state *mas, mast_set_split_parents(mast, left, middle, right, split, mid_split); mast_cp_to_nodes(mast, left, middle, right, split, mid_split); + new_height++; /* * Copy data from next level in the tree to mast->bn from next @@ -2873,7 +2877,6 @@ static void mas_spanning_rebalance(struct ma_state *mas, */ memset(mast->bn, 0, sizeof(struct maple_big_node)); mast->bn->type = mte_node_type(left); - l_mas.depth++; /* Root already stored in l->node. 
*/ if (mas_is_root_limits(mast->l)) @@ -2909,8 +2912,9 @@ static void mas_spanning_rebalance(struct ma_state *mas, l_mas.node = mt_mk_node(ma_mnode_ptr(mas_pop_node(mas)), mte_node_type(mast->orig_l->node)); - l_mas.depth++; + mab_mas_cp(mast->bn, 0, mt_slots[mast->bn->type] - 1, &l_mas, true); + new_height++; mas_set_parent(mas, left, l_mas.node, slot); if (middle) mas_set_parent(mas, middle, l_mas.node, ++slot); @@ -2933,7 +2937,7 @@ new_root: mas->min = l_mas.min; mas->max = l_mas.max; mas->offset = l_mas.offset; - mas_wmb_replace(mas, old_enode); + mas_wmb_replace(mas, old_enode, new_height); mtree_range_walk(mas); return; } @@ -3009,6 +3013,7 @@ static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end void __rcu **l_slots, **slots; unsigned long *l_pivs, *pivs, gap; bool in_rcu = mt_in_rcu(mas->tree); + unsigned char new_height = mas_mt_height(mas); MA_STATE(l_mas, mas->tree, mas->index, mas->last); @@ -3103,7 +3108,7 @@ done: mas_ascend(mas); if (in_rcu) { - mas_replace_node(mas, old_eparent); + mas_replace_node(mas, old_eparent, new_height); mas_adopt_children(mas, mas->node); } @@ -3114,10 +3119,9 @@ done: * mas_split_final_node() - Split the final node in a subtree operation. * @mast: the maple subtree state * @mas: The maple state - * @height: The height of the tree in case it's a new root. */ static inline void mas_split_final_node(struct maple_subtree_state *mast, - struct ma_state *mas, int height) + struct ma_state *mas) { struct maple_enode *ancestor; @@ -3126,7 +3130,6 @@ static inline void mas_split_final_node(struct maple_subtree_state *mast, mast->bn->type = maple_arange_64; else mast->bn->type = maple_range_64; - mas->depth = height; } /* * Only a single node is used here, could be root. @@ -3214,7 +3217,6 @@ static inline void mast_split_data(struct maple_subtree_state *mast, * mas_push_data() - Instead of splitting a node, it is beneficial to push the * data to the right or left node if there is room. * @mas: The maple state - * @height: The current height of the maple state * @mast: The maple subtree state * @left: Push left or not. * @@ -3222,8 +3224,8 @@ static inline void mast_split_data(struct maple_subtree_state *mast, * * Return: True if pushed, false otherwise. 
*/ -static inline bool mas_push_data(struct ma_state *mas, int height, - struct maple_subtree_state *mast, bool left) +static inline bool mas_push_data(struct ma_state *mas, + struct maple_subtree_state *mast, bool left) { unsigned char slot_total = mast->bn->b_end; unsigned char end, space, split; @@ -3280,7 +3282,7 @@ static inline bool mas_push_data(struct ma_state *mas, int height, mast_split_data(mast, mas, split); mast_fill_bnode(mast, mas, 2); - mas_split_final_node(mast, mas, height + 1); + mas_split_final_node(mast, mas); return true; } @@ -3293,6 +3295,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) { struct maple_subtree_state mast; int height = 0; + unsigned int orig_height = mas_mt_height(mas); unsigned char mid_split, split = 0; struct maple_enode *old; @@ -3319,7 +3322,6 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) MA_STATE(prev_r_mas, mas->tree, mas->index, mas->last); trace_ma_op(__func__, mas); - mas->depth = mas_mt_height(mas); mast.l = &l_mas; mast.r = &r_mas; @@ -3327,9 +3329,9 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) mast.orig_r = &prev_r_mas; mast.bn = b_node; - while (height++ <= mas->depth) { + while (height++ <= orig_height) { if (mt_slots[b_node->type] > b_node->b_end) { - mas_split_final_node(&mast, mas, height); + mas_split_final_node(&mast, mas); break; } @@ -3344,11 +3346,15 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) * is a significant savings. */ /* Try to push left. */ - if (mas_push_data(mas, height, &mast, true)) + if (mas_push_data(mas, &mast, true)) { + height++; break; + } /* Try to push right. */ - if (mas_push_data(mas, height, &mast, false)) + if (mas_push_data(mas, &mast, false)) { + height++; break; + } split = mab_calc_split(mas, b_node, &mid_split); mast_split_data(&mast, mas, split); @@ -3365,7 +3371,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) /* Set the original node as dead */ old = mas->node; mas->node = l_mas.node; - mas_wmb_replace(mas, old); + mas_wmb_replace(mas, old, height); mtree_range_walk(mas); return; } @@ -3424,8 +3430,7 @@ static inline void mas_root_expand(struct ma_state *mas, void *entry) if (mas->last != ULONG_MAX) pivots[++slot] = ULONG_MAX; - mas->depth = 1; - mas_set_height(mas); + mt_set_height(mas->tree, 1); ma_set_meta(node, maple_leaf_64, 0, slot); /* swap the new root into the tree */ rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node)); @@ -3669,8 +3674,7 @@ static inline void mas_new_root(struct ma_state *mas, void *entry) WARN_ON_ONCE(mas->index || mas->last != ULONG_MAX); if (!entry) { - mas->depth = 0; - mas_set_height(mas); + mt_set_height(mas->tree, 0); rcu_assign_pointer(mas->tree->ma_root, entry); mas->status = ma_start; goto done; @@ -3684,8 +3688,7 @@ static inline void mas_new_root(struct ma_state *mas, void *entry) mas->status = ma_active; rcu_assign_pointer(slots[0], entry); pivots[0] = mas->last; - mas->depth = 1; - mas_set_height(mas); + mt_set_height(mas->tree, 1); rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node)); done: @@ -3804,6 +3807,7 @@ static inline void mas_wr_node_store(struct ma_wr_state *wr_mas, struct maple_node reuse, *newnode; unsigned char copy_size, node_pivots = mt_pivots[wr_mas->type]; bool in_rcu = mt_in_rcu(mas->tree); + unsigned char height = mas_mt_height(mas); if (mas->last == wr_mas->end_piv) offset_end++; /* don't copy this offset */ @@ -3860,7 +3864,7 @@ done: struct maple_enode 
*old_enode = mas->node; mas->node = mt_mk_node(newnode, wr_mas->type); - mas_replace_node(mas, old_enode); + mas_replace_node(mas, old_enode, height); } else { memcpy(wr_mas->node, newnode, sizeof(struct maple_node)); } diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index bc30050227fd..e0f8fabe8821 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -36248,6 +36248,21 @@ static noinline void __init check_mtree_dup(struct maple_tree *mt) extern void test_kmem_cache_bulk(void); +static inline void check_spanning_store_height(struct maple_tree *mt) +{ + int index = 0; + MA_STATE(mas, mt, 0, 0); + mas_lock(&mas); + while (mt_height(mt) != 3) { + mas_store_gfp(&mas, xa_mk_value(index), GFP_KERNEL); + mas_set(&mas, ++index); + } + mas_set_range(&mas, 90, 140); + mas_store_gfp(&mas, xa_mk_value(index), GFP_KERNEL); + MT_BUG_ON(mt, mas_mt_height(&mas) != 2); + mas_unlock(&mas); +} + /* callback function used for check_nomem_writer_race() */ static void writer2(void *maple_tree) { @@ -36414,6 +36429,10 @@ void farmer_tests(void) check_spanning_write(&tree); mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); + check_spanning_store_height(&tree); + mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); check_null_expand(&tree); mtree_destroy(&tree); From ad88fc17d2dafe45e40de2af80207f4b2e3b1f71 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Thu, 10 Apr 2025 19:14:43 +0000 Subject: [PATCH 110/285] maple_tree: use vacant nodes to reduce worst case allocations In order to determine the store type for a maple tree operation, a walk of the tree is done through mas_wr_walk(). This function descends the tree until a spanning write is detected or we reach a leaf node. While descending, keep track of the height at which we encounter a node with available space. This is done by checking if mas->end is less than the number of slots a given node type can fit. Now that the height of the vacant node is tracked, we can use the difference between the height of the tree and the height of the vacant node to know how many levels we will have to propagate creating new nodes. Update mas_prealloc_calc() to consider the vacant height and reduce the number of worst-case allocations. Rebalancing and spanning stores are not supported and fall back to using the full height of the tree for allocations. Update preallocation testing assertions to take into account vacant height. Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. 
Howlett Cc: Matthew Wilcox (Oracle) Cc: Wei Yang Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 2 + lib/maple_tree.c | 13 ++++-- tools/testing/radix-tree/maple.c | 79 ++++++++++++++++++++++++++++---- 3 files changed, 82 insertions(+), 12 deletions(-) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index cbbcd18d4186..657adb33e61e 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -463,6 +463,7 @@ struct ma_wr_state { void __rcu **slots; /* mas->node->slots pointer */ void *entry; /* The entry to write */ void *content; /* The existing entry that is being overwritten */ + unsigned char vacant_height; /* Height of lowest node with free space */ }; #define mas_lock(mas) spin_lock(&((mas)->tree->ma_lock)) @@ -498,6 +499,7 @@ struct ma_wr_state { .mas = ma_state, \ .content = NULL, \ .entry = wr_entry, \ + .vacant_height = 0 \ } #define MA_TOPIARY(name, tree) \ diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 195b19505b39..3f794ef072f4 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -3537,6 +3537,9 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) if (ma_is_leaf(wr_mas->type)) return true; + if (mas->end < mt_slots[wr_mas->type] - 1) + wr_mas->vacant_height = mas->depth + 1; + mas_wr_walk_traverse(wr_mas); } @@ -4152,7 +4155,9 @@ set_content: static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry) { struct ma_state *mas = wr_mas->mas; - int ret = mas_mt_height(mas) * 3 + 1; + unsigned char height = mas_mt_height(mas); + int ret = height * 3 + 1; + unsigned char delta = height - wr_mas->vacant_height; switch (mas->store_type) { case wr_invalid: @@ -4170,13 +4175,13 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry) ret = 0; break; case wr_spanning_store: - ret = mas_mt_height(mas) * 3 + 1; + WARN_ON_ONCE(ret != height * 3 + 1); break; case wr_split_store: - ret = mas_mt_height(mas) * 2 + 1; + ret = delta * 2 + 1; break; case wr_rebalance: - ret = mas_mt_height(mas) * 2 - 1; + ret = height * 2 + 1; break; case wr_node_store: ret = mt_in_rcu(mas->tree) ? 1 : 0; diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index e0f8fabe8821..e37a3ab2e921 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -35475,15 +35475,65 @@ static void check_dfs_preorder(struct maple_tree *mt) } /* End of depth first search tests */ +/* get height of the lowest non-leaf node with free space */ +static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry) +{ + struct ma_state *mas = wr_mas->mas; + char vacant_height = 0; + enum maple_type type; + unsigned long *pivots; + unsigned long min = 0; + unsigned long max = ULONG_MAX; + unsigned char offset; + + /* start traversal */ + mas_reset(mas); + mas_start(mas); + if (!xa_is_node(mas_root(mas))) + return 0; + + type = mte_node_type(mas->node); + wr_mas->type = type; + while (!ma_is_leaf(type)) { + mas_node_walk(mas, mte_to_node(mas->node), type, &min, &max); + offset = mas->offset; + mas->end = mas_data_end(mas); + pivots = ma_pivots(mte_to_node(mas->node), type); + + if (pivots) { + if (offset) + min = pivots[mas->offset - 1]; + if (offset < mas->end) + max = pivots[mas->offset]; + } + wr_mas->r_max = offset < mas->end ? 
pivots[offset] : mas->max; + + /* detect spanning write */ + if (mas_is_span_wr(wr_mas)) + break; + + if (mas->end < mt_slot_count(mas->node) - 1) + vacant_height = mas->depth + 1; + + mas_descend(mas); + type = mte_node_type(mas->node); + mas->depth++; + } + + return vacant_height; +} + /* Preallocation testing */ static noinline void __init check_prealloc(struct maple_tree *mt) { unsigned long i, max = 100; unsigned long allocated; unsigned char height; + unsigned char vacant_height; struct maple_node *mn; void *ptr = check_prealloc; MA_STATE(mas, mt, 10, 20); + MA_WR_STATE(wr_mas, &mas, ptr); mt_set_non_kernel(1000); for (i = 0; i <= max; i++) @@ -35494,8 +35544,9 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); + vacant_height = get_vacant_height(&wr_mas, ptr); MT_BUG_ON(mt, allocated == 0); - MT_BUG_ON(mt, allocated != 1 + height * 3); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); mas_destroy(&mas); allocated = mas_allocated(&mas); MT_BUG_ON(mt, allocated != 0); @@ -35503,8 +35554,9 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); + vacant_height = get_vacant_height(&wr_mas, ptr); MT_BUG_ON(mt, allocated == 0); - MT_BUG_ON(mt, allocated != 1 + height * 3); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); mas_destroy(&mas); allocated = mas_allocated(&mas); @@ -35514,7 +35566,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); - MT_BUG_ON(mt, allocated != 1 + height * 3); + vacant_height = get_vacant_height(&wr_mas, ptr); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); mn = mas_pop_node(&mas); MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); mn->parent = ma_parent_ptr(mn); @@ -35527,7 +35580,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); - MT_BUG_ON(mt, allocated != 1 + height * 3); + vacant_height = get_vacant_height(&wr_mas, ptr); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); mn = mas_pop_node(&mas); MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); @@ -35540,7 +35594,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); - MT_BUG_ON(mt, allocated != 1 + height * 3); + vacant_height = get_vacant_height(&wr_mas, ptr); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); mn = mas_pop_node(&mas); MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); mas_push_node(&mas, mn); @@ -35553,7 +35608,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); - MT_BUG_ON(mt, allocated != 1 + height * 3); + vacant_height = get_vacant_height(&wr_mas, ptr); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3); mas_store_prealloc(&mas, ptr); MT_BUG_ON(mt, 
mas_allocated(&mas) != 0); @@ -35578,7 +35634,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); - MT_BUG_ON(mt, allocated != 1 + height * 2); + vacant_height = get_vacant_height(&wr_mas, ptr); + MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 2); mas_store_prealloc(&mas, ptr); MT_BUG_ON(mt, mas_allocated(&mas) != 0); mt_set_non_kernel(1); @@ -35595,8 +35652,14 @@ static noinline void __init check_prealloc(struct maple_tree *mt) MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); allocated = mas_allocated(&mas); height = mas_mt_height(&mas); + vacant_height = get_vacant_height(&wr_mas, ptr); MT_BUG_ON(mt, allocated == 0); - MT_BUG_ON(mt, allocated != 1 + height * 3); + /* + * vacant height cannot be used to compute the number of nodes needed + * as the root contains two entries which means it is on the verge of + * insufficiency. The worst case full height of the tree is needed. + */ + MT_BUG_ON(mt, allocated != height * 3 + 1); mas_store_prealloc(&mas, ptr); MT_BUG_ON(mt, mas_allocated(&mas) != 0); mas_set_range(&mas, 0, 200); From 300a5b4ffedf826998ed1f1b5d107e9fb0ef7579 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Thu, 10 Apr 2025 19:14:44 +0000 Subject: [PATCH 111/285] maple_tree: break on convergence in mas_spanning_rebalance() This allows support for using the vacant height to calculate the worst case number of nodes needed for wr_rebalance operation. mas_spanning_rebalance() was seen to perform unnecessary node allocations. We can reduce allocations by breaking early during the rebalancing loop once we realize that we have ascended to a common ancestor. Link: https://lkml.kernel.org/r/20250410191446.2474640-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Suggested-by: Liam Howlett Reviewed-by: Wei Yang Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- lib/maple_tree.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 3f794ef072f4..5610b3742a79 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2893,11 +2893,21 @@ static void mas_spanning_rebalance(struct ma_state *mas, mast_combine_cp_right(mast); mast->orig_l->last = mast->orig_l->max; - if (mast_sufficient(mast)) - continue; + if (mast_sufficient(mast)) { + if (mast_overflow(mast)) + continue; + + if (mast->orig_l->node == mast->orig_r->node) { + /* + * The data in b_node should be stored in one + * node and in the tree + */ + slot = mast->l->offset; + break; + } - if (mast_overflow(mast)) continue; + } /* May be a new root stored in mast->bn */ if (mas_is_root_limits(mast->orig_l)) From 271152a973cb01c135d29e91d1a05f51fbd88a9c Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Thu, 10 Apr 2025 19:14:45 +0000 Subject: [PATCH 112/285] maple_tree: add sufficient height In order to support rebalancing and spanning stores using less than the worst case number of nodes, we need to track more than just the vacant height. Using only vacant height to reduce the worst case maple node allocation count can lead to a shortcoming of nodes in the following scenarios. For rebalancing writes, when a leaf node becomes insufficient, it may be combined with a sibling into a single node. This means that the parent node which has entries for this children will lose one entry. 
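Schematically (a toy model for illustration only, not kernel code), the question is how far such a merge can cascade:

	/*
	 * Illustrative only: entries[i] is the entry count of the
	 * ancestor at height i, min[i] its minimum occupancy. Count
	 * how many levels a rebalance can cascade when each merge
	 * costs the level above one entry.
	 */
	static int cascade_levels_sketch(const int *entries, const int *min,
					 int height)
	{
		int i = 0;

		while (i < height && entries[i] - 1 < min[i])
			i++;
		return i;
	}

Whether the cascade starts at all depends on how full that parent was to begin with.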
If this parent node was just meeting the minimum entries, losing one entry will now cause this parent node to be insufficient. This leads to a cascading operation of rebalancing at different levels and can lead to more node allocations than simply using vacant height can return. For spanning writes, a similar situation occurs. At the location at which a spanning write is detected, the number of ancestor nodes may similarly need to rebalanced into a smaller number of nodes and the same cascading situation could occur. To use less than the full height of the tree for the number of allocations, we also need to track the height at which a non-leaf node cannot become insufficient. This means even if a rebalance occurs to a child of this node, it currently has enough entries that it can lose one without any further action. This field is stored in the maple write state as sufficient height. In mas_prealloc_calc() when figuring out how many nodes to allocate, we check if the vacant node is lower in the tree than a sufficient node (has a larger value). If it is, we cannot use the vacant height and must use the difference in the height and sufficient height as the basis for the number of nodes needed. An off by one bug was also discovered in mast_overflow() where it is using >= rather than >. This caused extra iterations of the mas_spanning_rebalance() loop and lead to unneeded allocations. A test is also added to check the number of allocations is correct. Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Wei Yang Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 4 +++- lib/maple_tree.c | 19 ++++++++++++++++--- tools/testing/radix-tree/maple.c | 28 ++++++++++++++++++++++++++++ 3 files changed, 47 insertions(+), 4 deletions(-) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 657adb33e61e..9ef129038224 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -464,6 +464,7 @@ struct ma_wr_state { void *entry; /* The entry to write */ void *content; /* The existing entry that is being overwritten */ unsigned char vacant_height; /* Height of lowest node with free space */ + unsigned char sufficient_height;/* Height of lowest node with min sufficiency + 1 nodes */ }; #define mas_lock(mas) spin_lock(&((mas)->tree->ma_lock)) @@ -499,7 +500,8 @@ struct ma_wr_state { .mas = ma_state, \ .content = NULL, \ .entry = wr_entry, \ - .vacant_height = 0 \ + .vacant_height = 0, \ + .sufficient_height = 0 \ } #define MA_TOPIARY(name, tree) \ diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 5610b3742a79..aa139668bcae 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2741,7 +2741,7 @@ static inline bool mast_sufficient(struct maple_subtree_state *mast) */ static inline bool mast_overflow(struct maple_subtree_state *mast) { - if (mast->bn->b_end >= mt_slot_count(mast->orig_l->node)) + if (mast->bn->b_end > mt_slot_count(mast->orig_l->node)) return true; return false; @@ -3550,6 +3550,13 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) if (mas->end < mt_slots[wr_mas->type] - 1) wr_mas->vacant_height = mas->depth + 1; + if (ma_is_root(mas_mn(mas))) { + /* root needs more than 2 entries to be sufficient + 1 */ + if (mas->end > 2) + wr_mas->sufficient_height = 1; + } else if (mas->end > mt_min_slots[wr_mas->type] + 1) + wr_mas->sufficient_height = mas->depth + 1; + mas_wr_walk_traverse(wr_mas); } @@ -4185,13 +4192,19 @@ static 
inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry) ret = 0; break; case wr_spanning_store: - WARN_ON_ONCE(ret != height * 3 + 1); + if (wr_mas->sufficient_height < wr_mas->vacant_height) + ret = (height - wr_mas->sufficient_height) * 3 + 1; + else + ret = delta * 3 + 1; break; case wr_split_store: ret = delta * 2 + 1; break; case wr_rebalance: - ret = height * 2 + 1; + if (wr_mas->sufficient_height < wr_mas->vacant_height) + ret = (height - wr_mas->sufficient_height) * 2 + 1; + else + ret = delta * 2 + 1; break; case wr_node_store: ret = mt_in_rcu(mas->tree) ? 1 : 0; diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index e37a3ab2e921..2c0b38301253 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -36326,6 +36326,30 @@ static inline void check_spanning_store_height(struct maple_tree *mt) mas_unlock(&mas); } +/* + * Test to check the path of a spanning rebalance which results in + * a collapse where the rebalancing of the child node leads to + * insufficieny in the parent node. + */ +static void check_collapsing_rebalance(struct maple_tree *mt) +{ + int i = 0; + MA_STATE(mas, mt, ULONG_MAX, ULONG_MAX); + + /* create a height 6 tree */ + while (mt_height(mt) < 6) { + mtree_store_range(mt, i, i + 10, xa_mk_value(i), GFP_KERNEL); + i += 9; + } + + /* delete all entries one at a time, starting from the right */ + do { + mas_erase(&mas); + } while (mas_prev(&mas, 0) != NULL); + + mtree_unlock(mt); +} + /* callback function used for check_nomem_writer_race() */ static void writer2(void *maple_tree) { @@ -36496,6 +36520,10 @@ void farmer_tests(void) check_spanning_store_height(&tree); mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); + check_collapsing_rebalance(&tree); + mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); check_null_expand(&tree); mtree_destroy(&tree); From 2a6ed1b411c5a037033a4dccc65cf4a90e4ae40a Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Thu, 10 Apr 2025 19:14:46 +0000 Subject: [PATCH 113/285] maple_tree: reorder mas->store_type case statements Move the unlikely case that mas->store_type is invalid to be the last evaluated case and put liklier cases higher up. Link: https://lkml.kernel.org/r/20250410191446.2474640-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Suggested-by: Liam R. Howlett Reviewed-by: Liam R. 
Howlett Cc: Matthew Wilcox (Oracle) Cc: Wei Yang Signed-off-by: Andrew Morton --- lib/maple_tree.c | 51 ++++++++++++++++++++++++------------------------ 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index aa139668bcae..affe979bd14d 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4083,15 +4083,6 @@ static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) unsigned char new_end = mas_wr_new_end(wr_mas); switch (mas->store_type) { - case wr_invalid: - MT_BUG_ON(mas->tree, 1); - return; - case wr_new_root: - mas_new_root(mas, wr_mas->entry); - break; - case wr_store_root: - mas_store_root(mas, wr_mas->entry); - break; case wr_exact_fit: rcu_assign_pointer(wr_mas->slots[mas->offset], wr_mas->entry); if (!!wr_mas->entry ^ !!wr_mas->content) @@ -4113,6 +4104,14 @@ static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) case wr_rebalance: mas_wr_bnode(wr_mas); break; + case wr_new_root: + mas_new_root(mas, wr_mas->entry); + break; + case wr_store_root: + mas_store_root(mas, wr_mas->entry); + break; + case wr_invalid: + MT_BUG_ON(mas->tree, 1); } return; @@ -4177,19 +4176,10 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry) unsigned char delta = height - wr_mas->vacant_height; switch (mas->store_type) { - case wr_invalid: - WARN_ON_ONCE(1); - break; - case wr_new_root: - ret = 1; - break; - case wr_store_root: - if (likely((mas->last != 0) || (mas->index != 0))) - ret = 1; - else if (((unsigned long) (entry) & 3) == 2) - ret = 1; - else - ret = 0; + case wr_exact_fit: + case wr_append: + case wr_slot_store: + ret = 0; break; case wr_spanning_store: if (wr_mas->sufficient_height < wr_mas->vacant_height) @@ -4209,10 +4199,19 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry) case wr_node_store: ret = mt_in_rcu(mas->tree) ? 1 : 0; break; - case wr_append: - case wr_exact_fit: - case wr_slot_store: - ret = 0; + case wr_new_root: + ret = 1; + break; + case wr_store_root: + if (likely((mas->last != 0) || (mas->index != 0))) + ret = 1; + else if (((unsigned long) (entry) & 3) == 2) + ret = 1; + else + ret = 0; + break; + case wr_invalid: + WARN_ON_ONCE(1); } return ret; From 585a9145886ad1d06b39ddcd72d457f37ebb4ff4 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Thu, 10 Apr 2025 05:07:48 -0500 Subject: [PATCH 114/285] selftests/mm: restore default nr_hugepages value during cleanup in hugetlb_reparenting_test.sh During cleanup, the value of /proc/sys/vm/nr_hugepages is currently being set to 0. At the end of the test, if all tests pass, the original nr_hugepages value is restored. However, if any test fails, it remains set to 0. With this patch, we ensure that the original nr_hugepages value is restored during cleanup, regardless of whether the test passes or fails. 
Link: https://lkml.kernel.org/r/20250410100748.2310-1-donettom@linux.ibm.com Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests") Signed-off-by: Donet Tom Cc: Li Wang Cc: "Ritesh Harjani (IBM)" Cc: Shuah Khan Cc: Waiman Long Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/hugetlb_reparenting_test.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh index 9245549b66cf..0dd31892ff67 100755 --- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh +++ b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh @@ -56,7 +56,7 @@ function cleanup() { rmdir "$CGROUP_ROOT"/a/b 2>/dev/null rmdir "$CGROUP_ROOT"/a 2>/dev/null rmdir "$CGROUP_ROOT"/test1 2>/dev/null - echo 0 >/proc/sys/vm/nr_hugepages + echo $nr_hugepgs >/proc/sys/vm/nr_hugepages set -e } From 60cada258dfe3694e15c4539259173c37bd5f1fe Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 9 Apr 2025 19:57:52 -0700 Subject: [PATCH 115/285] memcg: optimize memcg_rstat_updated Currently the kernel maintains the stats updates per-memcg which is needed to implement stats flushing threshold. On the update side, the update is added to the per-cpu per-memcg update of the given memcg and all of its ancestors. However when the given memcg has passed the flushing threshold, all of its ancestors should have passed the threshold as well. There is no need to traverse up the memcg tree to maintain the stats updates. Perf profile collected from our fleet shows that memcg_rstat_updated is one of the most expensive memcg function i.e. a lot of cumulative CPU is being spent on it. So, even small micro optimizations matter a lot. This patch is microbenchmarked with multiple instances of netperf on a single machine with locally running netserver and we see couple of percentage of improvement. Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Roman Gushchin Reviewed-by: Yosry Ahmed Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/memcontrol.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aebb1f2c8657..e64ac7942a21 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -592,18 +592,20 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) cgroup_rstat_updated(memcg->css.cgroup, cpu); statc = this_cpu_ptr(memcg->vmstats_percpu); for (; statc; statc = statc->parent) { + /* + * If @memcg is already flushable then all its ancestors are + * flushable as well and also there is no need to increase + * stats_updates. + */ + if (memcg_vmstats_needs_flush(statc->vmstats)) + break; + stats_updates = READ_ONCE(statc->stats_updates) + abs(val); WRITE_ONCE(statc->stats_updates, stats_updates); if (stats_updates < MEMCG_CHARGE_BATCH) continue; - /* - * If @memcg is already flush-able, increasing stats_updates is - * redundant. Avoid the overhead of the atomic update. 
- */ - if (!memcg_vmstats_needs_flush(statc->vmstats)) - atomic64_add(stats_updates, - &statc->vmstats->stats_updates); + atomic64_add(stats_updates, &statc->vmstats->stats_updates); WRITE_ONCE(statc->stats_updates, 0); } } From f736953e2b1f1d7096e70580180f98744c9c9c86 Mon Sep 17 00:00:00 2001 From: Enze Li Date: Fri, 11 Apr 2025 10:43:32 +0800 Subject: [PATCH 116/285] selftests/damon: remove the remaining test scripts for DAMON debugfs interface DAMON has dropped debugfs support; therefore, remove these unused scripts. Link: https://lkml.kernel.org/r/20250411024332.1373861-1-enze.li@linux.dev Fixes: 5ec4333b1967 ("mm/damon: remove DAMON debugfs interface") Signed-off-by: Enze Li Reviewed-by: SeongJae Park Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/Makefile | 2 +- .../selftests/damon/_chk_dependency.sh | 52 --------------- .../selftests/damon/_debugfs_common.sh | 64 ------------------- 3 files changed, 1 insertion(+), 117 deletions(-) delete mode 100644 tools/testing/selftests/damon/_chk_dependency.sh delete mode 100644 tools/testing/selftests/damon/_debugfs_common.sh diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile index ecbf07afc6dd..ff21524be458 100644 --- a/tools/testing/selftests/damon/Makefile +++ b/tools/testing/selftests/damon/Makefile @@ -3,7 +3,7 @@ TEST_GEN_FILES += access_memory access_memory_even -TEST_FILES = _chk_dependency.sh _damon_sysfs.py +TEST_FILES = _damon_sysfs.py # functionality tests TEST_PROGS += sysfs.sh diff --git a/tools/testing/selftests/damon/_chk_dependency.sh b/tools/testing/selftests/damon/_chk_dependency.sh deleted file mode 100644 index dda3a87dc00a..000000000000 --- a/tools/testing/selftests/damon/_chk_dependency.sh +++ /dev/null @@ -1,52 +0,0 @@ -#!/bin/bash -# SPDX-License-Identifier: GPL-2.0 - -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 - -DBGFS=$(grep debugfs /proc/mounts --max-count 1 | awk '{print $2}') -if [ "$DBGFS" = "" ] -then - echo "debugfs not mounted" - exit $ksft_skip -fi - -DBGFS+="/damon" - -if [ $EUID -ne 0 ]; -then - echo "Run as root" - exit $ksft_skip -fi - -if [ ! -d "$DBGFS" ] -then - echo "$DBGFS not found" - exit $ksft_skip -fi - -if [ -f "$DBGFS/monitor_on_DEPRECATED" ] -then - monitor_on_file="monitor_on_DEPRECATED" -else - monitor_on_file="monitor_on" -fi - -for f in attrs target_ids "$monitor_on_file" -do - if [ ! -f "$DBGFS/$f" ] - then - echo "$f not found" - exit 1 - fi -done - -permission_error="Operation not permitted" -for f in attrs target_ids "$monitor_on_file" -do - status=$( cat "$DBGFS/$f" 2>&1 ) - if [ "${status#*$permission_error}" != "$status" ]; then - echo "Permission for reading $DBGFS/$f denied; maybe secureboot enabled?" - exit $ksft_skip - fi -done diff --git a/tools/testing/selftests/damon/_debugfs_common.sh b/tools/testing/selftests/damon/_debugfs_common.sh deleted file mode 100644 index 54d45791b0d9..000000000000 --- a/tools/testing/selftests/damon/_debugfs_common.sh +++ /dev/null @@ -1,64 +0,0 @@ -#!/bin/bash -# SPDX-License-Identifier: GPL-2.0 - -test_write_result() { - file=$1 - content=$2 - orig_content=$3 - expect_reason=$4 - expected=$5 - - if [ "$expected" = "0" ] - then - echo "$content" > "$file" - else - echo "$content" > "$file" 2> /dev/null - fi - if [ $? 
-ne "$expected" ] - then - echo "writing $content to $file doesn't return $expected" - echo "expected because: $expect_reason" - echo "$orig_content" > "$file" - exit 1 - fi -} - -test_write_succ() { - test_write_result "$1" "$2" "$3" "$4" 0 -} - -test_write_fail() { - test_write_result "$1" "$2" "$3" "$4" 1 -} - -test_content() { - file=$1 - orig_content=$2 - expected=$3 - expect_reason=$4 - - content=$(cat "$file") - if [ "$content" != "$expected" ] - then - echo "reading $file expected $expected but $content" - echo "expected because: $expect_reason" - echo "$orig_content" > "$file" - exit 1 - fi -} - -source ./_chk_dependency.sh - -damon_onoff="$DBGFS/monitor_on" -if [ -f "$DBGFS/monitor_on_DEPRECATED" ] -then - damon_onoff="$DBGFS/monitor_on_DEPRECATED" -else - damon_onoff="$DBGFS/monitor_on" -fi - -if [ $(cat "$damon_onoff") = "on" ] -then - echo "monitoring is on" - exit $ksft_skip -fi From 00ccf40ae298262feb13b3ad56b4428647f3aafa Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Tue, 15 Apr 2025 14:15:03 +0200 Subject: [PATCH 117/285] mm, hugetlb: avoid passing a null nodemask when there is mbind policy Before trying to allocate a page, gather_surplus_pages() sets up a nodemask for the nodes we can allocate from, but instead of passing the nodemask down the road to the page allocator, it iterates over the nodes within that nodemask right there, meaning that the page allocator will receive a preferred_nid and a null nodemask. This is a problem when using a memory policy, because it might be that the page allocator ends up using a node as a fallback which is not represented in the policy. Avoid that by passing the nodemask directly to the page allocator, so it can filter out fallback nodes that are not part of the nodemask. Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: Vlastimil Babka Cc: David Hildenbrand Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a2c111447812..38738293e6b6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2419,7 +2419,6 @@ static int gather_surplus_pages(struct hstate *h, long delta) long i; long needed, allocated; bool alloc_ok = true; - int node; nodemask_t *mbind_nodemask, alloc_nodemask; mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h)); @@ -2443,21 +2442,12 @@ retry: for (i = 0; i < needed; i++) { folio = NULL; - /* Prioritize current node */ - if (node_isset(numa_mem_id(), alloc_nodemask)) - folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h), - numa_mem_id(), NULL); - - if (!folio) { - for_each_node_mask(node, alloc_nodemask) { - if (node == numa_mem_id()) - continue; - folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h), - node, NULL); - if (folio) - break; - } - } + /* + * It is okay to use NUMA_NO_NODE because we use numa_mem_id() + * down the road to pick the current node if that is the case. + */ + folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h), + NUMA_NO_NODE, &alloc_nodemask); if (!folio) { alloc_ok = false; break; From 0272d07ef6ebc574e62d6423e5efa5b975f9f910 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Tue, 15 Apr 2025 13:26:46 +0200 Subject: [PATCH 118/285] vmalloc: use atomic_long_add_return_relaxed() Switch from the atomic_long_add_return() to its relaxed version. We do not need a full memory barrier or any memory ordering during increasing the "vmap_lazy_nr" variable. 
What we only need is to do it atomically. This is what atomic_long_add_return_relaxed() guarantees. AARCH64: Default: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8f40016 ldaddal x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 Relaxed: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8340016 ldadd x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Reviewed-by: Baoquan He Cc: Christop Hellwig Cc: Mateusz Guzik Signed-off-by: Andrew Morton --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 9eb3880887b6..dc33ebeb8b1b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2370,7 +2370,7 @@ static void free_vmap_area_noflush(struct vmap_area *va) if (WARN_ON_ONCE(!list_empty(&va->list))) return; - nr_lazy = atomic_long_add_return(va_size(va) >> PAGE_SHIFT, + nr_lazy = atomic_long_add_return_relaxed(va_size(va) >> PAGE_SHIFT, &vmap_lazy_nr); /* From e7a446030bdaf6b4d933c53fd809cb83a0953ffe Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Fri, 11 Apr 2025 15:23:59 +0200 Subject: [PATCH 119/285] mm,hugetlb: allocate frozen pages in alloc_buddy_hugetlb_folio alloc_buddy_hugetlb_folio() allocates a rmappable folio, then strips the rmappable part and freezes it. We can simplify all that by allocating frozen pages directly. Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de Signed-off-by: Oscar Salvador Suggested-by: Vlastimil Babka Reviewed-by: Vlastimil Babka Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 17 +---------------- 1 file changed, 1 insertion(+), 16 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 38738293e6b6..351254ad6ef8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1950,7 +1950,6 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, int order = huge_page_order(h); struct folio *folio; bool alloc_try_hard = true; - bool retry = true; /* * By default we always try hard to allocate the folio with @@ -1965,22 +1964,8 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, gfp_mask |= __GFP_RETRY_MAYFAIL; if (nid == NUMA_NO_NODE) nid = numa_mem_id(); -retry: - folio = __folio_alloc(gfp_mask, order, nid, nmask); - /* Ensure hugetlb folio won't have large_rmappable flag set. */ - if (folio) - folio_clear_large_rmappable(folio); - if (folio && !folio_ref_freeze(folio, 1)) { - folio_put(folio); - if (retry) { /* retry once */ - retry = false; - goto retry; - } - /* WOW! twice in a row. */ - pr_warn("HugeTLB unexpected inflated folio ref count\n"); - folio = NULL; - } + folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask); /* * If we did not specify __GFP_RETRY_MAYFAIL, but still got a From ede27b7ee2e6c0c5148d4bab04e5d287d6fbc771 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Thu, 10 Apr 2025 11:57:15 +0800 Subject: [PATCH 120/285] mm/gup: remove unneeded checking in follow_page_pte() Patch series "mm/gup: Minor fix, cleanup and improvements", v4. These were made from code inspection in mm/gup.c. 
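The FOLL_GET/FOLL_PIN exclusivity relied on below is plain bit arithmetic; as a standalone sketch (the flag values here are made up for illustration and are not the kernel's):

	/* Illustrative only: FOLL_GET and FOLL_PIN may be set
	 * individually, but never together. */
	#define FOLL_GET	(1U << 0)
	#define FOLL_PIN	(1U << 1)

	static int gup_flags_valid_sketch(unsigned int gup_flags)
	{
		return (gup_flags & (FOLL_GET | FOLL_PIN)) !=
		       (FOLL_GET | FOLL_PIN);
	}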
This patch (of 3):

In __get_user_pages(), the page table is traversed and a reference is taken on the page that the given user address corresponds to if GUP_GET or GUP_PIN is set. However, setting both GUP_GET and GUP_PIN is not supported. Even though this check needs to be done, it should be done earlier rather than deferred until we have entered follow_page_pte() and failed there. Furthermore, this checking has already been done in is_valid_gup_args(), and all external users of __get_user_pages() call is_valid_gup_args() to catch the illegal setting. We don't need to worry about internal users of __get_user_pages() because the gup_flags are set correctly by MM code.

Here remove the checking in follow_page_pte(), and add a VM_WARN_ON_ONCE() to catch the possible exceptional setting just in case. Also change the VM_BUG_ON to VM_WARN_ON_ONCE() for checking (!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))) because the checking has been done in is_valid_gup_args() for external users of __get_user_pages().

Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20250410035717.473207-3-bhe@redhat.com
Signed-off-by: Baoquan He
Reviewed-by: Oscar Salvador
Acked-by: David Hildenbrand
Cc: Yanjun.Zhu
Signed-off-by: Andrew Morton
---
 mm/gup.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..eb668da933e1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -844,11 +844,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	pte_t *ptep, pte;
 	int ret;
 
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
-			 (FOLL_PIN | FOLL_GET)))
-		return ERR_PTR(-EINVAL);
-
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!ptep)
 		return no_page_table(vma, flags, address);
@@ -1432,7 +1427,11 @@ static long __get_user_pages(struct mm_struct *mm,
 
 	start = untagged_addr_remote(mm, start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
+	VM_WARN_ON_ONCE(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	VM_WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
+			(FOLL_PIN | FOLL_GET));
 
 	do {
 		struct page *page;

From 339122abb556db2cf497c929a2b6196e37385aee Mon Sep 17 00:00:00 2001
From: Baoquan He
Date: Thu, 10 Apr 2025 11:57:16 +0800
Subject: [PATCH 121/285] mm/gup: remove gup_fast_pgd_leaf() and clean up the relevant codes

In the current kernel, only pud huge pages are supported on some architectures. P4d and pgd huge pages haven't been supported yet. And in mm/gup.c, there's no pgd huge page handling in the follow_page_mask() code path. Hence it doesn't make sense to keep gup_fast_pgd_leaf() only in the gup_fast code path.

Here remove gup_fast_pgd_leaf() and clean up the relevant code.
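The BUILD_BUG_ON(pgd_leaf(pgd)) left behind turns the "no pgd-level leaves" assumption into a compile-time tripwire: on architectures where pgd_leaf() constant-folds to 0, the check vanishes entirely. A minimal standalone equivalent of that trick (illustrative only, not the kernel's actual macro):

	/* Illustrative only: the array size goes negative - a compile
	 * error - whenever cond can be proven true at compile time. */
	#define BUILD_BUG_ON_SKETCH(cond) \
		((void)sizeof(char[1 - 2 * !!(cond)]))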
Link: https://lkml.kernel.org/r/20250410035717.473207-4-bhe@redhat.com
Signed-off-by: Baoquan He
Reviewed-by: Oscar Salvador
Acked-by: David Hildenbrand
Cc: Yanjun.Zhu
Signed-off-by: Andrew Morton
---
 mm/gup.c | 49 +++----------------------------------------------
 1 file changed, 3 insertions(+), 46 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index eb668da933e1..77a5bc622567 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3172,46 +3172,6 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	return 1;
 }
 
-static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages,
-		int *nr)
-{
-	int refs;
-	struct page *page;
-	struct folio *folio;
-
-	if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
-		return 0;
-
-	BUILD_BUG_ON(pgd_devmap(orig));
-
-	page = pgd_page(orig);
-	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
-
-	folio = try_grab_folio_fast(page, refs, flags);
-	if (!folio)
-		return 0;
-
-	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	if (!pgd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	if (!gup_fast_folio_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	*nr += refs;
-	folio_set_referenced(folio);
-	return 1;
-}
-
 static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 		unsigned long end, unsigned int flags, struct page **pages,
 		int *nr)
@@ -3306,12 +3266,9 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			return;
-		if (unlikely(pgd_leaf(pgd))) {
-			if (!gup_fast_pgd_leaf(pgd, pgdp, addr, next, flags,
-					       pages, nr))
-				return;
-		} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
-					       pages, nr))
+		BUILD_BUG_ON(pgd_leaf(pgd));
+		if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
+					pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }

From a7797e74bd397e7a9c3095bf212e7059a128155a Mon Sep 17 00:00:00 2001
From: Baoquan He
Date: Thu, 10 Apr 2025 11:57:17 +0800
Subject: [PATCH 122/285] mm/gup: clean up codes in fault_in_xxx() functions

The code style in fault_in_readable() and fault_in_writeable() is a little
inconsistent with fault_in_safe_writeable(). fault_in_readable() and
fault_in_writeable() use the 'uaddr' passed in as the loop cursor, while
fault_in_safe_writeable() uses the local variable 'start'. This may
mislead people when reading or changing the code.

Define an explicit loop cursor and use a for loop to simplify the code in
these three functions. This cleanup makes them consistent in code style
and improves readability.
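As a rough userspace illustration of the new loop shape (PAGE_SIZE,
PAGE_ALIGN_DOWN() and walk_pages() below are stand-ins defined here, not
the kernel's definitions): the cursor starts at the possibly unaligned
address, each step rounds down to the next page boundary, and the
`cur &&` test stops the walk cleanly if the cursor wraps past the top of
the address space.

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SIZE 4096UL
  #define PAGE_ALIGN_DOWN(x) ((x) & ~(PAGE_SIZE - 1))

  /* Visit one byte per page of [start, start + size); print instead of
   * touching memory, since these illustrative addresses are not mapped. */
  static void walk_pages(uintptr_t start, size_t size)
  {
          const uintptr_t end = start + size;
          uintptr_t cur;

          /* Stop once we overflow to 0, as in the patched kernel loops. */
          for (cur = start; cur && cur < end;
               cur = PAGE_ALIGN_DOWN(cur + PAGE_SIZE))
                  printf("  visit %#lx\n", (unsigned long)cur);
  }

  int main(void)
  {
          /* Unaligned start: the second visit lands on a page boundary. */
          walk_pages(0x1100, 2 * PAGE_SIZE);
          /* Range ending at the very top of the address space: the cursor
           * wraps to 0 after the last page and the loop terminates. */
          walk_pages(UINTPTR_MAX - 2 * PAGE_SIZE, 2 * PAGE_SIZE);
          return 0;
  }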
[bhe@redhat.com: address minor concerns from David] Link: https://lkml.kernel.org/r/Z/sbv3EmLXWgEE7+@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20250410035717.473207-5-bhe@redhat.com Signed-off-by: Baoquan He Cc: David Hildenbrand Cc: Oscar Salvador Cc: Yanjun.Zhu Signed-off-by: Andrew Morton --- mm/gup.c | 62 ++++++++++++++++++++++---------------------------------- 1 file changed, 24 insertions(+), 38 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 77a5bc622567..f32168339390 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2113,28 +2113,22 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, */ size_t fault_in_writeable(char __user *uaddr, size_t size) { - char __user *start = uaddr, *end; + const unsigned long start = (unsigned long)uaddr; + const unsigned long end = start + size; + unsigned long cur; if (unlikely(size == 0)) return 0; if (!user_write_access_begin(uaddr, size)) return size; - if (!PAGE_ALIGNED(uaddr)) { - unsafe_put_user(0, uaddr, out); - uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); - } - end = (char __user *)PAGE_ALIGN((unsigned long)start + size); - if (unlikely(end < start)) - end = NULL; - while (uaddr != end) { - unsafe_put_user(0, uaddr, out); - uaddr += PAGE_SIZE; - } + /* Stop once we overflow to 0. */ + for (cur = start; cur && cur < end; cur = PAGE_ALIGN_DOWN(cur + PAGE_SIZE)) + unsafe_put_user(0, (char __user *)cur, out); out: user_write_access_end(); - if (size > uaddr - start) - return size - (uaddr - start); + if (size > cur - start) + return size - (cur - start); return 0; } EXPORT_SYMBOL(fault_in_writeable); @@ -2188,26 +2182,24 @@ EXPORT_SYMBOL(fault_in_subpage_writeable); */ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size) { - unsigned long start = (unsigned long)uaddr, end; + const unsigned long start = (unsigned long)uaddr; + const unsigned long end = start + size; + unsigned long cur; struct mm_struct *mm = current->mm; bool unlocked = false; if (unlikely(size == 0)) return 0; - end = PAGE_ALIGN(start + size); - if (end < start) - end = 0; mmap_read_lock(mm); - do { - if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked)) + /* Stop once we overflow to 0. */ + for (cur = start; cur && cur < end; cur = PAGE_ALIGN_DOWN(cur + PAGE_SIZE)) + if (fixup_user_fault(mm, cur, FAULT_FLAG_WRITE, &unlocked)) break; - start = (start + PAGE_SIZE) & PAGE_MASK; - } while (start != end); mmap_read_unlock(mm); - if (size > start - (unsigned long)uaddr) - return size - (start - (unsigned long)uaddr); + if (size > cur - start) + return size - (cur - start); return 0; } EXPORT_SYMBOL(fault_in_safe_writeable); @@ -2222,30 +2214,24 @@ EXPORT_SYMBOL(fault_in_safe_writeable); */ size_t fault_in_readable(const char __user *uaddr, size_t size) { - const char __user *start = uaddr, *end; + const unsigned long start = (unsigned long)uaddr; + const unsigned long end = start + size; + unsigned long cur; volatile char c; if (unlikely(size == 0)) return 0; if (!user_read_access_begin(uaddr, size)) return size; - if (!PAGE_ALIGNED(uaddr)) { - unsafe_get_user(c, uaddr, out); - uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr); - } - end = (const char __user *)PAGE_ALIGN((unsigned long)start + size); - if (unlikely(end < start)) - end = NULL; - while (uaddr != end) { - unsafe_get_user(c, uaddr, out); - uaddr += PAGE_SIZE; - } + /* Stop once we overflow to 0. 
 */
+	for (cur = start; cur && cur < end; cur = PAGE_ALIGN_DOWN(cur + PAGE_SIZE))
+		unsafe_get_user(c, (const char __user *)cur, out);
 out:
 	user_read_access_end();
 	(void)c;
-	if (size > uaddr - start)
-		return size - (uaddr - start);
+	if (size > cur - start)
+		return size - (cur - start);
 	return 0;
 }
 EXPORT_SYMBOL(fault_in_readable);

From 1477b8cd26883a78362057954a247dab3bc5d2fb Mon Sep 17 00:00:00 2001
From: Enze Li
Date: Fri, 11 Apr 2025 15:38:00 +0800
Subject: [PATCH 123/285] samples/damon/prcl: fix a comment typo

This patch just fixes a typo in the comment.

Link: https://lkml.kernel.org/r/20250411073800.1444481-1-lienze@kylinos.cn
Signed-off-by: Enze Li
Reviewed-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 samples/damon/prcl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c
index c3acbdab7a62..056b1b21a0fe 100644
--- a/samples/damon/prcl.c
+++ b/samples/damon/prcl.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
  * proactive reclamation: monitor access pattern of a given process, find
- * regiosn that seems not accessed, and proactively page out the regions.
+ * regions that seems not accessed, and proactively page out the regions.
  */
 
 #define pr_fmt(fmt) "damon_sample_prcl: " fmt

From df2bbc47e707c7836756159e3530266656a9ea87 Mon Sep 17 00:00:00 2001
From: Hao Ge
Date: Thu, 17 Apr 2025 17:24:22 +0800
Subject: [PATCH 124/285] mm/vmscan: modify the assignment logic of the scan
 and total_scan variables

The scan and total_scan variables can be initialized to 0 when they are
defined, replacing the separate assignment statements.

Link: https://lkml.kernel.org/r/20250417092422.1333620-1-hao.ge@linux.dev
Signed-off-by: Hao Ge
Acked-by: Dev Jain
Signed-off-by: Andrew Morton
---
 mm/vmscan.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3783e45bfc92..ceffb20bb8c9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1725,13 +1725,11 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
-	unsigned long skipped = 0;
-	unsigned long scan, total_scan, nr_pages;
+	unsigned long skipped = 0, total_scan = 0, scan = 0;
+	unsigned long nr_pages;
 	unsigned long max_nr_skipped = 0;
 	LIST_HEAD(folios_skipped);
 
-	total_scan = 0;
-	scan = 0;
 	while (scan < nr_to_scan && !list_empty(src)) {
 		struct list_head *move_to = src;
 		struct folio *folio;

From 4f219913c1368a372b07832451cb86ea3a1df31c Mon Sep 17 00:00:00 2001
From: gaoxu
Date: Sat, 12 Apr 2025 09:27:24 +0000
Subject: [PATCH 125/285] mm: add nr_free_highatomic in show_free_areas

commit c928807f6f6b6 ("mm/page_alloc: keep track of free highatomic") adds
a new variable, nr_free_highatomic, which is useful for analyzing
low-memory issues. Add nr_free_highatomic to show_free_areas().
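With this change applied, the per-zone line emitted by show_free_areas()
(seen in OOM reports and SysRq-m dumps) gains a free_highatomic field next
to the existing reserved_highatomic one. A hypothetical excerpt, with
made-up values:

  Normal free:25096kB ... low:10240kB high:12288kB reserved_highatomic:2048KB free_highatomic:1024KB active_anon:102400kB ...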
Signed-off-by: gao xu
Link: https://lkml.kernel.org/r/d92eeff74f7a4578a14ac777cfe3603a@honor.com
Acked-by: David Hildenbrand
Reviewed-by: Barry Song
Acked-by: David Rientjes
Reviewed-by: Anshuman Khandual
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---
 mm/show_mem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/show_mem.c b/mm/show_mem.c
index ad373b4b6e39..03e8d968fd1a 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -305,6 +305,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" low:%lukB"
 			" high:%lukB"
 			" reserved_highatomic:%luKB"
+			" free_highatomic:%luKB"
 			" active_anon:%lukB"
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
@@ -326,6 +327,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
 			K(zone->nr_reserved_highatomic),
+			K(zone->nr_free_highatomic),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),

From 06340b927051bf71b59a9cd4cff3417247318251 Mon Sep 17 00:00:00 2001
From: Fan Ni
Date: Wed, 16 Apr 2025 13:12:15 -0700
Subject: [PATCH 126/285] mm: convert free_page_and_swap_cache() to
 free_folio_and_swap_cache()

free_page_and_swap_cache() takes a struct page pointer as its input
parameter, but it immediately converts it to a folio, and all of the
operations that follow use the folio instead of the page. It makes more
sense to pass the folio in directly.

Convert free_page_and_swap_cache() to free_folio_and_swap_cache() so that
it consumes a folio directly.

Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com
Signed-off-by: Fan Ni
Acked-by: Davidlohr Bueso
Acked-by: David Hildenbrand
Reviewed-by: Zi Yan
Reviewed-by: Vishal Moola (Oracle)
Reviewed-by: Matthew Wilcox (Oracle)
Cc: Adam Manzanares
Cc: "Aneesh Kumar K.V"
Cc: Heiko Carstens
Cc: Luis Chamberlain
Cc: Vasily Gorbik
Cc: Will Deacon
Signed-off-by: Andrew Morton
---
 arch/s390/include/asm/tlb.h | 4 ++--
 include/linux/swap.h        | 8 +++-----
 mm/huge_memory.c            | 2 +-
 mm/khugepaged.c             | 2 +-
 mm/swap_state.c             | 8 +++-----
 5 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index f20601995bb0..e5103e8e697d 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -40,7 +40,7 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 /*
  * Release the page cache reference for a pte removed by
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
- * has already been freed, so just do free_page_and_swap_cache.
+ * has already been freed, so just do free_folio_and_swap_cache.
  *
 * s390 doesn't delay rmap removal.
 */
@@ -49,7 +49,7 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 {
 	VM_WARN_ON_ONCE(delay_rmap);
 
-	free_page_and_swap_cache(page);
+	free_folio_and_swap_cache(page_folio(page));
 	return false;
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index db46b25a65ae..4e4e27d3ce3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -450,7 +450,7 @@ static inline unsigned long total_swapcache_pages(void)
 }
 
 void free_swap_cache(struct folio *folio);
-void free_page_and_swap_cache(struct page *);
+void free_folio_and_swap_cache(struct folio *folio);
 void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
@@ -520,10 +520,8 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
 
-/* only sparc can not include linux/pagemap.h in this file
- * so leave put_page and release_pages undeclared... */
-#define free_page_and_swap_cache(page) \
-	put_page(page)
+#define free_folio_and_swap_cache(folio) \
+	folio_put(folio)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5576a08a593d..fdcf0a6049b9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3653,7 +3653,7 @@ after_split:
 		 * requires taking the lru_lock so we do the put_page
 		 * of the tail pages after the split is complete.
 		 */
-		free_page_and_swap_cache(&new_folio->page);
+		free_folio_and_swap_cache(new_folio);
 	}
 	return ret;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8838ba8207a..5cf204ab6af0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -746,7 +746,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			ptep_clear(vma->vm_mm, address, _pte);
 			folio_remove_rmap_pte(src, src_page, vma);
 			spin_unlock(ptl);
-			free_page_and_swap_cache(src_page);
+			free_folio_and_swap_cache(src);
 		}
 	}
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 68fd981b514f..ac4e0994931c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -232,13 +232,11 @@ void free_swap_cache(struct folio *folio)
 }
 
 /*
- * Perform a free_page(), also freeing any swap cache associated with
- * this page if it is the last user of the page.
+ * Freeing a folio and also freeing any swap cache associated with
+ * this folio if it is the last user.
 */
-void free_page_and_swap_cache(struct page *page)
+void free_folio_and_swap_cache(struct folio *folio)
 {
-	struct folio *folio = page_folio(page);
-
 	free_swap_cache(folio);
 	if (!is_huge_zero_folio(folio))
 		folio_put(folio);

From f735eebe55f8f61758fe014bd0b02ab50b059e4d Mon Sep 17 00:00:00 2001
From: Shakeel Butt
Date: Wed, 16 Apr 2025 11:02:29 -0700
Subject: [PATCH 127/285] memcg: multi-memcg percpu charge cache

Memory cgroup accounting is expensive, and to reduce that cost the kernel
maintains a per-cpu charge cache for a single memcg. So, if a charge
request comes in for a different memcg, the kernel flushes the old memcg's
charge cache, charges the newer memcg a fixed amount (64 pages), subtracts
the requested amount, and stores the remainder in the per-cpu charge cache
for the newer memcg. This mechanism is based on the assumption that the
kernel, for locality, keeps a process on a CPU for a long period of time,
and that most charge requests from that process will be served by that
CPU's local charge cache.

However, this assumption breaks down for incoming network traffic in a
multi-tenant machine.
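To see why a single-slot cache behaves so badly under interleaved traffic,
consider this small userspace simulation of the two designs (BATCH, NSLOTS
and the eviction policy are deliberate simplifications of the real
consume_stock()/refill_stock() logic, not kernel code): with one slot,
packets alternating between two memcgs miss the cache on every charge;
with a small array, misses drop to roughly one per 64 charges.

  #include <stdio.h>

  #define BATCH  64   /* pages cached per refill, like MEMCG_CHARGE_BATCH */
  #define NSLOTS 7    /* like NR_MEMCG_STOCK in this patch */

  struct slot { int memcg; int pages; };

  static long slow_charges;

  /* Charge one page for `memcg` using an `n`-slot per-CPU cache. */
  static void charge_one(struct slot *s, int n, int memcg)
  {
          int i;

          /* Fast path: consume from a slot already caching this memcg. */
          for (i = 0; i < n; i++)
                  if (s[i].memcg == memcg && s[i].pages > 0) {
                          s[i].pages--;
                          return;
                  }

          /* Slow path: charge a full batch upstream, cache the rest. */
          slow_charges++;
          for (i = 0; i < n; i++)
                  if (s[i].memcg == memcg) { s[i].pages += BATCH - 1; return; }
          for (i = 0; i < n; i++)
                  if (s[i].memcg == 0) {
                          s[i].memcg = memcg;
                          s[i].pages = BATCH - 1;
                          return;
                  }
          /* No free slot: evict one (the patch picks a random victim). */
          s[0].memcg = memcg;
          s[0].pages = BATCH - 1;
  }

  static long run(int nslots)
  {
          struct slot slots[NSLOTS] = { { 0, 0 } };
          int i;

          slow_charges = 0;
          /* net_rx_action-style interleaving: alternate two memcgs. */
          for (i = 0; i < 100000; i++)
                  charge_one(slots, nslots, 1 + (i & 1));
          return slow_charges;
  }

  int main(void)
  {
          printf("1-slot cache: %ld slow-path charges\n", run(1));
          printf("%d-slot cache: %ld slow-path charges\n", NSLOTS, run(NSLOTS));
          return 0;
  }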
We are in the process of running multiple workloads on a single machine,
and if such workloads are network heavy, we see very high network memory
accounting cost. We have observed multiple CPUs spending almost 100% of
their time in net_rx_action, and almost all of that time is spent in memcg
accounting of the network traffic.

More precisely, net_rx_action is serving packets from multiple workloads
and is observing/serving a mix of packets from these workloads. The memcg
switch of the per-cpu cache is very expensive and we are observing a lot
of memcg switches on the machine. Almost all the time is being spent on
charging new memcgs and flushing older memcg caches. So we clearly need a
per-cpu cache that supports multiple memcgs for this scenario.

This patch implements a simple (and dumb) multi-memcg percpu charge cache.
We actually started with a more sophisticated LRU-based approach, but the
dumb one was consistently better than the sophisticated one by 1% to 3%,
so we are going with the simple approach.

Some of the design choices are:

1. Fit all cached memcgs in a single cacheline.
2. The cache array can be a mix of empty slots and memcg-charged slots,
   so the kernel has to traverse the full array.
3. The cache drain from reclaim drains all cached memcgs, to keep things
   simple.

To evaluate the impact of this optimization, on a 72-CPU machine, we ran
the following workload where each netperf client runs in a different
cgroup. The next-20250415 kernel is used as base.

  $ netserver -6
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

  number of clients | Without patch | With patch
  6                 | 42584.1 Mbps  | 48603.4 Mbps (14.13% improvement)
  12                | 30617.1 Mbps  | 47919.7 Mbps (56.51% improvement)
  18                | 25305.2 Mbps  | 45497.3 Mbps (79.79% improvement)
  24                | 20104.1 Mbps  | 37907.7 Mbps (88.55% improvement)
  30                | 14702.4 Mbps  | 30746.5 Mbps (109.12% improvement)
  36                | 10801.5 Mbps  | 26476.3 Mbps (145.11% improvement)

The results show drastic improvement for network-intensive workloads.

[shakeel.butt@linux.dev: add BUILD_BUG_ON() for MEMCG_CHARGE_BATCH]
  Link: https://lkml.kernel.org/r/rlsgeosg3j7v5nihhbxxxbv3xfy4ejvigihj7lkkbt3n6imyne@2apxx2jm2e57
[shakeel.butt@linux.dev: simplify refill_stock]
  Link: https://lkml.kernel.org/r/as5cdsm4lraxupg3t6onep2ixql72za25hvd4x334dsoyo4apr@zyzl4vkuevuv
[hughd@google.com: it's better to stock nr_pages than the uninitialized stock_pages]
  Link: https://lkml.kernel.org/r/d542d18f-1caa-6fea-e2c3-3555c87bcf64@google.com
[shakeel.butt@linux.dev: add comment per Michal and use DEFINE_PER_CPU_ALIGNED instead of DEFINE_PER_CPU per Vlastimil]
  Link: https://lkml.kernel.org/r/dieeei3squ2gcnqxdjayvxbvzldr266rhnvtl3vjzsqevxkevf@ckui5vjzl2qg
Link: https://lkml.kernel.org/r/20250416180229.2902751-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt
Signed-off-by: Hugh Dickins
Acked-by: Jakub Kicinski
Acked-by: Michal Hocko
Cc: Eric Dumazet
Cc: Johannes Weiner
Cc: Muchun Song
Cc: Roman Gushchin
Cc: Soheil Hassas Yeganeh
Cc: Vlastimil Babka
Cc: Michal Hocko
Signed-off-by: Andrew Morton
---
 mm/memcontrol.c | 144 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 106 insertions(+), 38 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e64ac7942a21..3020bb82c94c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1769,10 +1769,15 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 	pr_cont(" are going to be killed due to memory.oom.group set\n");
 }
 
+/*
+ * The value of NR_MEMCG_STOCK is selected to keep the cached memcgs and their
+ * nr_pages in a single cacheline.
This may change in future. + */ +#define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { local_trylock_t stock_lock; - struct mem_cgroup *cached; /* this never be root cgroup */ - unsigned int nr_pages; + uint8_t nr_pages[NR_MEMCG_STOCK]; + struct mem_cgroup *cached[NR_MEMCG_STOCK]; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; @@ -1784,7 +1789,7 @@ struct memcg_stock_pcp { unsigned long flags; #define FLUSHING_CACHED_CHARGE 0 }; -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { +static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock), }; static DEFINE_MUTEX(percpu_charge_mutex); @@ -1809,9 +1814,10 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask) { struct memcg_stock_pcp *stock; - unsigned int stock_pages; + uint8_t stock_pages; unsigned long flags; bool ret = false; + int i; if (nr_pages > MEMCG_CHARGE_BATCH) return ret; @@ -1822,10 +1828,17 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, return ret; stock = this_cpu_ptr(&memcg_stock); - stock_pages = READ_ONCE(stock->nr_pages); - if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) { - WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages); - ret = true; + + for (i = 0; i < NR_MEMCG_STOCK; ++i) { + if (memcg != READ_ONCE(stock->cached[i])) + continue; + + stock_pages = READ_ONCE(stock->nr_pages[i]); + if (stock_pages >= nr_pages) { + WRITE_ONCE(stock->nr_pages[i], stock_pages - nr_pages); + ret = true; + } + break; } local_unlock_irqrestore(&memcg_stock.stock_lock, flags); @@ -1843,21 +1856,30 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages) /* * Returns stocks cached in percpu and reset cached information. */ -static void drain_stock(struct memcg_stock_pcp *stock) +static void drain_stock(struct memcg_stock_pcp *stock, int i) { - unsigned int stock_pages = READ_ONCE(stock->nr_pages); - struct mem_cgroup *old = READ_ONCE(stock->cached); + struct mem_cgroup *old = READ_ONCE(stock->cached[i]); + uint8_t stock_pages; if (!old) return; + stock_pages = READ_ONCE(stock->nr_pages[i]); if (stock_pages) { memcg_uncharge(old, stock_pages); - WRITE_ONCE(stock->nr_pages, 0); + WRITE_ONCE(stock->nr_pages[i], 0); } css_put(&old->css); - WRITE_ONCE(stock->cached, NULL); + WRITE_ONCE(stock->cached[i], NULL); +} + +static void drain_stock_fully(struct memcg_stock_pcp *stock) +{ + int i; + + for (i = 0; i < NR_MEMCG_STOCK; ++i) + drain_stock(stock, i); } static void drain_local_stock(struct work_struct *dummy) @@ -1874,7 +1896,7 @@ static void drain_local_stock(struct work_struct *dummy) stock = this_cpu_ptr(&memcg_stock); drain_obj_stock(stock); - drain_stock(stock); + drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); @@ -1883,35 +1905,91 @@ static void drain_local_stock(struct work_struct *dummy) static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; - unsigned int stock_pages; + struct mem_cgroup *cached; + uint8_t stock_pages; unsigned long flags; + bool success = false; + int empty_slot = -1; + int i; + + /* + * For now limit MEMCG_CHARGE_BATCH to 127 and less. In future if we + * decide to increase it more than 127 then we will need more careful + * handling of nr_pages[] in struct memcg_stock_pcp. 
+ */ + BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S8_MAX); VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); - if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { + if (nr_pages > MEMCG_CHARGE_BATCH || + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { /* - * In case of unlikely failure to lock percpu stock_lock - * uncharge memcg directly. + * In case of larger than batch refill or unlikely failure to + * lock the percpu stock_lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; } stock = this_cpu_ptr(&memcg_stock); - if (READ_ONCE(stock->cached) != memcg) { /* reset if necessary */ - drain_stock(stock); - css_get(&memcg->css); - WRITE_ONCE(stock->cached, memcg); + for (i = 0; i < NR_MEMCG_STOCK; ++i) { + cached = READ_ONCE(stock->cached[i]); + if (!cached && empty_slot == -1) + empty_slot = i; + if (memcg == READ_ONCE(stock->cached[i])) { + stock_pages = READ_ONCE(stock->nr_pages[i]) + nr_pages; + WRITE_ONCE(stock->nr_pages[i], stock_pages); + if (stock_pages > MEMCG_CHARGE_BATCH) + drain_stock(stock, i); + success = true; + break; + } } - stock_pages = READ_ONCE(stock->nr_pages) + nr_pages; - WRITE_ONCE(stock->nr_pages, stock_pages); - if (stock_pages > MEMCG_CHARGE_BATCH) - drain_stock(stock); + if (!success) { + i = empty_slot; + if (i == -1) { + i = get_random_u32_below(NR_MEMCG_STOCK); + drain_stock(stock, i); + } + css_get(&memcg->css); + WRITE_ONCE(stock->cached[i], memcg); + WRITE_ONCE(stock->nr_pages[i], nr_pages); + } local_unlock_irqrestore(&memcg_stock.stock_lock, flags); } +static bool is_drain_needed(struct memcg_stock_pcp *stock, + struct mem_cgroup *root_memcg) +{ + struct mem_cgroup *memcg; + bool flush = false; + int i; + + rcu_read_lock(); + + if (obj_stock_flush_required(stock, root_memcg)) { + flush = true; + goto out; + } + + for (i = 0; i < NR_MEMCG_STOCK; ++i) { + memcg = READ_ONCE(stock->cached[i]); + if (!memcg) + continue; + + if (READ_ONCE(stock->nr_pages[i]) && + mem_cgroup_is_descendant(memcg, root_memcg)) { + flush = true; + break; + } + } +out: + rcu_read_unlock(); + return flush; +} + /* * Drains all per-CPU charge caches for given root_memcg resp. subtree * of the hierarchy under it. @@ -1933,17 +2011,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg) curcpu = smp_processor_id(); for_each_online_cpu(cpu) { struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); - struct mem_cgroup *memcg; - bool flush = false; - - rcu_read_lock(); - memcg = READ_ONCE(stock->cached); - if (memcg && READ_ONCE(stock->nr_pages) && - mem_cgroup_is_descendant(memcg, root_memcg)) - flush = true; - else if (obj_stock_flush_required(stock, root_memcg)) - flush = true; - rcu_read_unlock(); + bool flush = is_drain_needed(stock, root_memcg); if (flush && !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { @@ -1969,7 +2037,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) drain_obj_stock(stock); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - drain_stock(stock); + drain_stock_fully(stock); return 0; } From 75404e07663b1622948944cf31531fa87cb1785d Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 16 Apr 2025 11:38:36 +0100 Subject: [PATCH 128/285] mm: move mmap/vma locking logic into specific files Currently the VMA and mmap locking logic is entangled in two of the most overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate this logic out so we can more easily make changes and create an appropriate MAINTAINERS entry that spans only the logic relating to locking. This should have no functional change. 
Care is taken to avoid dependency loops: we must regrettably keep
release_fault_lock() and assert_fault_locked() in mm.h because they depend
on the vm_fault type declared in that header.

Additionally we must declare rcuwait_wake_up() manually to avoid a
dependency cycle on linux/rcuwait.h.

We also move the nommu implementation of lock_mm_and_find_vma() to
mmap_lock.c so everything lock-related is in one place.

Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Liam R. Howlett
Reviewed-by: Vlastimil Babka
Cc: David Hildenbrand
Cc: Matthew Wilcox (Oracle)
Cc: "Paul E. McKenney"
Cc: SeongJae Park
Cc: Shakeel Butt
Signed-off-by: Andrew Morton
---
 include/linux/mm.h        | 233 +-------------------------------
 include/linux/mmap_lock.h | 227 +++++++++++++++++++++++++++++++
 mm/memory.c               | 252 -----------------------------------
 mm/mmap_lock.c            | 273 ++++++++++++++++++++++++++++++++++++++
 mm/nommu.c                |  16 ---
 5 files changed, 505 insertions(+), 496 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5eb0d77c4438..9b701cfbef22 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -671,204 +671,11 @@ static inline void vma_numab_state_init(struct vm_area_struct *vma) {}
 static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
+/*
+ * These must be here rather than mmap_lock.h as dependent on vm_fault type,
+ * declared in this header.
+ */
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	static struct lock_class_key lockdep_key;
-
-	lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
-#endif
-	if (reset_refcnt)
-		refcount_set(&vma->vm_refcnt, 0);
-	vma->vm_lock_seq = UINT_MAX;
-}
-
-static inline bool is_vma_writer_only(int refcnt)
-{
-	/*
-	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
-	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
-	 * a detached vma happens only in vma_mark_detached() and is a rare
-	 * case, therefore most of the time there will be no unnecessary wakeup.
-	 */
-	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
-}
-
-static inline void vma_refcount_put(struct vm_area_struct *vma)
-{
-	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
-	struct mm_struct *mm = vma->vm_mm;
-	int oldcnt;
-
-	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
-	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
-
-		if (is_vma_writer_only(oldcnt - 1))
-			rcuwait_wake_up(&mm->vma_writer_wait);
-	}
-}
-
-/*
- * Try to read-lock a vma. The function is allowed to occasionally yield false
- * locked result to avoid performance overhead, in which case we fall back to
- * using mmap_lock. The function should never yield false unlocked result.
- * False locked result is possible if mm_lock_seq overflows or if vma gets
- * reused and attached to a different mm before we lock it.
- * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
- * detached.
- */
-static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
-						    struct vm_area_struct *vma)
-{
-	int oldcnt;
-
-	/*
-	 * Check before locking. A race might cause false locked result.
- * We can use READ_ONCE() for the mm_lock_seq here, and don't need - * ACQUIRE semantics, because this is just a lockless check whose result - * we don't rely on for anything - the mm_lock_seq read against which we - * need ordering is below. - */ - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) - return NULL; - - /* - * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire() - * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET. - * Acquire fence is required here to avoid reordering against later - * vm_lock_seq check and checks inside lock_vma_under_rcu(). - */ - if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt, - VMA_REF_LIMIT))) { - /* return EAGAIN if vma got detached from under us */ - return oldcnt ? NULL : ERR_PTR(-EAGAIN); - } - - rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); - /* - * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result. - * False unlocked result is impossible because we modify and check - * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq - * modification invalidates all existing locks. - * - * We must use ACQUIRE semantics for the mm_lock_seq so that if we are - * racing with vma_end_write_all(), we only start reading from the VMA - * after it has been unlocked. - * This pairs with RELEASE semantics in vma_end_write_all(). - */ - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) { - vma_refcount_put(vma); - return NULL; - } - - return vma; -} - -/* - * Use only while holding mmap read lock which guarantees that locking will not - * fail (nobody can concurrently write-lock the vma). vma_start_read() should - * not be used in such cases because it might fail due to mm_lock_seq overflow. - * This functionality is used to obtain vma read lock and drop the mmap read lock. - */ -static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass) -{ - int oldcnt; - - mmap_assert_locked(vma->vm_mm); - if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt, - VMA_REF_LIMIT))) - return false; - - rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); - return true; -} - -/* - * Use only while holding mmap read lock which guarantees that locking will not - * fail (nobody can concurrently write-lock the vma). vma_start_read() should - * not be used in such cases because it might fail due to mm_lock_seq overflow. - * This functionality is used to obtain vma read lock and drop the mmap read lock. - */ -static inline bool vma_start_read_locked(struct vm_area_struct *vma) -{ - return vma_start_read_locked_nested(vma, 0); -} - -static inline void vma_end_read(struct vm_area_struct *vma) -{ - vma_refcount_put(vma); -} - -/* WARNING! Can only be used if mmap_lock is expected to be write-locked */ -static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq) -{ - mmap_assert_write_locked(vma->vm_mm); - - /* - * current task is holding mmap_write_lock, both vma->vm_lock_seq and - * mm->mm_lock_seq can't be concurrently modified. - */ - *mm_lock_seq = vma->vm_mm->mm_lock_seq.sequence; - return (vma->vm_lock_seq == *mm_lock_seq); -} - -void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq); - -/* - * Begin writing to a VMA. - * Exclude concurrent readers under the per-VMA lock until the currently - * write-locked mmap_lock is dropped or downgraded. 
- */ -static inline void vma_start_write(struct vm_area_struct *vma) -{ - unsigned int mm_lock_seq; - - if (__is_vma_write_locked(vma, &mm_lock_seq)) - return; - - __vma_start_write(vma, mm_lock_seq); -} - -static inline void vma_assert_write_locked(struct vm_area_struct *vma) -{ - unsigned int mm_lock_seq; - - VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); -} - -static inline void vma_assert_locked(struct vm_area_struct *vma) -{ - unsigned int mm_lock_seq; - - VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 && - !__is_vma_write_locked(vma, &mm_lock_seq), vma); -} - -/* - * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these - * assertions should be made either under mmap_write_lock or when the object - * has been isolated under mmap_write_lock, ensuring no competing writers. - */ -static inline void vma_assert_attached(struct vm_area_struct *vma) -{ - WARN_ON_ONCE(!refcount_read(&vma->vm_refcnt)); -} - -static inline void vma_assert_detached(struct vm_area_struct *vma) -{ - WARN_ON_ONCE(refcount_read(&vma->vm_refcnt)); -} - -static inline void vma_mark_attached(struct vm_area_struct *vma) -{ - vma_assert_write_locked(vma); - vma_assert_detached(vma); - refcount_set_release(&vma->vm_refcnt, 1); -} - -void vma_mark_detached(struct vm_area_struct *vma); - static inline void release_fault_lock(struct vm_fault *vmf) { if (vmf->flags & FAULT_FLAG_VMA_LOCK) @@ -884,36 +691,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf) else mmap_assert_locked(vmf->vma->vm_mm); } - -struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, - unsigned long address); - -#else /* CONFIG_PER_VMA_LOCK */ - -static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {} -static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm, - struct vm_area_struct *vma) - { return NULL; } -static inline void vma_end_read(struct vm_area_struct *vma) {} -static inline void vma_start_write(struct vm_area_struct *vma) {} -static inline void vma_assert_write_locked(struct vm_area_struct *vma) - { mmap_assert_write_locked(vma->vm_mm); } -static inline void vma_assert_attached(struct vm_area_struct *vma) {} -static inline void vma_assert_detached(struct vm_area_struct *vma) {} -static inline void vma_mark_attached(struct vm_area_struct *vma) {} -static inline void vma_mark_detached(struct vm_area_struct *vma) {} - -static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, - unsigned long address) -{ - return NULL; -} - -static inline void vma_assert_locked(struct vm_area_struct *vma) -{ - mmap_assert_locked(vma->vm_mm); -} - +#else static inline void release_fault_lock(struct vm_fault *vmf) { mmap_read_unlock(vmf->vma->vm_mm); @@ -923,7 +701,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf) { mmap_assert_locked(vmf->vma->vm_mm); } - #endif /* CONFIG_PER_VMA_LOCK */ extern const struct vm_operations_struct vma_dummy_vm_ops; diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 4706c6769902..7983b2efe9bf 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -1,6 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MMAP_LOCK_H #define _LINUX_MMAP_LOCK_H +/* Avoid a dependency loop by declaring here. 
*/ +extern int rcuwait_wake_up(struct rcuwait *w); + #include #include #include @@ -104,6 +108,206 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int return read_seqcount_retry(&mm->mm_lock_seq, seq); } +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + static struct lock_class_key lockdep_key; + + lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0); +#endif + if (reset_refcnt) + refcount_set(&vma->vm_refcnt, 0); + vma->vm_lock_seq = UINT_MAX; +} + +static inline bool is_vma_writer_only(int refcnt) +{ + /* + * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma + * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on + * a detached vma happens only in vma_mark_detached() and is a rare + * case, therefore most of the time there will be no unnecessary wakeup. + */ + return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1; +} + +static inline void vma_refcount_put(struct vm_area_struct *vma) +{ + /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */ + struct mm_struct *mm = vma->vm_mm; + int oldcnt; + + rwsem_release(&vma->vmlock_dep_map, _RET_IP_); + if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) { + + if (is_vma_writer_only(oldcnt - 1)) + rcuwait_wake_up(&mm->vma_writer_wait); + } +} + +/* + * Try to read-lock a vma. The function is allowed to occasionally yield false + * locked result to avoid performance overhead, in which case we fall back to + * using mmap_lock. The function should never yield false unlocked result. + * False locked result is possible if mm_lock_seq overflows or if vma gets + * reused and attached to a different mm before we lock it. + * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got + * detached. + */ +static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm, + struct vm_area_struct *vma) +{ + int oldcnt; + + /* + * Check before locking. A race might cause false locked result. + * We can use READ_ONCE() for the mm_lock_seq here, and don't need + * ACQUIRE semantics, because this is just a lockless check whose result + * we don't rely on for anything - the mm_lock_seq read against which we + * need ordering is below. + */ + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) + return NULL; + + /* + * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire() + * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET. + * Acquire fence is required here to avoid reordering against later + * vm_lock_seq check and checks inside lock_vma_under_rcu(). + */ + if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt, + VMA_REF_LIMIT))) { + /* return EAGAIN if vma got detached from under us */ + return oldcnt ? NULL : ERR_PTR(-EAGAIN); + } + + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); + /* + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result. + * False unlocked result is impossible because we modify and check + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq + * modification invalidates all existing locks. + * + * We must use ACQUIRE semantics for the mm_lock_seq so that if we are + * racing with vma_end_write_all(), we only start reading from the VMA + * after it has been unlocked. + * This pairs with RELEASE semantics in vma_end_write_all(). 
+ */ + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) { + vma_refcount_put(vma); + return NULL; + } + + return vma; +} + +/* + * Use only while holding mmap read lock which guarantees that locking will not + * fail (nobody can concurrently write-lock the vma). vma_start_read() should + * not be used in such cases because it might fail due to mm_lock_seq overflow. + * This functionality is used to obtain vma read lock and drop the mmap read lock. + */ +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass) +{ + int oldcnt; + + mmap_assert_locked(vma->vm_mm); + if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt, + VMA_REF_LIMIT))) + return false; + + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_); + return true; +} + +/* + * Use only while holding mmap read lock which guarantees that locking will not + * fail (nobody can concurrently write-lock the vma). vma_start_read() should + * not be used in such cases because it might fail due to mm_lock_seq overflow. + * This functionality is used to obtain vma read lock and drop the mmap read lock. + */ +static inline bool vma_start_read_locked(struct vm_area_struct *vma) +{ + return vma_start_read_locked_nested(vma, 0); +} + +static inline void vma_end_read(struct vm_area_struct *vma) +{ + vma_refcount_put(vma); +} + +/* WARNING! Can only be used if mmap_lock is expected to be write-locked */ +static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq) +{ + mmap_assert_write_locked(vma->vm_mm); + + /* + * current task is holding mmap_write_lock, both vma->vm_lock_seq and + * mm->mm_lock_seq can't be concurrently modified. + */ + *mm_lock_seq = vma->vm_mm->mm_lock_seq.sequence; + return (vma->vm_lock_seq == *mm_lock_seq); +} + +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq); + +/* + * Begin writing to a VMA. + * Exclude concurrent readers under the per-VMA lock until the currently + * write-locked mmap_lock is dropped or downgraded. + */ +static inline void vma_start_write(struct vm_area_struct *vma) +{ + unsigned int mm_lock_seq; + + if (__is_vma_write_locked(vma, &mm_lock_seq)) + return; + + __vma_start_write(vma, mm_lock_seq); +} + +static inline void vma_assert_write_locked(struct vm_area_struct *vma) +{ + unsigned int mm_lock_seq; + + VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); +} + +static inline void vma_assert_locked(struct vm_area_struct *vma) +{ + unsigned int mm_lock_seq; + + VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 && + !__is_vma_write_locked(vma, &mm_lock_seq), vma); +} + +/* + * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these + * assertions should be made either under mmap_write_lock or when the object + * has been isolated under mmap_write_lock, ensuring no competing writers. 
+ */ +static inline void vma_assert_attached(struct vm_area_struct *vma) +{ + WARN_ON_ONCE(!refcount_read(&vma->vm_refcnt)); +} + +static inline void vma_assert_detached(struct vm_area_struct *vma) +{ + WARN_ON_ONCE(refcount_read(&vma->vm_refcnt)); +} + +static inline void vma_mark_attached(struct vm_area_struct *vma) +{ + vma_assert_write_locked(vma); + vma_assert_detached(vma); + refcount_set_release(&vma->vm_refcnt, 1); +} + +void vma_mark_detached(struct vm_area_struct *vma); + +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address); + #else /* CONFIG_PER_VMA_LOCK */ static inline void mm_lock_seqcount_init(struct mm_struct *mm) {} @@ -119,6 +323,29 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int { return true; } +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {} +static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm, + struct vm_area_struct *vma) + { return NULL; } +static inline void vma_end_read(struct vm_area_struct *vma) {} +static inline void vma_start_write(struct vm_area_struct *vma) {} +static inline void vma_assert_write_locked(struct vm_area_struct *vma) + { mmap_assert_write_locked(vma->vm_mm); } +static inline void vma_assert_attached(struct vm_area_struct *vma) {} +static inline void vma_assert_detached(struct vm_area_struct *vma) {} +static inline void vma_mark_attached(struct vm_area_struct *vma) {} +static inline void vma_mark_detached(struct vm_area_struct *vma) {} + +static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address) +{ + return NULL; +} + +static inline void vma_assert_locked(struct vm_area_struct *vma) +{ + mmap_assert_locked(vma->vm_mm); +} #endif /* CONFIG_PER_VMA_LOCK */ diff --git a/mm/memory.c b/mm/memory.c index 71c255f3fdcc..f18266b5a0a9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6378,258 +6378,6 @@ out: } EXPORT_SYMBOL_GPL(handle_mm_fault); -#ifdef CONFIG_LOCK_MM_AND_FIND_VMA -#include - -static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) -{ - if (likely(mmap_read_trylock(mm))) - return true; - - if (regs && !user_mode(regs)) { - unsigned long ip = exception_ip(regs); - if (!search_exception_tables(ip)) - return false; - } - - return !mmap_read_lock_killable(mm); -} - -static inline bool mmap_upgrade_trylock(struct mm_struct *mm) -{ - /* - * We don't have this operation yet. - * - * It should be easy enough to do: it's basically a - * atomic_long_try_cmpxchg_acquire() - * from RWSEM_READER_BIAS -> RWSEM_WRITER_LOCKED, but - * it also needs the proper lockdep magic etc. - */ - return false; -} - -static inline bool upgrade_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) -{ - mmap_read_unlock(mm); - if (regs && !user_mode(regs)) { - unsigned long ip = exception_ip(regs); - if (!search_exception_tables(ip)) - return false; - } - return !mmap_write_lock_killable(mm); -} - -/* - * Helper for page fault handling. - * - * This is kind of equivalent to "mmap_read_lock()" followed - * by "find_extend_vma()", except it's a lot more careful about - * the locking (and will drop the lock on failure). - * - * For example, if we have a kernel bug that causes a page - * fault, we don't want to just use mmap_read_lock() to get - * the mm lock, because that would deadlock if the bug were - * to happen while we're holding the mm lock for writing. 
- * - * So this checks the exception tables on kernel faults in - * order to only do this all for instructions that are actually - * expected to fault. - * - * We can also actually take the mm lock for writing if we - * need to extend the vma, which helps the VM layer a lot. - */ -struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, - unsigned long addr, struct pt_regs *regs) -{ - struct vm_area_struct *vma; - - if (!get_mmap_lock_carefully(mm, regs)) - return NULL; - - vma = find_vma(mm, addr); - if (likely(vma && (vma->vm_start <= addr))) - return vma; - - /* - * Well, dang. We might still be successful, but only - * if we can extend a vma to do so. - */ - if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) { - mmap_read_unlock(mm); - return NULL; - } - - /* - * We can try to upgrade the mmap lock atomically, - * in which case we can continue to use the vma - * we already looked up. - * - * Otherwise we'll have to drop the mmap lock and - * re-take it, and also look up the vma again, - * re-checking it. - */ - if (!mmap_upgrade_trylock(mm)) { - if (!upgrade_mmap_lock_carefully(mm, regs)) - return NULL; - - vma = find_vma(mm, addr); - if (!vma) - goto fail; - if (vma->vm_start <= addr) - goto success; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto fail; - } - - if (expand_stack_locked(vma, addr)) - goto fail; - -success: - mmap_write_downgrade(mm); - return vma; - -fail: - mmap_write_unlock(mm); - return NULL; -} -#endif - -#ifdef CONFIG_PER_VMA_LOCK -static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching) -{ - unsigned int tgt_refcnt = VMA_LOCK_OFFSET; - - /* Additional refcnt if the vma is attached. */ - if (!detaching) - tgt_refcnt++; - - /* - * If vma is detached then only vma_mark_attached() can raise the - * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached(). - */ - if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt)) - return false; - - rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_); - rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, - refcount_read(&vma->vm_refcnt) == tgt_refcnt, - TASK_UNINTERRUPTIBLE); - lock_acquired(&vma->vmlock_dep_map, _RET_IP_); - - return true; -} - -static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached) -{ - *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt); - rwsem_release(&vma->vmlock_dep_map, _RET_IP_); -} - -void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq) -{ - bool locked; - - /* - * __vma_enter_locked() returns false immediately if the vma is not - * attached, otherwise it waits until refcnt is indicating that vma - * is attached with no readers. - */ - locked = __vma_enter_locked(vma, false); - - /* - * We should use WRITE_ONCE() here because we can have concurrent reads - * from the early lockless pessimistic check in vma_start_read(). - * We don't really care about the correctness of that early check, but - * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy. - */ - WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); - - if (locked) { - bool detached; - - __vma_exit_locked(vma, &detached); - WARN_ON_ONCE(detached); /* vma should remain attached */ - } -} -EXPORT_SYMBOL_GPL(__vma_start_write); - -void vma_mark_detached(struct vm_area_struct *vma) -{ - vma_assert_write_locked(vma); - vma_assert_attached(vma); - - /* - * We are the only writer, so no need to use vma_refcount_put(). 
- * The condition below is unlikely because the vma has been already - * write-locked and readers can increment vm_refcnt only temporarily - * before they check vm_lock_seq, realize the vma is locked and drop - * back the vm_refcnt. That is a narrow window for observing a raised - * vm_refcnt. - */ - if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) { - /* Wait until vma is detached with no readers. */ - if (__vma_enter_locked(vma, true)) { - bool detached; - - __vma_exit_locked(vma, &detached); - WARN_ON_ONCE(!detached); - } - } -} - -/* - * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be - * stable and not isolated. If the VMA is not found or is being modified the - * function returns NULL. - */ -struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, - unsigned long address) -{ - MA_STATE(mas, &mm->mm_mt, address, address); - struct vm_area_struct *vma; - - rcu_read_lock(); -retry: - vma = mas_walk(&mas); - if (!vma) - goto inval; - - vma = vma_start_read(mm, vma); - if (IS_ERR_OR_NULL(vma)) { - /* Check if the VMA got isolated after we found it */ - if (PTR_ERR(vma) == -EAGAIN) { - count_vm_vma_lock_event(VMA_LOCK_MISS); - /* The area was replaced with another one */ - goto retry; - } - - /* Failed to lock the VMA */ - goto inval; - } - /* - * At this point, we have a stable reference to a VMA: The VMA is - * locked and we know it hasn't already been isolated. - * From here on, we can access the VMA without worrying about which - * fields are accessible for RCU readers. - */ - - /* Check if the vma we locked is the right one. */ - if (unlikely(vma->vm_mm != mm || - address < vma->vm_start || address >= vma->vm_end)) - goto inval_end_read; - - rcu_read_unlock(); - return vma; - -inval_end_read: - vma_end_read(vma); -inval: - rcu_read_unlock(); - count_vm_vma_lock_event(VMA_LOCK_ABORT); - return NULL; -} -#endif /* CONFIG_PER_VMA_LOCK */ - #ifndef __PAGETABLE_P4D_FOLDED /* * Allocate p4d page table. diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c index e7dbaf96aa17..5f725cc67334 100644 --- a/mm/mmap_lock.c +++ b/mm/mmap_lock.c @@ -42,3 +42,276 @@ void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write) } EXPORT_SYMBOL(__mmap_lock_do_trace_released); #endif /* CONFIG_TRACING */ + +#ifdef CONFIG_MMU +#ifdef CONFIG_PER_VMA_LOCK +static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching) +{ + unsigned int tgt_refcnt = VMA_LOCK_OFFSET; + + /* Additional refcnt if the vma is attached. */ + if (!detaching) + tgt_refcnt++; + + /* + * If vma is detached then only vma_mark_attached() can raise the + * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached(). + */ + if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt)) + return false; + + rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_); + rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, + refcount_read(&vma->vm_refcnt) == tgt_refcnt, + TASK_UNINTERRUPTIBLE); + lock_acquired(&vma->vmlock_dep_map, _RET_IP_); + + return true; +} + +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached) +{ + *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt); + rwsem_release(&vma->vmlock_dep_map, _RET_IP_); +} + +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq) +{ + bool locked; + + /* + * __vma_enter_locked() returns false immediately if the vma is not + * attached, otherwise it waits until refcnt is indicating that vma + * is attached with no readers. 
+ */ + locked = __vma_enter_locked(vma, false); + + /* + * We should use WRITE_ONCE() here because we can have concurrent reads + * from the early lockless pessimistic check in vma_start_read(). + * We don't really care about the correctness of that early check, but + * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy. + */ + WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); + + if (locked) { + bool detached; + + __vma_exit_locked(vma, &detached); + WARN_ON_ONCE(detached); /* vma should remain attached */ + } +} +EXPORT_SYMBOL_GPL(__vma_start_write); + +void vma_mark_detached(struct vm_area_struct *vma) +{ + vma_assert_write_locked(vma); + vma_assert_attached(vma); + + /* + * We are the only writer, so no need to use vma_refcount_put(). + * The condition below is unlikely because the vma has been already + * write-locked and readers can increment vm_refcnt only temporarily + * before they check vm_lock_seq, realize the vma is locked and drop + * back the vm_refcnt. That is a narrow window for observing a raised + * vm_refcnt. + */ + if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) { + /* Wait until vma is detached with no readers. */ + if (__vma_enter_locked(vma, true)) { + bool detached; + + __vma_exit_locked(vma, &detached); + WARN_ON_ONCE(!detached); + } + } +} + +/* + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be + * stable and not isolated. If the VMA is not found or is being modified the + * function returns NULL. + */ +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address) +{ + MA_STATE(mas, &mm->mm_mt, address, address); + struct vm_area_struct *vma; + + rcu_read_lock(); +retry: + vma = mas_walk(&mas); + if (!vma) + goto inval; + + vma = vma_start_read(mm, vma); + if (IS_ERR_OR_NULL(vma)) { + /* Check if the VMA got isolated after we found it */ + if (PTR_ERR(vma) == -EAGAIN) { + count_vm_vma_lock_event(VMA_LOCK_MISS); + /* The area was replaced with another one */ + goto retry; + } + + /* Failed to lock the VMA */ + goto inval; + } + /* + * At this point, we have a stable reference to a VMA: The VMA is + * locked and we know it hasn't already been isolated. + * From here on, we can access the VMA without worrying about which + * fields are accessible for RCU readers. + */ + + /* Check if the vma we locked is the right one. */ + if (unlikely(vma->vm_mm != mm || + address < vma->vm_start || address >= vma->vm_end)) + goto inval_end_read; + + rcu_read_unlock(); + return vma; + +inval_end_read: + vma_end_read(vma); +inval: + rcu_read_unlock(); + count_vm_vma_lock_event(VMA_LOCK_ABORT); + return NULL; +} +#endif /* CONFIG_PER_VMA_LOCK */ + +#ifdef CONFIG_LOCK_MM_AND_FIND_VMA +#include + +static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + if (likely(mmap_read_trylock(mm))) + return true; + + if (regs && !user_mode(regs)) { + unsigned long ip = exception_ip(regs); + if (!search_exception_tables(ip)) + return false; + } + + return !mmap_read_lock_killable(mm); +} + +static inline bool mmap_upgrade_trylock(struct mm_struct *mm) +{ + /* + * We don't have this operation yet. + * + * It should be easy enough to do: it's basically a + * atomic_long_try_cmpxchg_acquire() + * from RWSEM_READER_BIAS -> RWSEM_WRITER_LOCKED, but + * it also needs the proper lockdep magic etc. 
+ */ + return false; +} + +static inline bool upgrade_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + mmap_read_unlock(mm); + if (regs && !user_mode(regs)) { + unsigned long ip = exception_ip(regs); + if (!search_exception_tables(ip)) + return false; + } + return !mmap_write_lock_killable(mm); +} + +/* + * Helper for page fault handling. + * + * This is kind of equivalent to "mmap_read_lock()" followed + * by "find_extend_vma()", except it's a lot more careful about + * the locking (and will drop the lock on failure). + * + * For example, if we have a kernel bug that causes a page + * fault, we don't want to just use mmap_read_lock() to get + * the mm lock, because that would deadlock if the bug were + * to happen while we're holding the mm lock for writing. + * + * So this checks the exception tables on kernel faults in + * order to only do this all for instructions that are actually + * expected to fault. + * + * We can also actually take the mm lock for writing if we + * need to extend the vma, which helps the VM layer a lot. + */ +struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, + unsigned long addr, struct pt_regs *regs) +{ + struct vm_area_struct *vma; + + if (!get_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (likely(vma && (vma->vm_start <= addr))) + return vma; + + /* + * Well, dang. We might still be successful, but only + * if we can extend a vma to do so. + */ + if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) { + mmap_read_unlock(mm); + return NULL; + } + + /* + * We can try to upgrade the mmap lock atomically, + * in which case we can continue to use the vma + * we already looked up. + * + * Otherwise we'll have to drop the mmap lock and + * re-take it, and also look up the vma again, + * re-checking it. + */ + if (!mmap_upgrade_trylock(mm)) { + if (!upgrade_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (!vma) + goto fail; + if (vma->vm_start <= addr) + goto success; + if (!(vma->vm_flags & VM_GROWSDOWN)) + goto fail; + } + + if (expand_stack_locked(vma, addr)) + goto fail; + +success: + mmap_write_downgrade(mm); + return vma; + +fail: + mmap_write_unlock(mm); + return NULL; +} +#endif /* CONFIG_LOCK_MM_AND_FIND_VMA */ + +#else /* CONFIG_MMU */ + +/* + * At least xtensa ends up having protection faults even with no + * MMU.. No stack expansion, at least. + */ +struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, + unsigned long addr, struct pt_regs *regs) +{ + struct vm_area_struct *vma; + + mmap_read_lock(mm); + vma = vma_lookup(mm, addr); + if (!vma) + mmap_read_unlock(mm); + return vma; +} + +#endif /* CONFIG_MMU */ diff --git a/mm/nommu.c b/mm/nommu.c index 617e7ba8022f..2b4d304c6445 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -626,22 +626,6 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) } EXPORT_SYMBOL(find_vma); -/* - * At least xtensa ends up having protection faults even with no - * MMU.. No stack expansion, at least. 
- */
-struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
-			unsigned long addr, struct pt_regs *regs)
-{
-	struct vm_area_struct *vma;
-
-	mmap_read_lock(mm);
-	vma = vma_lookup(mm, addr);
-	if (!vma)
-		mmap_read_unlock(mm);
-	return vma;
-}
-
 /*
  * expand a stack to a given address
  * - not supported under NOMMU conditions

From 4a34c584d8cd13d2b721d21cf629f77c60bfb4a4 Mon Sep 17 00:00:00 2001
From: Dev Jain
Date: Wed, 16 Apr 2025 11:00:48 +0530
Subject: [PATCH 129/285] mempolicy: optimize queue_folios_pte_range by PTE batching

After the check for queue_folio_required(), the code only cares about the
folio in the for loop, i.e. the PTEs are redundant. Therefore, optimize
this loop by skipping over a PTE batch mapping the same folio.

With a test program migrating pages of the calling process, which includes
a mapped VMA of size 4GB with pte-mapped large folios of order-9, and
migrating once back and forth between node-0 and node-1, the average
execution time reduces from 7.5 to 4 seconds, giving an approximate 47%
speedup.

Link: https://lkml.kernel.org/r/20250416053048.96479-1-dev.jain@arm.com
Signed-off-by: Dev Jain
Acked-by: David Hildenbrand
Cc: Baolin Wang
Cc: Hugh Dickins
Cc: Matthew Wilcox (Oracle)
Cc: Ryan Roberts
Cc: Vishal Moola (Oracle)
Cc: Yang Shi
Cc: Zi Yan
Signed-off-by: Andrew Morton
---
 mm/mempolicy.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b28a1e6ae096..4d2dc8b63965 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -566,6 +566,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
+	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct vm_area_struct *vma = walk->vma;
 	struct folio *folio;
 	struct queue_pages *qp = walk->private;
@@ -573,6 +574,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte, *mapped_pte;
 	pte_t ptent;
 	spinlock_t *ptl;
+	int max_nr, nr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -586,7 +588,9 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
+	for (; addr != end; pte += nr, addr += nr * PAGE_SIZE) {
+		max_nr = (end - addr) >> PAGE_SHIFT;
+		nr = 1;
 		ptent = ptep_get(pte);
 		if (pte_none(ptent))
 			continue;
@@ -598,6 +602,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		folio = vm_normal_folio(vma, addr, ptent);
 		if (!folio || folio_is_zone_device(folio))
 			continue;
+		if (folio_test_large(folio) && max_nr != 1)
+			nr = folio_pte_batch(folio, addr, pte, ptent,
+					max_nr, fpb_flags,
+					NULL, NULL, NULL);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
@@ -630,7 +638,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || !vma_migratable(vma) || !migrate_folio_add(folio, qp->pagelist, flags)) { - qp->nr_failed++; + qp->nr_failed += nr; if (strictly_unmovable(flags)) break; } From b05f8d7e077952d14acb63e3ccdf5f64404b59a4 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 16 Apr 2025 13:27:59 +0900 Subject: [PATCH 130/285] Documentation: zram: update IDLE pages tracking documentation Move IDLE pages tracking into a separate chapter because there are multiple features that use (or depend on) it either in built-in variant ("mark all") or in extended variant (ac-time tracking). In addition, recompression doesn't require memory tracking to be enabled in order to be able to perform idle recompression. Link: https://lkml.kernel.org/r/20250416042833.3858827-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Reported-by: Shin Kawamura Cc: Jonathan Corbet Cc: Minchan Kim Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 41 +++++++++++---------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index b8d36134a151..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -317,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -331,24 +351,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -499,8 +502,6 @@ attempt to recompress::: echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress -Recompression of idle pages requires memory tracking. 
-
 During re-compression for every page, that matches re-compression criteria,
 ZRAM iterates the list of registered alternative compression algorithms in
 order of their priorities. ZRAM stops either when re-compression was

From 7a73348e5d4715b5565a53f21c01ea7b54e46cbd Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)"
Date: Thu, 17 Apr 2025 18:12:13 +0200
Subject: [PATCH 131/285] lib/test_vmalloc.c: replace RWSEM with SRCU for setup

The test has an initialization step during which threads are created. To
prevent the workers from starting prematurely, a write lock was previously
taken by the main setup thread, while each worker would block on a read
lock.

Replace this RWSEM-based synchronization with a simpler SRCU-based
approach, which does two basic steps:

- The main thread wraps the setup phase in an SRCU read-side critical
  section, a pair of srcu_read_lock()/srcu_read_unlock().
- Each worker calls synchronize_srcu() on entry, ensuring it waits for
  the initialization phase to be completed.

This patch eliminates the need for down_read()/up_read() and
down_write()/up_write() pairs, thus simplifying the logic and improving
clarity.

[urezki@gmail.com: fix compile error with CONFIG_TINY_RCU]
Link: https://lkml.kernel.org/r/20250420142029.103169-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20250417161216.88318-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Adrian Huang
Tested-by: Adrian Huang
Cc: Baoquan He
Cc: Christop Hellwig
Cc: Mateusz Guzik
Signed-off-by: Andrew Morton
---
 lib/test_vmalloc.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index f585949ff696..399751022eea 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -13,9 +13,9 @@
 #include
 #include
 #include
-#include
 #include
 #include
+#include
 #include

 #define __param(type, name, init, msg)	\
@@ -58,10 +58,9 @@ __param(int, run_test_mask, INT_MAX,
 );

 /*
- * Read write semaphore for synchronization of setup
- * phase that is done in main thread and workers.
+ * This is for synchronization of setup phase.
  */
-static DECLARE_RWSEM(prepare_for_test_rwsem);
+DEFINE_STATIC_SRCU(prepare_for_test_srcu);

 /*
  * Completion tracking for worker threads.
@@ -458,7 +457,7 @@ static int test_func(void *private)
	/*
	 * Block until initialization is done.
	 */
-	down_read(&prepare_for_test_rwsem);
+	synchronize_srcu(&prepare_for_test_srcu);

	t->start = get_cycles();
	for (i = 0; i < ARRAY_SIZE(test_case_array); i++) {
@@ -487,8 +486,6 @@ static int test_func(void *private)
			t->data[index].time = delta;
	}
	t->stop = get_cycles();
-
-	up_read(&prepare_for_test_rwsem);
	test_report_one_done();

	/*
@@ -526,7 +523,7 @@ init_test_configuration(void)
 static void do_concurrent_test(void)
 {
-	int i, ret;
+	int i, ret, idx;

	/*
	 * Set some basic configurations plus sanity check.
@@ -538,7 +535,7 @@ static void do_concurrent_test(void)
	/*
	 * Put on hold all workers.
	 */
-	down_write(&prepare_for_test_rwsem);
+	idx = srcu_read_lock(&prepare_for_test_srcu);

	for (i = 0; i < nr_threads; i++) {
		struct test_driver *t = &tdriver[i];
@@ -555,7 +552,7 @@ static void do_concurrent_test(void)
	/*
	 * Now let the workers do their job.
	 */
-	up_write(&prepare_for_test_rwsem);
+	srcu_read_unlock(&prepare_for_test_srcu, idx);

	/*
	 * Sleep quiet until all workers are done with 1 second

From 2d76e79315e403aab595d4c8830b7a46c19f0f3b Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)"
Date: Thu, 17 Apr 2025 18:12:14 +0200
Subject: [PATCH 132/285] lib/test_vmalloc.c: allow built-in execution

Remove the dependency on module loading ("m") for the vmalloc test suite,
enabling it to be built directly into the kernel, so both ("=m") and
("=y") are supported.

Motivation:

- Faster debugging/testing of vmalloc code;
- It allows configuring the test via kernel-boot parameters.

Configuration example:

  test_vmalloc.nr_threads=64
  test_vmalloc.run_test_mask=7
  test_vmalloc.sequential_test_order=1

Link: https://lkml.kernel.org/r/20250417161216.88318-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Baoquan He
Reviewed-by: Adrian Huang
Tested-by: Adrian Huang
Cc: Christop Hellwig
Cc: Mateusz Guzik
Signed-off-by: Andrew Morton
---
 lib/Kconfig.debug  | 3 +--
 lib/test_vmalloc.c | 5 +++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index f9051ab610d5..166b9d830a85 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2574,8 +2574,7 @@ config TEST_BITOPS
 config TEST_VMALLOC
	tristate "Test module for stress/performance analysis of vmalloc allocator"
	default n
-	depends on MMU
-	depends on m
+	depends on MMU
	help
	  This builds the "test_vmalloc" module that should be used for
	  stress and performance analysis. So, any new change for vmalloc

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 399751022eea..1b0b59549aaf 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -591,10 +591,11 @@ static void do_concurrent_test(void)
	kvfree(tdriver);
 }

-static int vmalloc_test_init(void)
+static int __init vmalloc_test_init(void)
 {
	do_concurrent_test();
-	return -EAGAIN; /* Fail will directly unload the module */
+	/* Fail will directly unload the module */
+	return IS_BUILTIN(CONFIG_TEST_VMALLOC) ? 0:-EAGAIN;
 }
 module_init(vmalloc_test_init)

From 7a6fe5877745e46795865e3f6a0ae346671741a3 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)"
Date: Thu, 17 Apr 2025 18:12:15 +0200
Subject: [PATCH 133/285] MAINTAINERS: add test_vmalloc.c to VMALLOC section

The vmalloc subsystem includes the "lib/test_vmalloc.c" test suite. Add
an "F:" entry under the VMALLOC section to track this file as part of the
subsystem.

Link: https://lkml.kernel.org/r/20250417161216.88318-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Baoquan He
Cc: Christop Hellwig
Cc: Mateusz Guzik
Signed-off-by: Andrew Morton
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 022a19cb9aa0..d8e6c1beef5c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25927,6 +25927,7 @@ W: http://www.linux-mm.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
 F:	include/linux/vmalloc.h
 F:	mm/vmalloc.c
+F:	lib/test_vmalloc.c

 VME SUBSYSTEM
 L:	linux-kernel@vger.kernel.org

From d0966120486833cd837feea9917b7ed06e74f58c Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)"
Date: Thu, 17 Apr 2025 18:12:16 +0200
Subject: [PATCH 134/285] vmalloc: align nr_vmalloc_pages and vmap_lazy_nr

Currently both atomics share one cache-line:

  ...
  ffffffff83eab400 b vmap_lazy_nr
  ffffffff83eab408 b nr_vmalloc_pages
  ...

those are global variables and they are only 8 bytes apart. Since they
are modified by different threads, this causes false sharing.
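For illustration only, a minimal sketch of the layout change this
describes, using hypothetical counter names (the actual change is in the
diff below):

	#include <linux/atomic.h>
	#include <linux/cache.h>

	/*
	 * Before: two adjacent counters typically land on one cache
	 * line, so updates from different CPUs keep stealing the line
	 * from each other:
	 *
	 *	static atomic_long_t counter_a;
	 *	static atomic_long_t counter_b;
	 *
	 * After: each counter starts on its own cache line, so writers
	 * on different CPUs no longer invalidate each other's line:
	 */
	static __cacheline_aligned_in_smp atomic_long_t counter_a;
	static __cacheline_aligned_in_smp atomic_long_t counter_b;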
This false sharing can lead to a performance drop due to unnecessary cache
invalidations. After this patch, each of them is aligned to a cache-line
boundary:

  ...
  ffffffff8260a600 d vmap_lazy_nr
  ffffffff8260a640 d nr_vmalloc_pages
  ...

Link: https://lkml.kernel.org/r/20250417161216.88318-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Baoquan He
Reviewed-by: Adrian Huang
Tested-by: Adrian Huang
Cc: Mateusz Guzik
Cc: Christop Hellwig
Signed-off-by: Andrew Morton
---
 mm/vmalloc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dc33ebeb8b1b..3fd802134e4e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1008,7 +1008,8 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static void drain_vmap_area_work(struct work_struct *work);
 static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

-static atomic_long_t nr_vmalloc_pages;
+static __cacheline_aligned_in_smp atomic_long_t nr_vmalloc_pages;
+static __cacheline_aligned_in_smp atomic_long_t vmap_lazy_nr;

 unsigned long vmalloc_nr_pages(void)
 {
@@ -2117,8 +2118,6 @@ static unsigned long lazy_max_pages(void)
	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
 }

-static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);
-
 /*
  * Serialize vmap purging. There is no actual critical section protected
  * by this lock, but we want to avoid concurrent calls for performance

From 6e14fd33f148cf36db9bcee7561f2b5b6c3b65aa Mon Sep 17 00:00:00 2001
From: Chen Ni
Date: Thu, 17 Apr 2025 16:43:30 +0800
Subject: [PATCH 135/285] mm: memcontrol: remove unnecessary NULL check before free_percpu()

free_percpu() checks for NULL pointers internally. Remove the unneeded
NULL check here.

Link: https://lkml.kernel.org/r/20250417084330.937380-1-nichen@iscas.ac.cn
Signed-off-by: Chen Ni
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Muchun Song
Cc: Roman Gushchin
Cc: Shakeel Butt
Signed-off-by: Andrew Morton
---
 mm/memcontrol-v1.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 4a9cf27a70af..54c49cbfc968 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2198,8 +2198,7 @@ bool memcg1_alloc_events(struct mem_cgroup *memcg)

 void memcg1_free_events(struct mem_cgroup *memcg)
 {
-	if (memcg->events_percpu)
-		free_percpu(memcg->events_percpu);
+	free_percpu(memcg->events_percpu);
 }

 static int __init memcg1_init(void)

From bb52e89d8bdbfdf896f6899a830fead3c9344a60 Mon Sep 17 00:00:00 2001
From: Rakie Kim
Date: Thu, 17 Apr 2025 16:28:35 +0900
Subject: [PATCH 136/285] mm/mempolicy: fix memory leaks in weighted interleave sysfs

Patch series "Enhance sysfs handling for memory hotplug in weighted
interleave", v9.

The following patch series enhances the weighted interleave policy in the
memory management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug support.

This patch (of 3):

Memory leaks occurred when removing sysfs attributes for weighted
interleave. Improper kobject deallocation led to unreleased memory when
initialization failed or when nodes were removed.

The risk of a leak is low because it only appears to trigger if setup
fails. Setup only fails due to -ENOMEM, which is unlikely to happen from
a late_initcall() when memory pressure is low.

This patch resolves the issue by replacing unnecessary `kfree()` calls
with proper `kobject_del()` and `kobject_put()` sequences, ensuring
correct teardown and preventing memory leaks.
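To illustrate the teardown ordering this relies on, a hedged sketch with
made-up names (not the patch's code), assuming the kobject was set up with
kobject_init_and_add(&obj->kobj, &example_ktype, parent, name):

	#include <linux/kobject.h>
	#include <linux/slab.h>

	struct example_obj {
		struct kobject kobj;
	};

	/* Runs once the last reference to the kobject is dropped. */
	static void example_release(struct kobject *kobj)
	{
		kfree(container_of(kobj, struct example_obj, kobj));
	}

	static const struct kobj_type example_ktype = {
		.release = example_release,
	};

	static void example_teardown(struct example_obj *obj)
	{
		/*
		 * Unlink from sysfs first, then drop the reference;
		 * ->release() frees the memory when the refcount hits
		 * zero, so no explicit kfree() is needed here.
		 */
		kobject_del(&obj->kobj);
		kobject_put(&obj->kobj);
	}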
By explicitly calling `kobject_del()` before `kobject_put()`, the release function is now invoked safely, and internal sysfs state is correctly cleaned up. This guarantees that the memory associated with the kobject is fully released and avoids resource leaks, thereby improving system stability. Additionally, sysfs_remove_file() is no longer called from the release function to avoid accessing invalid sysfs state after kobject_del(). All attribute removals are now done before kobject_del(), preventing WARN_ON() in kernfs and ensuring safe and consistent cleanup of sysfs entries. Link: https://lkml.kernel.org/r/20250417072839.711-1-rakie.kim@sk.com Link: https://lkml.kernel.org/r/20250417072839.711-2-rakie.kim@sk.com Fixes: dce41f5ae253 ("mm/mempolicy: implement the sysfs-based weighted_interleave interface") Signed-off-by: Rakie Kim Reviewed-by: Gregory Price Reviewed-by: Joshua Hahn Reviewed-by: Jonathan Cameron Reviewed-by: Dan Williams Cc: David Hildenbrand Cc: Honggyu Kim Cc: "Huang, Ying" Cc: Oscar Salvador Cc: Yunjeong Mun Cc: Dan Carpenter Signed-off-by: Andrew Morton --- mm/mempolicy.c | 127 ++++++++++++++++++++++++------------------------- 1 file changed, 62 insertions(+), 65 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4d2dc8b63965..b8c7cce61103 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -3471,8 +3471,8 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, static struct iw_node_attr **node_attrs; -static void sysfs_wi_node_release(struct iw_node_attr *node_attr, - struct kobject *parent) +static void sysfs_wi_node_delete(struct iw_node_attr *node_attr, + struct kobject *parent) { if (!node_attr) return; @@ -3481,18 +3481,42 @@ static void sysfs_wi_node_release(struct iw_node_attr *node_attr, kfree(node_attr); } -static void sysfs_wi_release(struct kobject *wi_kobj) +static void sysfs_wi_node_delete_all(struct kobject *wi_kobj) { - int i; + int nid; - for (i = 0; i < nr_node_ids; i++) - sysfs_wi_node_release(node_attrs[i], wi_kobj); - kobject_put(wi_kobj); + for (nid = 0; nid < nr_node_ids; nid++) + sysfs_wi_node_delete(node_attrs[nid], wi_kobj); +} + +static void iw_table_free(void) +{ + u8 *old; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + rcu_assign_pointer(iw_table, NULL); + mutex_unlock(&iw_table_lock); + + synchronize_rcu(); + kfree(old); +} + +static void wi_cleanup(struct kobject *wi_kobj) { + sysfs_wi_node_delete_all(wi_kobj); + iw_table_free(); + kfree(node_attrs); +} + +static void wi_kobj_release(struct kobject *wi_kobj) +{ + kfree(wi_kobj); } static const struct kobj_type wi_ktype = { .sysfs_ops = &kobj_sysfs_ops, - .release = sysfs_wi_release, + .release = wi_kobj_release, }; static int add_weight_node(int nid, struct kobject *wi_kobj) @@ -3533,85 +3557,58 @@ static int add_weighted_interleave_group(struct kobject *root_kobj) struct kobject *wi_kobj; int nid, err; - wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); - if (!wi_kobj) + node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), + GFP_KERNEL); + if (!node_attrs) return -ENOMEM; + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); + if (!wi_kobj) { + kfree(node_attrs); + return -ENOMEM; + } + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, "weighted_interleave"); - if (err) { - kfree(wi_kobj); - return err; - } + if (err) + goto err_put_kobj; for_each_node_state(nid, N_POSSIBLE) { err = add_weight_node(nid, wi_kobj); if (err) { pr_err("failed to add sysfs 
[node%d]\n", nid); - break; + goto err_cleanup_kobj; } } - if (err) - kobject_put(wi_kobj); + return 0; + +err_cleanup_kobj: + wi_cleanup(wi_kobj); + kobject_del(wi_kobj); +err_put_kobj: + kobject_put(wi_kobj); + return err; } -static void mempolicy_kobj_release(struct kobject *kobj) -{ - u8 *old; - - mutex_lock(&iw_table_lock); - old = rcu_dereference_protected(iw_table, - lockdep_is_held(&iw_table_lock)); - rcu_assign_pointer(iw_table, NULL); - mutex_unlock(&iw_table_lock); - synchronize_rcu(); - kfree(old); - kfree(node_attrs); - kfree(kobj); -} - -static const struct kobj_type mempolicy_ktype = { - .release = mempolicy_kobj_release -}; - static int __init mempolicy_sysfs_init(void) { int err; static struct kobject *mempolicy_kobj; - mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); - if (!mempolicy_kobj) { - err = -ENOMEM; - goto err_out; - } - - node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), - GFP_KERNEL); - if (!node_attrs) { - err = -ENOMEM; - goto mempol_out; - } - - err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, - "mempolicy"); - if (err) - goto node_out; + mempolicy_kobj = kobject_create_and_add("mempolicy", mm_kobj); + if (!mempolicy_kobj) + return -ENOMEM; err = add_weighted_interleave_group(mempolicy_kobj); - if (err) { - pr_err("mempolicy sysfs structure failed to initialize\n"); - kobject_put(mempolicy_kobj); - return err; - } + if (err) + goto err_kobj; - return err; -node_out: - kfree(node_attrs); -mempol_out: - kfree(mempolicy_kobj); -err_out: - pr_err("failed to add mempolicy kobject to the system\n"); + return 0; + +err_kobj: + kobject_del(mempolicy_kobj); + kobject_put(mempolicy_kobj); return err; } From cf8cecf2bc221cea99fadc2dc62cef2d4c970dff Mon Sep 17 00:00:00 2001 From: Rakie Kim Date: Thu, 17 Apr 2025 16:28:36 +0900 Subject: [PATCH 137/285] mm/mempolicy: prepare weighted interleave sysfs for memory hotplug Previously, the weighted interleave sysfs structure was statically managed during initialization. This prevented new nodes from being recognized when memory hotplug events occurred, limiting the ability to update or extend sysfs entries dynamically at runtime. To address this, this patch refactors the sysfs infrastructure and encapsulates it within a new structure, `sysfs_wi_group`, which holds both the kobject and an array of node attribute pointers. By allocating this group structure globally, the per-node sysfs attributes can be managed beyond initialization time, enabling external modules to insert or remove node entries in response to events such as memory hotplug or node online/offline transitions. Instead of allocating all per-node sysfs attributes at once, the initialization path now uses the existing sysfs_wi_node_add() and sysfs_wi_node_delete() helpers. This refactoring makes it possible to modularly manage per-node sysfs entries and ensures the infrastructure is ready for runtime extension. 
Link: https://lkml.kernel.org/r/20250417072839.711-3-rakie.kim@sk.com Signed-off-by: Rakie Kim Reviewed-by: Gregory Price Reviewed-by: Joshua Hahn Reviewed-by: Dan Williams Cc: David Hildenbrand Cc: Honggyu Kim Cc: "Huang, Ying" Cc: Jonathan Cameron Cc: Oscar Salvador Cc: Yunjeong Mun Cc: Dan Carpenter Signed-off-by: Andrew Morton --- mm/mempolicy.c | 64 ++++++++++++++++++++++++-------------------------- 1 file changed, 31 insertions(+), 33 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b8c7cce61103..c99bae7d8219 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -3427,6 +3427,13 @@ struct iw_node_attr { int nid; }; +struct sysfs_wi_group { + struct kobject wi_kobj; + struct iw_node_attr *nattrs[]; +}; + +static struct sysfs_wi_group *wi_group; + static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -3469,24 +3476,23 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, return count; } -static struct iw_node_attr **node_attrs; - -static void sysfs_wi_node_delete(struct iw_node_attr *node_attr, - struct kobject *parent) +static void sysfs_wi_node_delete(int nid) { - if (!node_attr) + if (!wi_group->nattrs[nid]) return; - sysfs_remove_file(parent, &node_attr->kobj_attr.attr); - kfree(node_attr->kobj_attr.attr.name); - kfree(node_attr); + + sysfs_remove_file(&wi_group->wi_kobj, + &wi_group->nattrs[nid]->kobj_attr.attr); + kfree(wi_group->nattrs[nid]->kobj_attr.attr.name); + kfree(wi_group->nattrs[nid]); } -static void sysfs_wi_node_delete_all(struct kobject *wi_kobj) +static void sysfs_wi_node_delete_all(void) { int nid; for (nid = 0; nid < nr_node_ids; nid++) - sysfs_wi_node_delete(node_attrs[nid], wi_kobj); + sysfs_wi_node_delete(nid); } static void iw_table_free(void) @@ -3503,15 +3509,14 @@ static void iw_table_free(void) kfree(old); } -static void wi_cleanup(struct kobject *wi_kobj) { - sysfs_wi_node_delete_all(wi_kobj); +static void wi_cleanup(void) { + sysfs_wi_node_delete_all(); iw_table_free(); - kfree(node_attrs); } static void wi_kobj_release(struct kobject *wi_kobj) { - kfree(wi_kobj); + kfree(wi_group); } static const struct kobj_type wi_ktype = { @@ -3519,7 +3524,7 @@ static const struct kobj_type wi_ktype = { .release = wi_kobj_release, }; -static int add_weight_node(int nid, struct kobject *wi_kobj) +static int sysfs_wi_node_add(int nid) { struct iw_node_attr *node_attr; char *name; @@ -3541,40 +3546,33 @@ static int add_weight_node(int nid, struct kobject *wi_kobj) node_attr->kobj_attr.store = node_store; node_attr->nid = nid; - if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { + if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) { kfree(node_attr->kobj_attr.attr.name); kfree(node_attr); pr_err("failed to add attribute to weighted_interleave\n"); return -ENOMEM; } - node_attrs[nid] = node_attr; + wi_group->nattrs[nid] = node_attr; return 0; } -static int add_weighted_interleave_group(struct kobject *root_kobj) +static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj) { - struct kobject *wi_kobj; int nid, err; - node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), - GFP_KERNEL); - if (!node_attrs) + wi_group = kzalloc(struct_size(wi_group, nattrs, nr_node_ids), + GFP_KERNEL); + if (!wi_group) return -ENOMEM; - wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); - if (!wi_kobj) { - kfree(node_attrs); - return -ENOMEM; - } - - err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, + err = 
kobject_init_and_add(&wi_group->wi_kobj, &wi_ktype, mempolicy_kobj, "weighted_interleave"); if (err) goto err_put_kobj; for_each_node_state(nid, N_POSSIBLE) { - err = add_weight_node(nid, wi_kobj); + err = sysfs_wi_node_add(nid); if (err) { pr_err("failed to add sysfs [node%d]\n", nid); goto err_cleanup_kobj; @@ -3584,10 +3582,10 @@ static int add_weighted_interleave_group(struct kobject *root_kobj) return 0; err_cleanup_kobj: - wi_cleanup(wi_kobj); - kobject_del(wi_kobj); + wi_cleanup(); + kobject_del(&wi_group->wi_kobj); err_put_kobj: - kobject_put(wi_kobj); + kobject_put(&wi_group->wi_kobj); return err; } From dec92bf95f5a710287b588cb203ff27abcd24706 Mon Sep 17 00:00:00 2001 From: Rakie Kim Date: Thu, 17 Apr 2025 16:28:37 +0900 Subject: [PATCH 138/285] mm/mempolicy: support memory hotplug in weighted interleave The weighted interleave policy distributes page allocations across multiple NUMA nodes based on their performance weight, thereby improving memory bandwidth utilization. The weight values for each node are configured through sysfs. Previously, sysfs entries for configuring weighted interleave were created for all possible nodes (N_POSSIBLE) at initialization, including nodes that might not have memory. However, not all nodes in N_POSSIBLE are usable at runtime, as some may remain memoryless or offline. This led to sysfs entries being created for unusable nodes, causing potential misconfiguration issues. To address this issue, this patch modifies the sysfs creation logic to: 1) Limit sysfs entries to nodes that are online and have memory, avoiding the creation of sysfs entries for nodes that cannot be used. 2) Support memory hotplug by dynamically adding and removing sysfs entries based on whether a node transitions into or out of the N_MEMORY state. Additionally, the patch ensures that sysfs attributes are properly managed when nodes go offline, preventing stale or redundant entries from persisting in the system. By making these changes, the weighted interleave policy now manages its sysfs entries more efficiently, ensuring that only relevant nodes are considered for interleaving, and dynamically adapting to memory hotplug events. 
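The diff below wires this up; as a generic, hedged sketch of the
memory-hotplug notifier pattern it uses (names are hypothetical):

	#include <linux/memory.h>
	#include <linux/notifier.h>

	static int example_node_notifier(struct notifier_block *nb,
					 unsigned long action, void *data)
	{
		struct memory_notify *arg = data;
		int nid = arg->status_change_nid;

		/* status_change_nid is < 0 if the node state is unchanged */
		if (nid < 0)
			return NOTIFY_OK;

		switch (action) {
		case MEM_ONLINE:
			/* create the per-node sysfs state here */
			break;
		case MEM_OFFLINE:
			/* tear the per-node sysfs state down here */
			break;
		}
		return NOTIFY_OK;
	}

	static int __init example_init(void)
	{
		hotplug_memory_notifier(example_node_notifier,
					DEFAULT_CALLBACK_PRI);
		return 0;
	}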
[dan.carpenter@linaro.org: fix error code in sysfs_wi_node_add()] Link: https://lkml.kernel.org/r/aBjL7Bwc0QBzgajK@stanley.mountain Link: https://lkml.kernel.org/r/20250417072839.711-4-rakie.kim@sk.com Co-developed-by: Honggyu Kim Signed-off-by: Honggyu Kim Co-developed-by: Yunjeong Mun Signed-off-by: Yunjeong Mun Signed-off-by: Rakie Kim Signed-off-by: Dan Carpenter Reviewed-by: Oscar Salvador Reviewed-by: Joshua Hahn Reviewed-by: Gregory Price Reviewed-by: Dan Williams Acked-by: David Hildenbrand Cc: "Huang, Ying" Cc: Jonathan Cameron Cc: Dan Carpenter Signed-off-by: Andrew Morton --- mm/mempolicy.c | 107 ++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 84 insertions(+), 23 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index c99bae7d8219..9a2b4b36f558 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -113,6 +113,7 @@ #include #include #include +#include #include "internal.h" @@ -3429,6 +3430,7 @@ struct iw_node_attr { struct sysfs_wi_group { struct kobject wi_kobj; + struct mutex kobj_lock; struct iw_node_attr *nattrs[]; }; @@ -3478,13 +3480,24 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, static void sysfs_wi_node_delete(int nid) { - if (!wi_group->nattrs[nid]) + struct iw_node_attr *attr; + + if (nid < 0 || nid >= nr_node_ids) return; - sysfs_remove_file(&wi_group->wi_kobj, - &wi_group->nattrs[nid]->kobj_attr.attr); - kfree(wi_group->nattrs[nid]->kobj_attr.attr.name); - kfree(wi_group->nattrs[nid]); + mutex_lock(&wi_group->kobj_lock); + attr = wi_group->nattrs[nid]; + if (!attr) { + mutex_unlock(&wi_group->kobj_lock); + return; + } + + wi_group->nattrs[nid] = NULL; + mutex_unlock(&wi_group->kobj_lock); + + sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr); + kfree(attr->kobj_attr.attr.name); + kfree(attr); } static void sysfs_wi_node_delete_all(void) @@ -3526,35 +3539,77 @@ static const struct kobj_type wi_ktype = { static int sysfs_wi_node_add(int nid) { - struct iw_node_attr *node_attr; + int ret; char *name; + struct iw_node_attr *new_attr; - node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); - if (!node_attr) + if (nid < 0 || nid >= nr_node_ids) { + pr_err("invalid node id: %d\n", nid); + return -EINVAL; + } + + new_attr = kzalloc(sizeof(*new_attr), GFP_KERNEL); + if (!new_attr) return -ENOMEM; name = kasprintf(GFP_KERNEL, "node%d", nid); if (!name) { - kfree(node_attr); + kfree(new_attr); return -ENOMEM; } - sysfs_attr_init(&node_attr->kobj_attr.attr); - node_attr->kobj_attr.attr.name = name; - node_attr->kobj_attr.attr.mode = 0644; - node_attr->kobj_attr.show = node_show; - node_attr->kobj_attr.store = node_store; - node_attr->nid = nid; + sysfs_attr_init(&new_attr->kobj_attr.attr); + new_attr->kobj_attr.attr.name = name; + new_attr->kobj_attr.attr.mode = 0644; + new_attr->kobj_attr.show = node_show; + new_attr->kobj_attr.store = node_store; + new_attr->nid = nid; - if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) { - kfree(node_attr->kobj_attr.attr.name); - kfree(node_attr); - pr_err("failed to add attribute to weighted_interleave\n"); - return -ENOMEM; + mutex_lock(&wi_group->kobj_lock); + if (wi_group->nattrs[nid]) { + mutex_unlock(&wi_group->kobj_lock); + ret = -EEXIST; + goto out; } - wi_group->nattrs[nid] = node_attr; + ret = sysfs_create_file(&wi_group->wi_kobj, &new_attr->kobj_attr.attr); + if (ret) { + mutex_unlock(&wi_group->kobj_lock); + goto out; + } + wi_group->nattrs[nid] = new_attr; + mutex_unlock(&wi_group->kobj_lock); return 0; + +out: + 
kfree(new_attr->kobj_attr.attr.name);
+	kfree(new_attr);
+	return ret;
+}
+
+static int wi_node_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	int err;
+	struct memory_notify *arg = data;
+	int nid = arg->status_change_nid;
+
+	if (nid < 0)
+		return NOTIFY_OK;
+
+	switch (action) {
+	case MEM_ONLINE:
+		err = sysfs_wi_node_add(nid);
+		if (err)
+			pr_err("failed to add sysfs for node%d during hotplug: %d\n",
+			       nid, err);
+		break;
+	case MEM_OFFLINE:
+		sysfs_wi_node_delete(nid);
+		break;
+	}
+
+	return NOTIFY_OK;
+}

 static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
@@ -3565,20 +3620,26 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
			GFP_KERNEL);
	if (!wi_group)
		return -ENOMEM;
+	mutex_init(&wi_group->kobj_lock);

	err = kobject_init_and_add(&wi_group->wi_kobj, &wi_ktype, mempolicy_kobj,
			"weighted_interleave");
	if (err)
		goto err_put_kobj;

-	for_each_node_state(nid, N_POSSIBLE) {
+	for_each_online_node(nid) {
+		if (!node_state(nid, N_MEMORY))
+			continue;
+
		err = sysfs_wi_node_add(nid);
		if (err) {
-			pr_err("failed to add sysfs [node%d]\n", nid);
+			pr_err("failed to add sysfs for node%d during init: %d\n",
+			       nid, err);
			goto err_cleanup_kobj;
		}
	}

+	hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
	return 0;

 err_cleanup_kobj:

From 0e1c773b501f33437d87b72c7d26080361e224b1 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sun, 20 Apr 2025 12:40:24 -0700
Subject: [PATCH 139/285] mm/damon/core: introduce damos quota goal metrics
 for memory node utilization

Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered
memory".

Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls. Let it self-tune hotness and coldness thresholds for
promotion and demotion, aiming at high utilization of high memory tiers,
by introducing new DAMOS quota goal metrics representing the used and the
free memory ratios of specific NUMA nodes. And introduce a sample DAMON
module that demonstrates how the new feature can be used for memory
tiering use cases.

Backgrounds
===========

A type of tiered memory system exposes the memory tiers as NUMA nodes. A
straightforward page placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, respectively,
pursuing higher utilization of upper tiers. Since access temperature can
be dynamic, periodically finding and migrating hot pages and cold pages
to proper tiers (promoting and demoting) is also required. The Linux
kernel provides several features for such dynamic and transparent page
placement.

Page Faults and LRU
-------------------

One widely known way is using NUMA balancing in tiering mode (a.k.a
NUMAB-2) and reclaim-based demotion features. In the setup, NUMAB-2 finds
hot pages using access check-purpose page faults (a.k.a prot_none) and
promotes those inside each process' context, until there are no more
pages to promote, or the upper tier is filled up and memory pressure
happens. In the latter case, LRU-based reclaim logic wakes up as a
response to the memory pressure and demotes cold pages to lower tiers in
asynchronous (kswapd) and/or synchronous ways (direct reclaim).

DAMON
-----

Yet another available solution is using DAMOS with migrate_hot and
migrate_cold DAMOS actions for promotions and demotions, respectively.
To make it optimum, users need to specify aggressiveness and access
temperature thresholds for promotions and demotions in a good balance that
results in high utilization of upper tiers. The number of parameters is
not small, and optimum parameter values depend on characteristics of the
underlying hardware and the workload. As a result, it often requires
manual, time-consuming and repetitive tuning of the DAMOS schemes for
given workload and system combinations.

Self-tuned DAMON-based Memory Tiering
=====================================

To solve such manual tuning problems, DAMOS provides aim-oriented
feedback-driven quotas self-tuning. Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.

For each memory tier node, if it has a lower tier, run a DAMOS scheme that
demotes cold pages of the node, auto-tuning the aggressiveness aiming at
an amount of free space of the node. The free space is for keeping the
headroom that avoids significant memory pressure during upper tier memory
usage spikes, and for promoting hot pages from the lower tier.

For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier, auto-tuning
the aggressiveness aiming at a high utilization ratio of the upper tier.
The target ratio is to ensure higher tiers are utilized as much as
possible. It should match the headroom of the demotion scheme, but have a
slight overlap, to ensure promotion and demotion are not entirely stopped.

The aim-oriented aggressiveness auto-tuning of DAMOS is already available.
Hence, to implement such a tiering solution, only new quota goal metrics
for the utilization and free space ratios of specific NUMA nodes need to
be developed.

Discussions
===========

The design raises the discussion points below.

Expected Behaviors
------------------

The system will let the upper tier memory node accommodate as much hot
data as possible. If the total amount of data is less than the top tier
memory's promotion/demotion target utilization, all data will simply be
placed on the top tier. The promotion scheme will do nothing since there
is no data to promote. The demotion scheme will also do nothing since the
free space ratio of the top tier is higher than the goal. Only if the
amount of data is larger than the top tier's utilization ratio will the
demotion scheme demote cold pages and ensure the headroom free space.

Since the promotion and demotion schemes for a single node have a small
overlap at their target utilization and free space goals, promotions and
demotions will continue working with a moderate aggressiveness level. It
will keep all data placed according to access hotness under dynamic access
patterns, while minimizing the migration overhead. In any case, each node
will keep headroom free space, and upper tiers will be utilized as much as
possible.

Ease of Use
-----------

Users still need to set the target utilization and free space ratio, but
it will be easier to set. We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values. They can be easily adjusted
based on the desired headroom size of a given use case. Users are also
still required to set the minimum coldness and hotness thresholds.
Together with monitoring intervals auto-tuning[2], DAMON will always show
a meaningful amount of hot and cold memory. And DAMOS quota's
prioritization mechanism will make good decisions as long as the source
information is that colorful. Hence, users can very naively set the
minimum criteria.
We believe any access observation and no access observation within the
last one aggregation interval are enough for the minimum hot and cold
region criteria.

General Tiered Memory Setup Applicability
-----------------------------------------

The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchical. Hence, applying the
system to a different tiered memory system will be straightforward. Note
that this assumes only the single CPU NUMA node case. Because today's
DAMON is not aware of which CPU made each access, applying this on systems
having multiple CPU NUMA nodes can be complicated. We are planning to
extend DAMON for the use case, but that's out of the scope of this patch
series.

How To Use
----------

Users can implement the auto-tuned DAMON-based memory tiering using the
DAMON sysfs interface. It can be easily done using a DAMON user-space
tool. The evaluation results section below shows an example DAMON
user-space tool command for that.

For wider and simpler deployment, having a kernel module that sets up and
runs the DAMOS schemes via the DAMON kernel API can be useful. The module
can enable the memory tiering at boot time via a kernel command line
parameter, or at run time with a single command. This patch series
implements a sample DAMON kernel module that shows how such a module can
be implemented.

Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------

The existing page-fault-based promotion (NUMAB-2) does hot page detection
and migration in the process context. When there are many pages to
promote, it can block the progress of the application's real work. DAMOS
works in an asynchronous worker thread, so it doesn't block the real work.

NUMAB-2 doesn't provide a way to control the aggressiveness of promotion
other than the maximum amount of pages to promote per given time window.
If hot pages are found, promotions can happen at the upper-bound speed,
regardless of the upper tier's memory pressure. If the maximum speed is
not well set for the given workload, it can result in slow promotion or
unnecessary memory pressure. Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on the current
utilization of the upper tier.

LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways. Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion have no
big difference. DAMON-based demotion can make a better balance with
DAMON-based promotion, though. The LRU-based demotion can do better than
DAMON-based demotion when the tier is under significant memory pressure.
It would be wise to use DAMON-based demotion as the proactive and primary
one, but to utilize LRU-based demotion together as a fast backup solution.

Evaluation
==========

In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot pages promotion improves
performance by about 4.42 %. We believe this shows self-tuned DAMON-based
promotion's effectiveness. Meanwhile, NUMAB-2's hot pages promotion
degrades the performance by about 7.34 %. We suspect the degradation is
mostly due to NUMAB-2's synchronous nature that can block the
application's progress, which highlights the advantage of the DAMON-based
solution's asynchronous nature.

Note that the test was done with the RFC version of this patch series.
We didn't run it again since this patch series got no meaningful change
after the RFC, while the test takes a pretty long time.

Setup
-----

Hardware. Use a machine equipped with a 250 GiB DRAM memory tier and a
50 GiB CXL memory tier. The tiers are exposed as NUMA nodes 0 and 1,
respectively.

Kernel. Use Linux kernel v6.13, modified as follows. Add all DAMON
patches that are available on the mm tree of 2025-03-15, and this patch
series. Also modify it to ignore mempolicy() system calls, to avoid bad
effects from the application's optimizations that assume traditional NUMA
systems.

Workload. Use a modified version of the Taobench benchmark[3] that is
available in the DCPerf benchmark suite. It represents an in-memory
caching workload. We set its 'memsize', 'warmup_time', and 'test_time'
parameters to 340 GiB, 2,500 seconds and 1,440 seconds. The parameters
are chosen to ensure the workload uses more than the DRAM memory tier.
Its RSS under these parameters grows to 270 GiB within the warmup time.

It turned out the workload has a very static access pattern. Only about
13 % of the RSS is frequently accessed from the beginning to the end.
Hence promotion shows no meaningful performance difference regardless of
different designs and implementations. We therefore modify the kernel to
periodically demote up to 10 GiB of hot pages and promote up to 10 GiB of
cold pages once per minute. The intention is to simulate periodic access
pattern changes. The hotness and coldness thresholds are very naively
set, so that it is more like a random access pattern change rather than a
strict hot/cold page exchange. This is why we call the workload
"modified". It is implemented as two DAMOS schemes, each running on an
asynchronous thread. It can be reproduced with the DAMON user-space tool
like below.

    # ./damo start \
        --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
        --damos_action migrate_hot 1 \
        --damos_quota_interval 60s --damos_quota_space 10G \
        --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
        --damos_action migrate_cold 0 \
        --damos_quota_interval 60s --damos_quota_space 10G \
        --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1

System configurations. Use the below variant system configurations.

- Baseline. No memory tiering features are turned on.

- Numab_tiering. On the baseline, enable NUMAB-2 and reclaim-based
  demotion. In detail, the following commands are executed:

    echo 2 > /proc/sys/kernel/numa_balancing;
    echo 1 > /sys/kernel/mm/numa/demotion_enabled;
    echo 7 > /proc/sys/vm/zone_reclaim_mode

- DAMON_tiering. On the baseline, utilize the self-tuned DAMON-based
  memory tiering implementation via the DAMON user-space tool. It
  utilizes two kernel threads, namely the promotion thread and the
  demotion thread. The demotion thread monitors the access pattern of the
  DRAM node using DAMON with auto-tuned monitoring intervals aiming at a
  4% DAMON-observed access ratio, and demotes the coldest pages up to
  200 MiB per second, aiming at 0.5% free space of the DRAM node. The
  promotion thread monitors the CXL node using the same intervals
  auto-tuning, and promotes hot pages in the same way, but aiming at 99.7%
  utilization of the DRAM node. Because DAMON provides only best-effort
  accuracy, add young page DAMOS filters that allow only young pages when
  promoting, and reject young pages when demoting. It can be reproduced
  with the DAMON user-space tool like below.
    # ./damo start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
        --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
        --damos_apply_interval 1s \
        --damos_quota_interval 1s --damos_quota_space 200MB \
        --damos_quota_goal node_mem_free_bp 0.5% 0 \
        --damos_filter reject young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
        --damos_action migrate_hot 0 --damos_access_rate 5% max \
        --damos_apply_interval 1s \
        --damos_quota_interval 1s --damos_quota_space 200MB \
        --damos_quota_goal node_mem_used_bp 99.7% 0 \
        --damos_filter allow young \
        --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
        --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1

Measurement Results
-------------------

On each system configuration, run the modified version of Taobench and
collect 'score'. 'score' is a metric calculated and provided by Taobench
to represent the performance of the run on the system. To handle
measurement errors, repeat the measurement five times. The results are as
below.

    Config         Score    Stdev    (%)      Normalized
    Baseline       1.6165   0.0319   1.9764   1.0000
    Numab_tiering  1.4976   0.0452   3.0209   0.9264
    DAMON_tiering  1.6881   0.0249   1.4767   1.0443

The 'Config' column shows the system config of the measurement. The
'Score' column shows the 'score' measurement, averaged over the five runs
on the system config. The 'Stdev' column shows the standard deviation of
the five measurements of the scores. The '(%)' column shows the 'Stdev'
to 'Score' ratio in percentage. Finally, the 'Normalized' column shows
the averaged score values of the configs, normalized to that of
'Baseline'.

The periodic hot pages demotion and cold pages promotion that was
conducted to simulate a dynamic access pattern was started from the
beginning of the workload. It resulted in the DRAM tier utilization
always being under the watermark, and hence no real demotion happened in
any of the test runs. This means the above results show no difference
between LRU-based and DAMON-based demotions. Only the difference between
NUMAB-2 and DAMON-based promotions is represented in the results.

The Numab_tiering config degraded the performance by about 7.36 %. We
suspect this happened because NUMAB-2's synchronous promotion was blocking
Taobench's real work progress.

The DAMON_tiering config improved the performance by about 4.43 %. We
believe this shows the effectiveness of DAMON-based promotion, which
didn't block Taobench's real work progress due to its asynchronous nature.
This also means DAMON's monitoring results are accurate enough to provide
a visible amount of improvement.

Evaluation Limitations
----------------------

As mentioned above, this evaluation shows only a comparison of promotion
mechanisms. DAMON-based tiering is recommended to be used together with
reclaim-based demotion as a faster backup under significant memory
pressure, though.

From some perspective, the modified version of Taobench may seem to
distort the picture too much. It would be better to evaluate with a more
realistic workload, or more finely tuned micro benchmarks.

Patch Sequence
==============

The first patch (patch 1) implements two new quota goal metrics on the
core layer and exposes them to the DAMON core kernel API. The second and
third ones (patches 2 and 3) further link them to the DAMON sysfs
interface. Three following patches (patches 4-6) document the new feature
and sysfs file on the design, usage, and ABI documents. The final one
(patch 7) implements a working version of a self-tuned DAMON-based memory
tiering solution in an incomplete but easy to understand form, as a kernel
module under the samples/damon/ directory.
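For a concrete sense of the basis-point (bp, 1/10,000) scale the new
metrics use, a worked example with illustrative numbers matching the DRAM
size of the setup above:

    node_mem_free_bp = free / total * 10000
                     = 1.25 GiB / 250 GiB * 10000 = 50     (0.5 % free)
    node_mem_used_bp = (total - free) / total * 10000
                     = 248.75 GiB / 250 GiB * 10000 = 9950 (99.5 % used)

The suggested defaults of 99.7 % utilization and 0.5 % free space thus
correspond to goals of 9970 bp and 50 bp; since 9970 + 50 exceeds 10000,
the two goals deliberately overlap slightly, so that promotion and
demotion never stop entirely, as discussed above.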
References ========== [1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/ [2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org [3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md This patch (of 7): Used and free space ratios for specific NUMA nodes can be useful inputs for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop. Implement DAMOS quota goal metrics for such self-tuned schemes. Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Yunjeong Mun Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- include/linux/damon.h | 6 ++++++ mm/damon/core.c | 27 +++++++++++++++++++++++++++ mm/damon/sysfs-schemes.c | 2 ++ 3 files changed, 35 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 47e36e6ea203..a4011726cb3b 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -145,6 +145,8 @@ enum damos_action { * * @DAMOS_QUOTA_USER_INPUT: User-input value. * @DAMOS_QUOTA_SOME_MEM_PSI_US: System level some memory PSI in us. + * @DAMOS_QUOTA_NODE_MEM_USED_BP: MemUsed ratio of a node. + * @DAMOS_QUOTA_NODE_MEM_FREE_BP: MemFree ratio of a node. * @NR_DAMOS_QUOTA_GOAL_METRICS: Number of DAMOS quota goal metrics. * * Metrics equal to larger than @NR_DAMOS_QUOTA_GOAL_METRICS are unsupported. @@ -152,6 +154,8 @@ enum damos_action { enum damos_quota_goal_metric { DAMOS_QUOTA_USER_INPUT, DAMOS_QUOTA_SOME_MEM_PSI_US, + DAMOS_QUOTA_NODE_MEM_USED_BP, + DAMOS_QUOTA_NODE_MEM_FREE_BP, NR_DAMOS_QUOTA_GOAL_METRICS, }; @@ -161,6 +165,7 @@ enum damos_quota_goal_metric { * @target_value: Target value of @metric to achieve with the tuning. * @current_value: Current value of @metric. * @last_psi_total: Last measured total PSI + * @nid: Node id. * @list: List head for siblings. * * Data structure for getting the current score of the quota tuning goal. 
The @@ -179,6 +184,7 @@ struct damos_quota_goal { /* metric-dependent fields */ union { u64 last_psi_total; + int nid; }; struct list_head list; }; diff --git a/mm/damon/core.c b/mm/damon/core.c index f0c1676f0599..587fb9a4fef8 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1889,6 +1889,29 @@ static inline u64 damos_get_some_mem_psi_total(void) #endif /* CONFIG_PSI */ +#ifdef CONFIG_NUMA +static __kernel_ulong_t damos_get_node_mem_bp( + struct damos_quota_goal *goal) +{ + struct sysinfo i; + __kernel_ulong_t numerator; + + si_meminfo_node(&i, goal->nid); + if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP) + numerator = i.totalram - i.freeram; + else /* DAMOS_QUOTA_NODE_MEM_FREE_BP */ + numerator = i.freeram; + return numerator * 10000 / i.totalram; +} +#else +static __kernel_ulong_t damos_get_node_mem_bp( + struct damos_quota_goal *goal) +{ + return 0; +} +#endif + + static void damos_set_quota_goal_current_value(struct damos_quota_goal *goal) { u64 now_psi_total; @@ -1902,6 +1925,10 @@ static void damos_set_quota_goal_current_value(struct damos_quota_goal *goal) goal->current_value = now_psi_total - goal->last_psi_total; goal->last_psi_total = now_psi_total; break; + case DAMOS_QUOTA_NODE_MEM_USED_BP: + case DAMOS_QUOTA_NODE_MEM_FREE_BP: + goal->current_value = damos_get_node_mem_bp(goal); + break; default: break; } diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 23b562df0839..98108f082178 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -942,6 +942,8 @@ struct damos_sysfs_quota_goal { static const char * const damos_sysfs_quota_goal_metric_strs[] = { "user_input", "some_mem_psi_us", + "node_mem_used_bp", + "node_mem_free_bp", }; static struct damos_sysfs_quota_goal *damos_sysfs_quota_goal_alloc(void) From 0fbd59379d8fc8b04f3f7c81fd44fe7c0d33b077 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Apr 2025 12:40:25 -0700 Subject: [PATCH 140/285] mm/damon/sysfs-schemes: implement file for quota goal nid parameter DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP DAMOS quota goal metrics require the node id parameter. However, there is no DAMON user ABI for setting it. Implement a DAMON sysfs file for that with name 'nid', under the quota goal directory. Link: https://lkml.kernel.org/r/20250420194030.75838-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Yunjeong Mun Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 98108f082178..7681ed293b62 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -936,6 +936,7 @@ struct damos_sysfs_quota_goal { enum damos_quota_goal_metric metric; unsigned long target_value; unsigned long current_value; + int nid; }; /* This should match with enum damos_action */ @@ -1016,6 +1017,28 @@ static ssize_t current_value_store(struct kobject *kobj, return err ? 
err : count;
 }

+static ssize_t nid_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damos_sysfs_quota_goal *goal = container_of(kobj, struct
+			damos_sysfs_quota_goal, kobj);
+
+	/* todo: return error if the goal is not using nid */
+
+	return sysfs_emit(buf, "%d\n", goal->nid);
+}
+
+static ssize_t nid_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damos_sysfs_quota_goal *goal = container_of(kobj, struct
+			damos_sysfs_quota_goal, kobj);
+	int err = kstrtoint(buf, 0, &goal->nid);
+
+	/* feed callback should check existence of this file and read value */
+	return err ? err : count;
+}
+
 static void damos_sysfs_quota_goal_release(struct kobject *kobj)
 {
	/* or, notify this release to the feed callback */
@@ -1031,10 +1054,14 @@ static struct kobj_attribute damos_sysfs_quota_goal_target_value_attr =
 static struct kobj_attribute damos_sysfs_quota_goal_current_value_attr =
		__ATTR_RW_MODE(current_value, 0600);

+static struct kobj_attribute damos_sysfs_quota_goal_nid_attr =
+		__ATTR_RW_MODE(nid, 0600);
+
 static struct attribute *damos_sysfs_quota_goal_attrs[] = {
	&damos_sysfs_quota_goal_target_metric_attr.attr,
	&damos_sysfs_quota_goal_target_value_attr.attr,
	&damos_sysfs_quota_goal_current_value_attr.attr,
+	&damos_sysfs_quota_goal_nid_attr.attr,
	NULL,
 };
 ATTRIBUTE_GROUPS(damos_sysfs_quota_goal);

From 85fcf0ffc4606a52192800bc9f2a5014a096b96f Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sun, 20 Apr 2025 12:40:26 -0700
Subject: [PATCH 141/285] mm/damon/sysfs-schemes: connect damos_quota_goal nid
 with core layer

The DAMON sysfs interface file for the DAMOS quota goal's node id argument
is not passed to the core layer. Implement the link.

Link: https://lkml.kernel.org/r/20250420194030.75838-4-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Cc: Yunjeong Mun
Signed-off-by: Andrew Morton
---
 mm/damon/sysfs-schemes.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 7681ed293b62..729fe5f1ef30 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -2149,8 +2149,17 @@ static int damos_sysfs_add_quota_score(
				sysfs_goal->target_value);
		if (!goal)
			return -ENOMEM;
-		if (sysfs_goal->metric == DAMOS_QUOTA_USER_INPUT)
+		switch (sysfs_goal->metric) {
+		case DAMOS_QUOTA_USER_INPUT:
			goal->current_value = sysfs_goal->current_value;
+			break;
+		case DAMOS_QUOTA_NODE_MEM_USED_BP:
+		case DAMOS_QUOTA_NODE_MEM_FREE_BP:
+			goal->nid = sysfs_goal->nid;
+			break;
+		default:
+			break;
+		}
		damos_add_quota_goal(quota, goal);
	}
	return 0;

From b3b95a3594530643972220a70d738db96544be45 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sun, 20 Apr 2025 12:40:27 -0700
Subject: [PATCH 142/285] Docs/mm/damon/design: document node_mem_{used,free}_bp

Add a description of the DAMOS quota goal metrics for NUMA node
utilization on the DAMON design document.

Link: https://lkml.kernel.org/r/20250420194030.75838-5-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Cc: Yunjeong Mun
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/design.rst | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index f12d33749329..728bf5754343 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -550,10 +550,10 @@ aggressiveness (the quota) of the corresponding scheme.
For example, if DAMOS is under achieving the goal, DAMOS automatically increases the quota. If DAMOS is over achieving the goal, it decreases the quota. -The goal can be specified with three parameters, namely ``target_metric``, -``target_value``, and ``current_value``. The auto-tuning mechanism tries to -make ``current_value`` of ``target_metric`` be same to ``target_value``. -Currently, two ``target_metric`` are provided. +The goal can be specified with four parameters, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. The auto-tuning mechanism +tries to make ``current_value`` of ``target_metric`` be same to +``target_value``. - ``user_input``: User-provided value. Users could use any metric that they has interest in for the value. Use space main workload's latency or @@ -565,6 +565,11 @@ Currently, two ``target_metric`` are provided. in microseconds that measured from last quota reset to next quota reset. DAMOS does the measurement on its own, so only ``target_value`` need to be set by users at the initial time. In other words, DAMOS does self-feedback. +- ``node_mem_used_bp``: Specific NUMA node's used memory ratio in bp (1/10,000). +- ``node_mem_free_bp``: Specific NUMA node's free memory ratio in bp (1/10,000). + +``nid`` is optionally required for only ``node_mem_used_bp`` and +``node_mem_free_bp`` to point to the specific NUMA node. To know how user-space can set the tuning goal metric, the target value, and/or the current value via :ref:`DAMON sysfs interface `, refer to From a7bb1e754559dc1de14c1927c39c5295ab363e01 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Apr 2025 12:40:28 -0700 Subject: [PATCH 143/285] Docs/admin-guide/mm/damon/usage: document 'nid' file Add a description of the 'nid' file, which is optionally used for specific DAMOS quota goal metrics such as node_mem_{used,free}_bp, to the DAMON usage document. Link: https://lkml.kernel.org/r/20250420194030.75838-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Yunjeong Mun Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ced2013db3df..d960aba72b82 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -81,7 +81,7 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas `/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals `/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks `/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters `/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max @@ -390,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc ` by writing to and reading from each -of the files.
Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc ` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory ` to pass the feedback to DAMON. From f77cb462261bfe55fb0ae33eb911e3cad7694880 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Apr 2025 12:40:29 -0700 Subject: [PATCH 144/285] Docs/ABI/damon: document nid file Add a description of the 'nid' file, which is optionally used for specific DAMOS quota goal metrics such as node_mem_{used,free}_bp, to the DAMON sysfs ABI document. Link: https://lkml.kernel.org/r/20250420194030.75838-7-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Yunjeong Mun Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-damon | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index 293197f180ad..5697ab154c1f 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -283,6 +283,12 @@ Contact: SeongJae Park Description: Writing to and reading from this file sets and gets the current value of the goal metric. +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//quotas/goals//nid +Date: Apr 2025 +Contact: SeongJae Park +Description: Writing to and reading from this file sets and gets the nid + parameter of the goal. + What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//quotas/weights/sz_permil Date: Mar 2022 Contact: SeongJae Park From 82a08bde3cf7bbbe90de57baa181bebf676582c7 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 20 Apr 2025 12:40:30 -0700 Subject: [PATCH 145/285] samples/damon: implement a DAMON module for memory tiering Implement a sample DAMON module that shows how self-tuned DAMON-based memory tiering can be written. It is a sample, since the purpose is to give an idea of how it can be implemented and how it performs, rather than to be used on general production setups. In particular, it supports only a two-tier memory setup with a single CPU-attached NUMA node.
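For a rough idea of how the sample would be driven (an editorial illustration, not from this patch): since the Kconfig symbol added below is a bool, the code is built in, so one would set the physical address ranges of the two nodes through the node0_{start,end}_addr and node1_{start,end}_addr parameters and then write Y to the 'enable' parameter, presumably under /sys/module/mtier/parameters/; the 'mtier' directory name is an assumption based on the object file name rather than something this patch states.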
[sj@kernel.org: fix wrong DAMON attrs setting] Link: https://lkml.kernel.org/r/20250510220932.47722-1-sj@kernel.org [sj@kernel.org: trigger build even if only mtier is enabled] Link: https://lkml.kernel.org/r/20250426184054.11437-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250420194030.75838-8-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Yunjeong Mun Signed-off-by: Andrew Morton --- samples/Makefile | 1 + samples/damon/Kconfig | 13 +++ samples/damon/Makefile | 1 + samples/damon/mtier.c | 178 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 193 insertions(+) create mode 100644 samples/damon/mtier.c diff --git a/samples/Makefile b/samples/Makefile index bf6e6fca5410..0545e6a0e84d 100644 --- a/samples/Makefile +++ b/samples/Makefile @@ -42,4 +42,5 @@ obj-$(CONFIG_SAMPLE_FPROBE) += fprobe/ obj-$(CONFIG_SAMPLES_RUST) += rust/ obj-$(CONFIG_SAMPLE_DAMON_WSSE) += damon/ obj-$(CONFIG_SAMPLE_DAMON_PRCL) += damon/ +obj-$(CONFIG_SAMPLE_DAMON_MTIER) += damon/ obj-$(CONFIG_SAMPLE_HUNG_TASK) += hung_task/ diff --git a/samples/damon/Kconfig b/samples/damon/Kconfig index 564c49ed69a2..cbf96fd8a8bf 100644 --- a/samples/damon/Kconfig +++ b/samples/damon/Kconfig @@ -27,4 +27,17 @@ config SAMPLE_DAMON_PRCL If unsure, say N. +config SAMPLE_DAMON_MTIER + bool "DAMON sample module for memory tiering" + depends on DAMON && DAMON_PADDR + help + This builds the DAMON sample module for memory tiering. + + The module assumes the system is constructed with two NUMA nodes, + which seem to be local and remote nodes to all CPUs. For example, + node0 is for DDR5 DRAMs connected via DIMM, while node1 is for DDR4 + DRAMs connected via CXL. + + If unsure, say N. + endmenu diff --git a/samples/damon/Makefile b/samples/damon/Makefile index 7f155143f237..72f68cbf422a 100644 --- a/samples/damon/Makefile +++ b/samples/damon/Makefile @@ -2,3 +2,4 @@ obj-$(CONFIG_SAMPLE_DAMON_WSSE) += wsse.o obj-$(CONFIG_SAMPLE_DAMON_PRCL) += prcl.o +obj-$(CONFIG_SAMPLE_DAMON_MTIER) += mtier.o diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c new file mode 100644 index 000000000000..36d2cd933f5a --- /dev/null +++ b/samples/damon/mtier.c @@ -0,0 +1,178 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * memory tiering: migrate cold pages in node 0 and hot pages in node 1 to node + * 1 and node 0, respectively. Adjust the hotness/coldness threshold aiming + * at a resulting 99.6 % node 0 utilization ratio.
+ */ + +#define pr_fmt(fmt) "damon_sample_mtier: " fmt + +#include +#include +#include +#include + +static unsigned long node0_start_addr __read_mostly; +module_param(node0_start_addr, ulong, 0600); + +static unsigned long node0_end_addr __read_mostly; +module_param(node0_end_addr, ulong, 0600); + +static unsigned long node1_start_addr __read_mostly; +module_param(node1_start_addr, ulong, 0600); + +static unsigned long node1_end_addr __read_mostly; +module_param(node1_end_addr, ulong, 0600); + +static int damon_sample_mtier_enable_store( + const char *val, const struct kernel_param *kp); + +static const struct kernel_param_ops enable_param_ops = { + .set = damon_sample_mtier_enable_store, + .get = param_get_bool, +}; + +static bool enable __read_mostly; +module_param_cb(enable, &enable_param_ops, &enable, 0600); +MODULE_PARM_DESC(enable, "Enable or disable DAMON_SAMPLE_MTIER"); + +static struct damon_ctx *ctxs[2]; + +static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote) +{ + struct damon_ctx *ctx; + struct damon_attrs attrs; + struct damon_target *target; + struct damon_region *region; + struct damos *scheme; + struct damos_quota_goal *quota_goal; + struct damos_filter *filter; + + ctx = damon_new_ctx(); + if (!ctx) + return NULL; + attrs = (struct damon_attrs) { + .sample_interval = 5 * USEC_PER_MSEC, + .aggr_interval = 100 * USEC_PER_MSEC, + .ops_update_interval = 60 * USEC_PER_MSEC * MSEC_PER_SEC, + .min_nr_regions = 10, + .max_nr_regions = 1000, + }; + + /* + * auto-tune sampling and aggregation interval aiming at a 4% + * DAMON-observed access ratio, keeping the sampling interval in the + * [5ms, 10s] range. + */ + attrs.intervals_goal = (struct damon_intervals_goal) { + .access_bp = 400, .aggrs = 3, + .min_sample_us = 5000, .max_sample_us = 10000000, + }; + if (damon_set_attrs(ctx, &attrs)) + goto free_out; + if (damon_select_ops(ctx, DAMON_OPS_PADDR)) + goto free_out; + + target = damon_new_target(); + if (!target) + goto free_out; + damon_add_target(ctx, target); + region = damon_new_region( + promote ? node1_start_addr : node0_start_addr, + promote ? node1_end_addr : node0_end_addr); + if (!region) + goto free_out; + damon_add_region(region, target); + + scheme = damon_new_scheme( + /* access pattern */ + &(struct damos_access_pattern) { + .min_sz_region = PAGE_SIZE, + .max_sz_region = ULONG_MAX, + .min_nr_accesses = promote ? 1 : 0, + .max_nr_accesses = promote ? UINT_MAX : 0, + .min_age_region = 0, + .max_age_region = UINT_MAX}, + /* action */ + promote ? DAMOS_MIGRATE_HOT : DAMOS_MIGRATE_COLD, + 1000000, /* apply interval (1s) */ + &(struct damos_quota){ + /* 200 MiB per sec at most */ + .reset_interval = 1000, + .sz = 200 * 1024 * 1024, + /* ignore size of region when prioritizing */ + .weight_sz = 0, + .weight_nr_accesses = 100, + .weight_age = 100, + }, + &(struct damos_watermarks){}, + promote ? 0 : 1); /* migrate target node id */ + if (!scheme) + goto free_out; + damon_set_schemes(ctx, &scheme, 1); + quota_goal = damos_new_quota_goal( + promote ? DAMOS_QUOTA_NODE_MEM_USED_BP : + DAMOS_QUOTA_NODE_MEM_FREE_BP, + promote ?
9970 : 50); + if (!quota_goal) + goto free_out; + quota_goal->nid = 0; + damos_add_quota_goal(&scheme->quota, quota_goal); + filter = damos_new_filter(DAMOS_FILTER_TYPE_YOUNG, true, promote); + if (!filter) + goto free_out; + damos_add_filter(scheme, filter); + return ctx; +free_out: + damon_destroy_ctx(ctx); + return NULL; +} + +static int damon_sample_mtier_start(void) +{ + struct damon_ctx *ctx; + + ctx = damon_sample_mtier_build_ctx(true); + if (!ctx) + return -ENOMEM; + ctxs[0] = ctx; + ctx = damon_sample_mtier_build_ctx(false); + if (!ctx) { + damon_destroy_ctx(ctxs[0]); + return -ENOMEM; + } + ctxs[1] = ctx; + return damon_start(ctxs, 2, true); +} + +static void damon_sample_mtier_stop(void) +{ + damon_stop(ctxs, 2); + damon_destroy_ctx(ctxs[0]); + damon_destroy_ctx(ctxs[1]); +} + +static int damon_sample_mtier_enable_store( + const char *val, const struct kernel_param *kp) +{ + bool enabled = enable; + int err; + + err = kstrtobool(val, &enable); + if (err) + return err; + + if (enable == enabled) + return 0; + + if (enable) + return damon_sample_mtier_start(); + damon_sample_mtier_stop(); + return 0; +} + +static int __init damon_sample_mtier_init(void) +{ + return 0; +} + +module_init(damon_sample_mtier_init); From 09b988a3826eb758d4b36dec4a9b23c18c209b34 Mon Sep 17 00:00:00 2001 From: Prabhav Kumar Vaish Date: Sun, 20 Apr 2025 19:34:40 +0530 Subject: [PATCH 146/285] mm: fix typos in comments in mm_init.c Corrected minor typos in comments: - 'contigious' -> 'contiguous' - 'hierarcy' -> 'hierarchy' This is a non-functional change in comment text only. Link: https://lkml.kernel.org/r/20250420140440.18817-1-pvkumar5749404@gmail.com Signed-off-by: Prabhav Kumar Vaish Reviewed-by: Mike Rapoport (Microsoft) Signed-off-by: Andrew Morton --- mm/mm_init.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index eedce9321e13..c275ae561b6f 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -828,7 +828,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn) * - physical memory bank size is not necessarily the exact multiple of the * arbitrary section size * - early reserved memory may not be listed in memblock.memory - * - non-memory regions covered by the contigious flatmem mapping + * - non-memory regions covered by the contiguous flatmem mapping * - memory layouts defined with memmap= kernel parameter may not align * nicely with memmap sections * @@ -1907,7 +1907,7 @@ void __init free_area_init(unsigned long *max_zone_pfn) free_area_init_node(nid); /* - * No sysfs hierarcy will be created via register_one_node() + * No sysfs hierarchy will be created via register_one_node() *for memory-less node because here it's not marked as N_MEMORY *and won't be set online later. The benefit is userspace *program won't be confused by sysfs files/directories of From 786d5cc2b92ac331d0654452c6c3cea611772e09 Mon Sep 17 00:00:00 2001 From: "Christoph Lameter (Ampere)" Date: Mon, 21 Apr 2025 13:58:06 -0700 Subject: [PATCH 147/285] Update Christoph's Email address and make it consistent Use cl@gentwo.org throughout and remove the old email addresses. 
Link: https://lkml.kernel.org/r/8b962f57-4d98-cbb0-cd82-b6ba456733e8@gentwo.org Signed-off-by: Christoph Lameter Signed-off-by: Andrew Morton --- CREDITS | 2 +- Documentation/ABI/testing/sysfs-kernel-slab | 96 +++++++++---------- .../admin-guide/cgroup-v1/cgroups.rst | 2 +- .../admin-guide/cgroup-v1/cpusets.rst | 2 +- Documentation/networking/arcnet-hardware.rst | 2 +- MAINTAINERS | 4 +- include/linux/percpu-defs.h | 2 +- mm/mmu_notifier.c | 2 +- mm/slab_common.c | 2 +- mm/vmstat.c | 2 +- 10 files changed, 58 insertions(+), 58 deletions(-) diff --git a/CREDITS b/CREDITS index f74d230992d6..45446ae322ec 100644 --- a/CREDITS +++ b/CREDITS @@ -2336,7 +2336,7 @@ D: Author of the dialog utility, foundation D: for Menuconfig's lxdialog. N: Christoph Lameter -E: christoph@lameter.com +E: cl@gentwo.org D: Digiboard PC/Xe and PC/Xi, Digiboard EPCA D: NUMA support, Slab allocators, Page migration D: Scalability, Time subsystem diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index cd5fb8fa3ddf..658999be5164 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -2,7 +2,7 @@ What: /sys/kernel/slab Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The /sys/kernel/slab directory contains a snapshot of the internal state of the SLUB allocator for each cache. Certain @@ -14,7 +14,7 @@ What: /sys/kernel/slab//aliases Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The aliases file is read-only and specifies how many caches have merged into this cache. @@ -23,7 +23,7 @@ What: /sys/kernel/slab//align Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The align file is read-only and specifies the cache's object alignment in bytes. @@ -32,7 +32,7 @@ What: /sys/kernel/slab//alloc_calls Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_calls file is read-only and lists the kernel code locations from which allocations for this cache were performed. @@ -43,7 +43,7 @@ What: /sys/kernel/slab//alloc_fastpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_fastpath file shows how many objects have been allocated using the fast path. It can be written to clear the @@ -54,7 +54,7 @@ What: /sys/kernel/slab//alloc_from_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_from_partial file shows how many times a cpu slab has been full and it has been refilled by using a slab from the list @@ -66,7 +66,7 @@ What: /sys/kernel/slab//alloc_refill Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_refill file shows how many times the per-cpu freelist was empty but there were objects available as the result of @@ -77,7 +77,7 @@ What: /sys/kernel/slab//alloc_slab Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_slab file is shows how many times a new slab had to be allocated from the page allocator. 
It can be written to @@ -88,7 +88,7 @@ What: /sys/kernel/slab//alloc_slowpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The alloc_slowpath file shows how many objects have been allocated using the slow path because of a refill or @@ -100,7 +100,7 @@ What: /sys/kernel/slab//cache_dma Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The cache_dma file is read-only and specifies whether objects are from ZONE_DMA. @@ -110,7 +110,7 @@ What: /sys/kernel/slab//cpu_slabs Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The cpu_slabs file is read-only and displays how many cpu slabs are active and their NUMA locality. @@ -119,7 +119,7 @@ What: /sys/kernel/slab//cpuslab_flush Date: April 2009 KernelVersion: 2.6.31 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The file cpuslab_flush shows how many times a cache's cpu slabs have been flushed as the result of destroying or shrinking a @@ -132,7 +132,7 @@ What: /sys/kernel/slab//ctor Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The ctor file is read-only and specifies the cache's object constructor function, which is invoked for each object when a @@ -142,7 +142,7 @@ What: /sys/kernel/slab//deactivate_empty Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The deactivate_empty file shows how many times an empty cpu slab was deactivated. It can be written to clear the current count. @@ -152,7 +152,7 @@ What: /sys/kernel/slab//deactivate_full Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The deactivate_full file shows how many times a full cpu slab was deactivated. It can be written to clear the current count. @@ -162,7 +162,7 @@ What: /sys/kernel/slab//deactivate_remote_frees Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The deactivate_remote_frees file shows how many times a cpu slab has been deactivated and contained free objects that were freed @@ -173,7 +173,7 @@ What: /sys/kernel/slab//deactivate_to_head Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The deactivate_to_head file shows how many times a partial cpu slab was deactivated and added to the head of its node's partial @@ -184,7 +184,7 @@ What: /sys/kernel/slab//deactivate_to_tail Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The deactivate_to_tail file shows how many times a partial cpu slab was deactivated and added to the tail of its node's partial @@ -195,7 +195,7 @@ What: /sys/kernel/slab//destroy_by_rcu Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The destroy_by_rcu file is read-only and specifies whether slabs (not objects) are freed by rcu. 
@@ -204,7 +204,7 @@ What: /sys/kernel/slab//free_add_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_add_partial file shows how many times an object has been freed in a full slab so that it had to added to its node's @@ -215,7 +215,7 @@ What: /sys/kernel/slab//free_calls Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_calls file is read-only and lists the locations of object frees if slab debugging is enabled (see @@ -225,7 +225,7 @@ What: /sys/kernel/slab//free_fastpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_fastpath file shows how many objects have been freed using the fast path because it was an object from the cpu slab. @@ -236,7 +236,7 @@ What: /sys/kernel/slab//free_frozen Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_frozen file shows how many objects have been freed to a frozen slab (i.e. a remote cpu slab). It can be written to @@ -247,7 +247,7 @@ What: /sys/kernel/slab//free_remove_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_remove_partial file shows how many times an object has been freed to a now-empty slab so that it had to be removed from @@ -259,7 +259,7 @@ What: /sys/kernel/slab//free_slab Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_slab file shows how many times an empty slab has been freed back to the page allocator. It can be written to clear @@ -270,7 +270,7 @@ What: /sys/kernel/slab//free_slowpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The free_slowpath file shows how many objects have been freed using the slow path (i.e. to a full or partial slab). It can @@ -281,7 +281,7 @@ What: /sys/kernel/slab//hwcache_align Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The hwcache_align file is read-only and specifies whether objects are aligned on cachelines. @@ -301,7 +301,7 @@ What: /sys/kernel/slab//object_size Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The object_size file is read-only and specifies the cache's object size. @@ -310,7 +310,7 @@ What: /sys/kernel/slab//objects Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The objects file is read-only and displays how many objects are active and from which nodes they are from. 
@@ -319,7 +319,7 @@ What: /sys/kernel/slab//objects_partial Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The objects_partial file is read-only and displays how many objects are on partial slabs and from which nodes they are @@ -329,7 +329,7 @@ What: /sys/kernel/slab//objs_per_slab Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The file objs_per_slab is read-only and specifies how many objects may be allocated from a single slab of the order @@ -339,7 +339,7 @@ What: /sys/kernel/slab//order Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The order file specifies the page order at which new slabs are allocated. It is writable and can be changed to increase the @@ -356,7 +356,7 @@ What: /sys/kernel/slab//order_fallback Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The order_fallback file shows how many times an allocation of a new slab has not been possible at the cache's order and instead @@ -369,7 +369,7 @@ What: /sys/kernel/slab//partial Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The partial file is read-only and displays how long many partial slabs there are and how long each node's list is. @@ -378,7 +378,7 @@ What: /sys/kernel/slab//poison Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The poison file specifies whether objects should be poisoned when a new slab is allocated. @@ -387,7 +387,7 @@ What: /sys/kernel/slab//reclaim_account Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The reclaim_account file specifies whether the cache's objects are reclaimable (and grouped by their mobility). @@ -396,7 +396,7 @@ What: /sys/kernel/slab//red_zone Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The red_zone file specifies whether the cache's objects are red zoned. @@ -405,7 +405,7 @@ What: /sys/kernel/slab//remote_node_defrag_ratio Date: January 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The file remote_node_defrag_ratio specifies the percentage of times SLUB will attempt to refill the cpu slab with a partial @@ -419,7 +419,7 @@ What: /sys/kernel/slab//sanity_checks Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The sanity_checks file specifies whether expensive checks should be performed on free and, at minimum, enables double free @@ -430,7 +430,7 @@ What: /sys/kernel/slab//shrink Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The shrink file is used to reclaim unused slab cache memory from a cache. Empty per-cpu or partial slabs @@ -446,7 +446,7 @@ What: /sys/kernel/slab//slab_size Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The slab_size file is read-only and specifies the object size with metadata (debugging information and alignment) in bytes. 
@@ -455,7 +455,7 @@ What: /sys/kernel/slab//slabs Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The slabs file is read-only and displays how long many slabs there are (both cpu and partial) and from which nodes they are @@ -465,7 +465,7 @@ What: /sys/kernel/slab//store_user Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The store_user file specifies whether the location of allocation or free should be tracked for a cache. @@ -474,7 +474,7 @@ What: /sys/kernel/slab//total_objects Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The total_objects file is read-only and displays how many total objects a cache has and from which nodes they are from. @@ -483,7 +483,7 @@ What: /sys/kernel/slab//trace Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: The trace file specifies whether object allocations and frees should be traced. @@ -492,7 +492,7 @@ What: /sys/kernel/slab//validate Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg , - Christoph Lameter + Christoph Lameter Description: Writing to the validate file causes SLUB to traverse all of its cache's objects and check the validity of metadata. @@ -506,14 +506,14 @@ Description: What: /sys/kernel/slab//slabs_cpu_partial Date: Aug 2011 -Contact: Christoph Lameter +Contact: Christoph Lameter Description: This read-only file shows the number of partialli allocated frozen slabs. What: /sys/kernel/slab//cpu_partial Date: Aug 2011 -Contact: Christoph Lameter +Contact: Christoph Lameter Description: This read-only file shows the number of per cpu partial pages to keep around. diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index a3e2edb3d274..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson -Modified by Christoph Lameter +Modified by Christoph Lameter .. CONTENTS: diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index f401af5e2f09..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. 
- Modified by Paul Jackson +- Modified by Christoph Lameter - Modified by Paul Menage - Modified by Hidetoshi Seto diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst index 982215723582..3bf7f99cd7bb 100644 --- a/Documentation/networking/arcnet-hardware.rst +++ b/Documentation/networking/arcnet-hardware.rst @@ -3152,7 +3152,7 @@ Tiara (model unknown) --------------- - - from Christoph Lameter + - from Christoph Lameter Here is information about my card as far as I could figure it out:: diff --git a/MAINTAINERS b/MAINTAINERS index d8e6c1beef5c..2f80c618d325 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19044,7 +19044,7 @@ F: drivers/net/ethernet/pensando/ PER-CPU MEMORY ALLOCATOR M: Dennis Zhou M: Tejun Heo -M: Christoph Lameter +M: Christoph Lameter L: linux-mm@kvack.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git @@ -22407,7 +22407,7 @@ F: Documentation/devicetree/bindings/nvmem/layouts/kontron,sl28-vpd.yaml F: drivers/nvmem/layouts/sl28vpd.c SLAB ALLOCATOR -M: Christoph Lameter +M: Christoph Lameter M: David Rientjes M: Andrew Morton M: Vlastimil Babka diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index 0aeb0e276a3e..c16cdeaa505e 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -375,7 +375,7 @@ do { \ } while (0) /* - * this_cpu operations (C) 2008-2013 Christoph Lameter + * this_cpu operations (C) 2008-2013 Christoph Lameter * * Optimized manipulation for memory allocated through the per cpu * allocator or for addresses of per cpu variables. diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index fc18fe274505..8e0125dc0522 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -4,7 +4,7 @@ * * Copyright (C) 2008 Qumranet, Inc. * Copyright (C) 2008 SGI - * Christoph Lameter + * Christoph Lameter */ #include diff --git a/mm/slab_common.c b/mm/slab_common.c index 5be257e03c7c..bfe7c40eeee1 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -2,7 +2,7 @@ /* * Slab allocator functions that are independent of the allocator strategy * - * (C) 2012 Christoph Lameter + * (C) 2012 Christoph Lameter */ #include diff --git a/mm/vmstat.c b/mm/vmstat.c index 4c268ce39ff2..d888c248d99f 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -7,7 +7,7 @@ * * zoned VM statistics * Copyright (C) 2006 Silicon Graphics, Inc., - * Christoph Lameter + * Christoph Lameter * Copyright (C) 2008-2014 Christoph Lameter */ #include From f7f68274e476c49e29bfb81daf4ad717fe9880c6 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sat, 19 Apr 2025 06:36:49 +0800 Subject: [PATCH 148/285] mm/vmalloc.c: change purge_nodes to a local static variable Patch series "mm/vmalloc.c: code cleanup and improvements", v2. These changes were made from code inspection in mm/vmalloc.c. This patch (of 5): Static variable 'purge_nodes' is defined in global scope, while it's only used in function __purge_vmap_area_lazy(). It mainly serves to avoid repeated memory allocation, especially when NR_CPUS is big. A local static variable can satisfy the same demand and improve code readability. Hence move its definition into __purge_vmap_area_lazy().
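To illustrate the pattern (an editorial sketch, not code from this patch): a function-local static has the same storage duration as a file-scope static, so the potentially large cpumask_t stays in BSS rather than on the stack, while its name becomes visible only where it is used:

    static void __purge_example(void)
    {
    	/* one BSS instance shared by all calls; never on the stack */
    	static cpumask_t purge_nodes;

    	/* reused across calls, so reset before use */
    	cpumask_clear(&purge_nodes);
    	/* ... record the nodes that need purging ... */
    }

As with the file-scope variable, this relies on callers being serialized, which the purge path provides via vmap_purge_lock.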
Link: https://lkml.kernel.org/r/20250418223653.243436-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250418223653.243436-2-bhe@redhat.com Signed-off-by: Baoquan He Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Shivank Garg Signed-off-by: Andrew Morton --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 3fd802134e4e..cd733387b1a0 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2127,7 +2127,6 @@ static DEFINE_MUTEX(vmap_purge_lock); /* for per-CPU blocks */ static void purge_fragmented_blocks_allcpus(void); -static cpumask_t purge_nodes; static void reclaim_list_global(struct list_head *head) @@ -2260,6 +2259,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, { unsigned long nr_purged_areas = 0; unsigned int nr_purge_helpers; + static cpumask_t purge_nodes; unsigned int nr_purge_nodes; struct vmap_node *vn; int i; From 81262d85aef4dbdd878d6ce982fe51c06ed2720c Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sat, 19 Apr 2025 06:36:50 +0800 Subject: [PATCH 149/285] mm/vmalloc.c: find the vmap of vmap_nodes in reverse order When finding a VA in vn->busy, if the VA spans several zones and the passed addr is not the same as va->va_start, we should scan the vn in reverse order, because the starting address of the VA must be smaller than the passed addr if it really resides in the VA. E.g. on a system with nr_vmap_nodes=100, <----va----> -|-----|-----|-----|-----|-----|-----|-----|-----|-----|- ... n-1 n n+1 n+2 ... 100 0 1 VA resides in node 'n' whereas it spans 'n', 'n+1' and 'n+2'. If the passed addr is within 'n+2', we should try nodes backwards, on 'n+1' and then 'n', and succeed very soon. Meanwhile we still need to loop around because the VA could span nodes from 'n' to node 100, node 0, node 1. Anyway, changing the search to reverse order can improve efficiency on systems with many CPUs. Link: https://lkml.kernel.org/r/20250418223653.243436-3-bhe@redhat.com Signed-off-by: Baoquan He Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Shivank Garg Cc: Vishal Moola (Oracle) Signed-off-by: Andrew Morton --- mm/vmalloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index cd733387b1a0..26d86ab5e59e 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2435,7 +2435,7 @@ struct vmap_area *find_vmap_area(unsigned long addr) if (va) return va; - } while ((i = (i + 1) % nr_vmap_nodes) != j); + } while ((i = (i + nr_vmap_nodes - 1) % nr_vmap_nodes) != j); return NULL; } @@ -2461,7 +2461,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr) if (va) return va; - } while ((i = (i + 1) % nr_vmap_nodes) != j); + } while ((i = (i + nr_vmap_nodes - 1) % nr_vmap_nodes) != j); return NULL; } From 4f05024eba021f1885c044c8515b6a2674b652de Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sat, 19 Apr 2025 06:36:51 +0800 Subject: [PATCH 150/285] mm/vmalloc.c: optimize code in decay_va_pool_node() a little bit When purging lazily freed vmap areas, VAs stored in vn->pool[] will also be taken into the free vmap tree, partially or completely; that is done in decay_va_pool_node(). When doing that, for each pool of a node, the whole list is detached from the pool for handling. At this time, that pool is empty. It's not necessary to update the pool size each time one VA is removed and added into the free vmap tree. Here change the code to update the pool size once, when attaching the pool back.
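In sketch form (simplified from the diff below, with locking and the list walk elided), the final length is computed up front and published once:

    pool_len = n_decay = pool->len;
    WRITE_ONCE(pool->len, 0);	/* list is detached; pool is empty */
    if (!full_decay)
    	n_decay >>= 2;		/* decay ~25% of the objects */
    pool_len -= n_decay;	/* entries that will remain */
    /* ... move n_decay entries to the free vmap tree ... */
    WRITE_ONCE(pool->len, pool_len);	/* single update on re-attach */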
Link: https://lkml.kernel.org/r/20250418223653.243436-4-bhe@redhat.com Signed-off-by: Baoquan He Reviewed-by: Uladzislau Rezki (Sony) Cc: Shivank Garg Cc: Vishal Moola (Oracle) Signed-off-by: Andrew Morton --- mm/vmalloc.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 26d86ab5e59e..e58c2ed4bcba 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2149,7 +2149,7 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay) LIST_HEAD(decay_list); struct rb_root decay_root = RB_ROOT; struct vmap_area *va, *nva; - unsigned long n_decay; + unsigned long n_decay, pool_len; int i; for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { @@ -2163,22 +2163,20 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay) list_replace_init(&vn->pool[i].head, &tmp_list); spin_unlock(&vn->pool_lock); - if (full_decay) - WRITE_ONCE(vn->pool[i].len, 0); + pool_len = n_decay = vn->pool[i].len; + WRITE_ONCE(vn->pool[i].len, 0); /* Decay a pool by ~25% out of left objects. */ - n_decay = vn->pool[i].len >> 2; + if (!full_decay) + n_decay >>= 2; + pool_len -= n_decay; list_for_each_entry_safe(va, nva, &tmp_list, list) { + if (!n_decay--) + break; + list_del_init(&va->list); merge_or_add_vmap_area(va, &decay_root, &decay_list); - - if (!full_decay) { - WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); - - if (!--n_decay) - break; - } } /* @@ -2187,9 +2185,10 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay) * can populate the pool therefore a simple list replace * operation takes place here. */ - if (!full_decay && !list_empty(&tmp_list)) { + if (!list_empty(&tmp_list)) { spin_lock(&vn->pool_lock); list_replace_init(&tmp_list, &vn->pool[i].head); + WRITE_ONCE(vn->pool[i].len, pool_len); spin_unlock(&vn->pool_lock); } } From 8ab8442d44ee37cc21fdc11530158ee2b2be5ed5 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sat, 19 Apr 2025 06:36:52 +0800 Subject: [PATCH 151/285] mm/vmalloc: optimize function vm_unmap_aliases() Remove unneeded local variables and replace them with values. Link: https://lkml.kernel.org/r/20250418223653.243436-5-bhe@redhat.com Signed-off-by: Baoquan He Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Shivank Garg Signed-off-by: Andrew Morton --- mm/vmalloc.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e58c2ed4bcba..bfc2f2c45d89 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2929,10 +2929,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) */ void vm_unmap_aliases(void) { - unsigned long start = ULONG_MAX, end = 0; - int flush = 0; - - _vm_unmap_aliases(start, end, flush); + _vm_unmap_aliases(ULONG_MAX, 0, 0); } EXPORT_SYMBOL_GPL(vm_unmap_aliases); From b25f97d0f8043903a64e3a203b79219cc69f2f77 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sat, 19 Apr 2025 06:36:53 +0800 Subject: [PATCH 152/285] mm/vmalloc.c: return explicit error value in alloc_vmap_area() The code in alloc_vmap_area() returns the upper bound 'vend' to indicate whether the allocation succeeded or failed, which is not very clear. Change it to return explicit error values and check them to judge whether the allocation succeeded.
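The underlying trick, sketched here for illustration: kernel error codes occupy the very top of the unsigned long range (within [-MAX_ERRNO, -1] once cast), so a function that normally returns an address can return -ENOMEM, -EINVAL, -ERANGE or -ENOENT, and callers can tell the two cases apart with IS_ERR_VALUE():

    unsigned long addr;

    addr = __alloc_vmap_area(root, head, size, align, vstart, vend);
    if (IS_ERR_VALUE(addr))	/* instead of the old addr == vend check */
    	goto overflow;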
Link: https://lkml.kernel.org/r/20250418223653.243436-6-bhe@redhat.com Signed-off-by: Baoquan He Reviewed-by: Shivank Garg Reviewed-by: Uladzislau Rezki (Sony) Cc: Vishal Moola (Oracle) Signed-off-by: Andrew Morton --- mm/vmalloc.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index bfc2f2c45d89..701ea4ec8950 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1716,7 +1716,7 @@ va_clip(struct rb_root *root, struct list_head *head, */ lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT); if (!lva) - return -1; + return -ENOMEM; } /* @@ -1730,7 +1730,7 @@ va_clip(struct rb_root *root, struct list_head *head, */ va->va_start = nva_start_addr + size; } else { - return -1; + return -EINVAL; } if (type != FL_FIT_TYPE) { @@ -1759,19 +1759,19 @@ va_alloc(struct vmap_area *va, /* Check the "vend" restriction. */ if (nva_start_addr + size > vend) - return vend; + return -ERANGE; /* Update the free vmap_area. */ ret = va_clip(root, head, va, nva_start_addr, size); if (WARN_ON_ONCE(ret)) - return vend; + return ret; return nva_start_addr; } /* * Returns a start address of the newly allocated area, if success. - * Otherwise a vend is returned that indicates failure. + * Otherwise an error value is returned that indicates failure. */ static __always_inline unsigned long __alloc_vmap_area(struct rb_root *root, struct list_head *head, @@ -1796,14 +1796,13 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, va = find_vmap_lowest_match(root, size, align, vstart, adjust_search_size); if (unlikely(!va)) - return vend; + return -ENOENT; nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend); - if (nva_start_addr == vend) - return vend; #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK - find_vmap_lowest_match_check(root, head, size, align); + if (!IS_ERR_VALUE(nva_start_addr)) + find_vmap_lowest_match_check(root, head, size, align); #endif return nva_start_addr; @@ -1933,7 +1932,7 @@ node_alloc(unsigned long size, unsigned long align, struct vmap_area *va; *vn_id = 0; - *addr = vend; + *addr = -EINVAL; /* * Fallback to a global heap if not vmalloc or there @@ -2013,20 +2012,20 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, } retry: - if (addr == vend) { + if (IS_ERR_VALUE(addr)) { preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, size, align, vstart, vend); spin_unlock(&free_vmap_area_lock); } trace_alloc_vmap_area(addr, size, align, vstart, vend, IS_ERR_VALUE(addr)); /* - * If an allocation fails, the "vend" address is + * If an allocation fails, the error value is * returned. Therefore trigger the overflow path. */ - if (unlikely(addr == vend)) + if (IS_ERR_VALUE(addr)) goto overflow; va->va_start = addr; From 6bbf0e728528addd5f9dd140b03b06438f032166 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 23 Apr 2025 17:48:07 +0300 Subject: [PATCH 153/285] execmem: enforce allocation size alignment to PAGE_SIZE Before the introduction of the ROX cache, the execmem allocation size was always implicitly aligned to PAGE_SIZE inside vmalloc. However, when allocation happens from the ROX cache, this is not enforced. Make sure that the allocation size is always consistently aligned to PAGE_SIZE. Mike said: : Right now it'll make the maple trees in execmem_cache more compact.
: And it's a precaution for the case when execmem callers would want to : change permissions on unaligned range because that would WARN_ON() : loudly. Peter said : It should not have a runtime effect -- currently all this code is used : with PAGE_SIZE multiples and everything just works. But whilst I was : perusing this code, I noticed that nothing actually enforced this. If : someone were to break this assumption things will go sideways. Link: https://lkml.kernel.org/r/20250423144808.1619863-1-rppt@kernel.org Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages") Signed-off-by: Mike Rapoport (Microsoft) Suggested-by: Peter Zijlstra (Intel) Acked-by: Peter Zijlstra (Intel) Signed-off-by: Andrew Morton --- mm/execmem.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/execmem.c b/mm/execmem.c index e6c4f5076ca8..2b683e7d864d 100644 --- a/mm/execmem.c +++ b/mm/execmem.c @@ -377,6 +377,8 @@ void *execmem_alloc(enum execmem_type type, size_t size) pgprot_t pgprot = range->pgprot; void *p; + size = PAGE_ALIGN(size); + if (use_cache) p = execmem_cache_alloc(range, size); else From 8adce0857769d596c5b000d118119e3bb1d63a32 Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Thu, 24 Apr 2025 16:28:05 -0400 Subject: [PATCH 154/285] cpuset: rename cpuset_node_allowed to cpuset_current_node_allowed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "vmscan: enforce mems_effective during demotion", v5. Change reclaim to respect cpuset.mems_effective during demotion when possible. Presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to be violated. Implement cpuset_node_allowed() to check the cpuset.mems_effective associated with the mem_cgroup of the lruvec being scanned. This only applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy than mem_cgroup in v1. This requires renaming the existing cpuset_node_allowed() to be cpuset_current_node_allowed() - which is more descriptive anyway - to implement the new cpuset_node_allowed() which takes a target cgroup. This patch (of 2): Rename cpuset_node_allowed to reflect that the function checks the current task's cpuset.mems. This allows us to make a new cpuset_node_allowed function that checks a target cgroup's cpuset.mems.
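The resulting split, in sketch form (both prototypes are taken from the diffs in this series):

    /* old semantics under the new, more descriptive name:
     * checks the current task's cpuset
     */
    extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);

    /* new helper, added in the next patch: checks the cpuset of an
     * explicitly given cgroup, e.g. the one owning an lruvec
     */
    extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);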
Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net Signed-off-by: Gregory Price Acked-by: Waiman Long Acked-by: Tejun Heo Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Signed-off-by: Andrew Morton --- include/linux/cpuset.h | 4 ++-- kernel/cgroup/cpuset.c | 4 ++-- mm/page_alloc.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 5466c96a33db..d6a4fe5c3b6e 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -82,11 +82,11 @@ extern nodemask_t cpuset_mems_allowed(struct task_struct *p); void cpuset_init_current_mems_allowed(void); int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask); -extern bool cpuset_node_allowed(int node, gfp_t gfp_mask); +extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask); static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { - return cpuset_node_allowed(zone_to_nid(z), gfp_mask); + return cpuset_current_node_allowed(zone_to_nid(z), gfp_mask); } static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 306b60430091..54f6af362191 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -4164,7 +4164,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs) } /* - * cpuset_node_allowed - Can we allocate on a memory node? + * cpuset_current_node_allowed - Can current task allocate on a memory node? * @node: is this an allowed node? * @gfp_mask: memory allocation flags * @@ -4203,7 +4203,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs) * GFP_KERNEL - any node in enclosing hardwalled cpuset ok * GFP_USER - only nodes in current tasks mems allowed ok. */ -bool cpuset_node_allowed(int node, gfp_t gfp_mask) +bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) { struct cpuset *cs; /* current cpuset ancestors */ bool allowed; /* is allocation in zone z allowed? */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 237dcca69e51..4ee55edf1ad7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3543,7 +3543,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, retry: /* * Scan zonelist, looking for a zone with enough free. - * See also cpuset_node_allowed() comment in kernel/cgroup/cpuset.c. + * See also cpuset_current_node_allowed() comment in kernel/cgroup/cpuset.c. */ no_fallback = alloc_flags & ALLOC_NOFRAGMENT; z = ac->preferred_zoneref; @@ -4230,7 +4230,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order) /* * Ignore cpuset mems for non-blocking __GFP_HIGH (probably * GFP_ATOMIC) rather than fail, see the comment for - * cpuset_node_allowed(). + * cpuset_current_node_allowed(). */ if (alloc_flags & ALLOC_MIN_RESERVE) alloc_flags &= ~ALLOC_CPUSET; From 7d709f49babc28907b0ac60228f522d2e6216add Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Thu, 24 Apr 2025 16:28:06 -0400 Subject: [PATCH 155/285] vmscan,cgroup: apply mems_effective to reclaim MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit It is possible for a reclaimer to cause demotions of an lruvec belonging to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply this limitation based on the lruvec's memcg and prevent demotion. 
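Simplified, the demotion gate becomes the following (the numa_demotion_enabled and sc->no_demotion checks of the real code are omitted here):

    static bool can_demote(int nid, struct scan_control *sc,
    		       struct mem_cgroup *memcg)
    {
    	int demotion_nid = next_demotion_node(nid);

    	if (demotion_nid == NUMA_NO_NODE)
    		return false;
    	/* fall back (e.g. to swap) if the demotion target is not in
    	 * the cgroup's effective cpuset.mems
    	 */
    	return mem_cgroup_node_allowed(memcg, demotion_nid);
    }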
Notably, this may still allow demotion of shared libraries or any memory first instantiated in another cgroup. This means cpusets still cannot guarantee complete isolation when demotion is enabled, and the docs have been updated to reflect this. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently - with the noted exceptions. Note on locking: The cgroup_get_e_css reference protects the css->effective_mems, and calls to this interface would be subject to the same race conditions associated with a non-atomic access to cs->effective_mems. So while this interface cannot make strong guarantees of correctness, it can therefore avoid taking a global or rcu_read_lock for performance. Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net Signed-off-by: Gregory Price Suggested-by: Shakeel Butt Suggested-by: Waiman Long Acked-by: Tejun Heo Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Reviewed-by: Waiman Long Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Signed-off-by: Andrew Morton --- .../ABI/testing/sysfs-kernel-mm-numa | 16 +++++--- include/linux/cpuset.h | 5 +++ include/linux/memcontrol.h | 7 ++++ kernel/cgroup/cpuset.c | 36 ++++++++++++++++ mm/memcontrol.c | 6 +++ mm/vmscan.c | 41 +++++++++++-------- 6 files changed, 89 insertions(+), 22 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa index 77e559d4ed80..90e375ff54cb 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa @@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim Allowing page migration during reclaim enables these systems to migrate pages from fast tiers to slow tiers when the fast tier is under pressure. This migration - is performed before swap. It may move data to a NUMA - node that does not fall into the cpuset of the - allocating process which might be construed to violate - the guarantees of cpusets. This should not be enabled - on systems which need strict cpuset location - guarantees. + is performed before swap if an eligible numa node is + present in cpuset.mems for the cgroup (or if cpuset v1 + is being used). If cpuset.mems changes at runtime, it + may move data to a NUMA node that does not fall into + the new cpuset.mems, which might be construed + to violate the guarantees of cpusets. Shared memory, + such as libraries, owned by another cgroup may still be + demoted and result in memory use on a node not present + in cpuset.mems. This should not be enabled on systems + which need strict cpuset location guarantees.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index d6a4fe5c3b6e..2ddb256187b5 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -173,6 +173,7 @@ static inline void set_mems_allowed(nodemask_t nodemask) task_unlock(current); } +extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid); #else /* !CONFIG_CPUSETS */ static inline bool cpusets_enabled(void) { return false; } @@ -293,6 +294,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) return false; } +static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid) +{ + return true; +} #endif /* !CONFIG_CPUSETS */ #endif /* _LINUX_CPUSET_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5264d148bdd9..ac1b003ee5b8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1736,6 +1736,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg, rcu_read_unlock(); } +bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid); + #else static inline bool mem_cgroup_kmem_disabled(void) { @@ -1797,6 +1799,11 @@ static inline ino_t page_cgroup_ino(struct page *page) { return 0; } + +static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid) +{ + return true; +} #endif /* CONFIG_MEMCG */ #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 54f6af362191..83639a12883d 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -4237,6 +4237,42 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) return allowed; } +bool cpuset_node_allowed(struct cgroup *cgroup, int nid) +{ + struct cgroup_subsys_state *css; + struct cpuset *cs; + bool allowed; + + /* + * In v1, mem_cgroup and cpuset are unlikely to be in the same hierarchy + * and mems_allowed is likely to be empty even if we could get to it, + * so return true to avoid taking a global lock on the empty check. + */ + if (!cpuset_v2()) + return true; + + css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys); + if (!css) + return true; + + /* + * Normally, accessing effective_mems would require the cpuset_mutex + * or callback_lock - but node_isset is atomic and the reference + * taken via cgroup_get_e_css is sufficient to protect css. + * + * Since this interface is intended for use by migration paths, we + * relax locking here to avoid taking global locks - while accepting + * there may be rare scenarios where the result may be inaccurate. + * + * Reclaim and migration are subject to these same race conditions, and + * cannot make strong isolation guarantees, so this is acceptable. + */ + cs = container_of(css, struct cpuset, css); + allowed = node_isset(nid, cs->effective_mems); + css_put(css); + return allowed; +} + /** * cpuset_spread_node() - On which node to begin search for a page * @rotor: round robin rotor diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3020bb82c94c..e17e0a9ceee0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -5523,3 +5524,8 @@ static int __init mem_cgroup_swap_init(void) subsys_initcall(mem_cgroup_swap_init); #endif /* CONFIG_SWAP */ + +bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid) +{ + return memcg ?
cpuset_node_allowed(memcg->css.cgroup, nid) : true; +} diff --git a/mm/vmscan.c b/mm/vmscan.c index ceffb20bb8c9..a4fbd52a82d4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -342,16 +342,22 @@ static void flush_reclaim_state(struct scan_control *sc) } } -static bool can_demote(int nid, struct scan_control *sc) +static bool can_demote(int nid, struct scan_control *sc, + struct mem_cgroup *memcg) { + int demotion_nid; + if (!numa_demotion_enabled) return false; if (sc && sc->no_demotion) return false; - if (next_demotion_node(nid) == NUMA_NO_NODE) + + demotion_nid = next_demotion_node(nid); + if (demotion_nid == NUMA_NO_NODE) return false; - return true; + /* If demotion node isn't in the cgroup's mems_allowed, fall back */ + return mem_cgroup_node_allowed(memcg, demotion_nid); } static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, @@ -376,7 +382,7 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, * * Can it be reclaimed from this node via demotion? */ - return can_demote(nid, sc); + return can_demote(nid, sc, memcg); } /* @@ -1096,7 +1102,8 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) */ static unsigned int shrink_folio_list(struct list_head *folio_list, struct pglist_data *pgdat, struct scan_control *sc, - struct reclaim_stat *stat, bool ignore_references) + struct reclaim_stat *stat, bool ignore_references, + struct mem_cgroup *memcg) { struct folio_batch free_folios; LIST_HEAD(ret_folios); @@ -1109,7 +1116,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, folio_batch_init(&free_folios); memset(stat, 0, sizeof(*stat)); cond_resched(); - do_demote_pass = can_demote(pgdat->node_id, sc); + do_demote_pass = can_demote(pgdat->node_id, sc, memcg); retry: while (!list_empty(folio_list)) { @@ -1658,7 +1665,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, */ noreclaim_flag = memalloc_noreclaim_save(); nr_reclaimed = shrink_folio_list(&clean_folios, zone->zone_pgdat, &sc, - &stat, true); + &stat, true, NULL); memalloc_noreclaim_restore(noreclaim_flag); list_splice(&clean_folios, folio_list); @@ -2029,7 +2036,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, if (nr_taken == 0) return 0; - nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false); + nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false, + lruvec_memcg(lruvec)); spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &folio_list); @@ -2212,7 +2220,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list, .no_demotion = 1, }; - nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true); + nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true, NULL); while (!list_empty(folio_list)) { folio = lru_to_folio(folio_list); list_del(&folio->lru); @@ -2644,7 +2652,7 @@ out: * Anonymous LRU management is a waste if there is * ultimately no way to reclaim the memory. 
*/ -static bool can_age_anon_pages(struct pglist_data *pgdat, +static bool can_age_anon_pages(struct lruvec *lruvec, struct scan_control *sc) { /* Aging the anon LRU is valuable if swap is present: */ @@ -2652,7 +2660,8 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return true; /* Also valuable if anon pages can be demoted: */ - return can_demote(pgdat->node_id, sc); + return can_demote(lruvec_pgdat(lruvec)->node_id, sc, + lruvec_memcg(lruvec)); } #ifdef CONFIG_LRU_GEN @@ -2730,7 +2739,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) if (!sc->may_swap) return 0; - if (!can_demote(pgdat->node_id, sc) && + if (!can_demote(pgdat->node_id, sc, memcg) && mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) return 0; @@ -4693,7 +4702,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap if (list_empty(&list)) return scanned; retry: - reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); + reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg); sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; sc->nr_reclaimed += reclaimed; trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, @@ -5848,7 +5857,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. */ - if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && + if (can_age_anon_pages(lruvec, sc) && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); @@ -6679,10 +6688,10 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) return; } - if (!can_age_anon_pages(pgdat, sc)) + lruvec = mem_cgroup_lruvec(NULL, pgdat); + if (!can_age_anon_pages(lruvec, sc)) return; - lruvec = mem_cgroup_lruvec(NULL, pgdat); if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON)) return; From 60fbb14396d5ee1f430c55883d8e603d3229163c Mon Sep 17 00:00:00 2001 From: Gavin Guo Date: Fri, 25 Apr 2025 18:38:58 +0800 Subject: [PATCH 156/285] mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked() Patch series "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers", v2. The patch series enhances the folio verification by leveraging the existing page_vma_mapped_walk() mechanism and removing redundant folio pointer passing. This patch (of 2): split_huge_pmd_locked() currently performs redundant checks for migration entries and folio validation that are already handled by the page_vma_mapped_walk mechanism in try_to_migrate_one. Specifically, page_vma_mapped_walk already ensures that: - The folio is properly mapped in the given VMA area - pmd_trans_huge, pmd_devmap, and migration entry validation are performed To leverage page_vma_mapped_walk's work, move the TTU_SPLIT_HUGE_PMD handling into the page_vma_mapped_walk() loop and remove these duplicate checks from split_huge_pmd_locked().
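For reference, the resulting control flow in try_to_migrate_one() looks roughly like the sketch below (condensed from the diff that follows; the PMD migration-entry path and the PTE-level handling are elided):

	while (page_vma_mapped_walk(&pvmw)) {
		/* PMD-mapped THP: pvmw has already validated the PMD */
		if (!pvmw.pte) {
			if (flags & TTU_SPLIT_HUGE_PMD) {
				split_huge_pmd_locked(vma, pvmw.address,
						      pvmw.pmd, true, NULL);
				ret = false;
				page_vma_mapped_walk_done(&pvmw);
				break;
			}
			/* ... PMD migration entry handling ... */
		}
		/* ... PTE-level processing ... */
	}

Note that the split now happens under the PMD lock already taken by page_vma_mapped_walk(), so no separate split_huge_pmd_address() pass is needed before the walk.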
Link: https://lkml.kernel.org/r/20250425103859.825879-1-gavinguo@igalia.com Link: https://lkml.kernel.org/r/20250425103859.825879-2-gavinguo@igalia.com Link: https://lore.kernel.org/all/98d1d195-7821-4627-b518-83103ade56c0@redhat.com/ Link: https://lore.kernel.org/all/91599a3c-e69e-4d79-bac5-5013c96203d7@redhat.com/ Signed-off-by: Gavin Guo Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Cc: Florent Revest Cc: Gavin Shan Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Signed-off-by: Andrew Morton --- mm/huge_memory.c | 21 ++------------------- mm/rmap.c | 18 +++++++++--------- 2 files changed, 11 insertions(+), 28 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fdcf0a6049b9..d8a5bb602008 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3082,27 +3082,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, bool freeze, struct folio *folio) { - bool pmd_migration = is_pmd_migration_entry(*pmd); - - VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio)); VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio)); - VM_BUG_ON(freeze && !folio); - - /* - * When the caller requests to set up a migration entry, we - * require a folio to check the PMD against. Otherwise, there - * is a risk of replacing the wrong folio. - */ - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) { - /* - * Do not apply pmd_folio() to a migration entry; and folio lock - * guarantees that it must be of the wrong folio anyway. - */ - if (folio && (pmd_migration || folio != pmd_folio(*pmd))) - return; + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || + is_pmd_migration_entry(*pmd)) __split_huge_pmd_locked(vma, pmd, address, freeze); - } } void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, diff --git a/mm/rmap.c b/mm/rmap.c index 67bb273dfb80..b53a4dcaeaae 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2291,13 +2291,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; - /* - * unmap_page() in mm/huge_memory.c is the only user of migration with - * TTU_SPLIT_HUGE_PMD and it wants to freeze. - */ - if (flags & TTU_SPLIT_HUGE_PMD) - split_huge_pmd_address(vma, address, true, folio); - /* * For THP, we have to assume the worse case ie pmd for invalidation. * For hugetlb, it could be much worse if we need to do pud @@ -2323,9 +2316,16 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION /* PMD-mapped THP migration entry */ if (!pvmw.pte) { + if (flags & TTU_SPLIT_HUGE_PMD) { + split_huge_pmd_locked(vma, pvmw.address, + pvmw.pmd, true, NULL); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION subpage = folio_page(folio, pmd_pfn(*pvmw.pmd) - folio_pfn(folio)); VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) || @@ -2337,8 +2337,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, break; } continue; - } #endif + } /* Unexpected PMD-mapped THP? 
*/ VM_BUG_ON_FOLIO(!pvmw.pte, folio); From b960818d51b3c8275d0f80c5fc4156eb2ff2fde6 Mon Sep 17 00:00:00 2001 From: Gavin Guo Date: Fri, 25 Apr 2025 18:38:59 +0800 Subject: [PATCH 157/285] mm/huge_memory: remove useless folio pointers passing Since the previous commit "mm/huge_memory: Adjust try_to_migrate_one() and split_huge_pmd_locked()" has simplified the logic by leveraging the folio verification in page_vma_mapped_walk(), this patch removes the unnecessary folio pointers passing. Link: https://lkml.kernel.org/r/20250425103859.825879-3-gavinguo@igalia.com Link: https://lore.kernel.org/all/98d1d195-7821-4627-b518-83103ade56c0@redhat.com/ Link: https://lore.kernel.org/all/91599a3c-e69e-4d79-bac5-5013c96203d7@redhat.com/ Signed-off-by: Gavin Guo Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Cc: Florent Revest Cc: Gavin Shan Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Cc: Miaohe Lin Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 15 +++++++-------- mm/huge_memory.c | 16 ++++++++-------- mm/memory.c | 4 ++-- mm/mprotect.c | 2 +- mm/rmap.c | 4 ++-- 5 files changed, 20 insertions(+), 21 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f190998b2ebd..2f190c90192d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -395,7 +395,7 @@ static inline int split_huge_page(struct page *page) void deferred_split_folio(struct folio *folio, bool partially_mapped); void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long address, bool freeze, struct folio *folio); + unsigned long address, bool freeze); #define split_huge_pmd(__vma, __pmd, __address) \ do { \ @@ -403,12 +403,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \ || pmd_devmap(*____pmd)) \ __split_huge_pmd(__vma, __pmd, __address, \ - false, NULL); \ + false); \ } while (0) - void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, - bool freeze, struct folio *folio); + bool freeze); void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address); @@ -501,7 +500,7 @@ static inline bool thp_migration_supported(void) } void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, - pmd_t *pmd, bool freeze, struct folio *folio); + pmd_t *pmd, bool freeze); bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct folio *folio); @@ -576,12 +575,12 @@ static inline void deferred_split_folio(struct folio *folio, bool partially_mapp do { } while (0) static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long address, bool freeze, struct folio *folio) {} + unsigned long address, bool freeze) {} static inline void split_huge_pmd_address(struct vm_area_struct *vma, - unsigned long address, bool freeze, struct folio *folio) {} + unsigned long address, bool freeze) {} static inline void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, - bool freeze, struct folio *folio) {} + bool freeze) {} static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d8a5bb602008..2780a12b25f0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1785,7 +1785,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_free(dst_mm, pgtable); spin_unlock(src_ptl); spin_unlock(dst_ptl); - 
__split_huge_pmd(src_vma, src_pmd, addr, false, NULL); + __split_huge_pmd(src_vma, src_pmd, addr, false); return -EAGAIN; } add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); @@ -2007,7 +2007,7 @@ unlock_fallback: folio_unlock(folio); spin_unlock(vmf->ptl); fallback: - __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL); + __split_huge_pmd(vma, vmf->pmd, vmf->address, false); return VM_FAULT_FALLBACK; } @@ -3080,7 +3080,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, - pmd_t *pmd, bool freeze, struct folio *folio) + pmd_t *pmd, bool freeze) { VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || @@ -3089,7 +3089,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, } void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long address, bool freeze, struct folio *folio) + unsigned long address, bool freeze) { spinlock_t *ptl; struct mmu_notifier_range range; @@ -3099,20 +3099,20 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); mmu_notifier_invalidate_range_start(&range); ptl = pmd_lock(vma->vm_mm, pmd); - split_huge_pmd_locked(vma, range.start, pmd, freeze, folio); + split_huge_pmd_locked(vma, range.start, pmd, freeze); spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, - bool freeze, struct folio *folio) + bool freeze) { pmd_t *pmd = mm_find_pmd(vma->vm_mm, address); if (!pmd) return; - __split_huge_pmd(vma, pmd, address, freeze, folio); + __split_huge_pmd(vma, pmd, address, freeze); } static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address) @@ -3124,7 +3124,7 @@ static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned if (!IS_ALIGNED(address, HPAGE_PMD_SIZE) && range_in_vma(vma, ALIGN_DOWN(address, HPAGE_PMD_SIZE), ALIGN(address, HPAGE_PMD_SIZE))) - split_huge_pmd_address(vma, address, false, NULL); + split_huge_pmd_address(vma, address, false); } void vma_adjust_trans_huge(struct vm_area_struct *vma, diff --git a/mm/memory.c b/mm/memory.c index f18266b5a0a9..be124dadec9e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1808,7 +1808,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, next = pmd_addr_end(addr, end); if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) - __split_huge_pmd(vma, pmd, addr, false, NULL); + __split_huge_pmd(vma, pmd, addr, false); else if (zap_huge_pmd(tlb, vma, pmd, addr)) { addr = next; continue; @@ -5932,7 +5932,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) split: /* COW or write-notify handled on pte level: split pmd. 
*/ - __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL); + __split_huge_pmd(vma, vmf->pmd, vmf->address, false); return VM_FAULT_FALLBACK; } diff --git a/mm/mprotect.c b/mm/mprotect.c index 62c1f7945741..88608d0dc2c2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -379,7 +379,7 @@ again: if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || pgtable_split_needed(vma, cp_flags)) { - __split_huge_pmd(vma, pmd, addr, false, NULL); + __split_huge_pmd(vma, pmd, addr, false); /* * For file-backed, the pmd could have been * cleared; make sure pmd populated if diff --git a/mm/rmap.c b/mm/rmap.c index b53a4dcaeaae..4992005885ef 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1944,7 +1944,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * restart so we can process the PTE-mapped THP. */ split_huge_pmd_locked(vma, pvmw.address, - pvmw.pmd, false, folio); + pvmw.pmd, false); flags &= ~TTU_SPLIT_HUGE_PMD; page_vma_mapped_walk_restart(&pvmw); continue; @@ -2320,7 +2320,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, if (!pvmw.pte) { if (flags & TTU_SPLIT_HUGE_PMD) { split_huge_pmd_locked(vma, pvmw.address, - pvmw.pmd, true, NULL); + pvmw.pmd, true); ret = false; page_vma_mapped_walk_done(&pvmw); break; From bc9817bb7a21f64fbca2c4b83811d943036ec870 Mon Sep 17 00:00:00 2001 From: Huan Yang Date: Fri, 25 Apr 2025 11:19:23 +0800 Subject: [PATCH 158/285] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Patch series "Use kmem_cache for memcg alloc", v3. (willy tldr: "you've gone from allocating 8 objects per 32KiB to allocating 13 objects per 32KiB, a 62% improvement in memory consumption" [1]) The mem_cgroup_alloc function creates the mem_cgroup struct and its associated structures, including mem_cgroup_per_node. Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel, 1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure), we can observe the memory allocation for these structures using the following shell commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup # Trigger allocation if the cgroup subtree does not enable memcg echo +memory > /sys/fs/cgroup/cgroup.subtree_control Ftrace Output: # mem_cgroup struct allocation sh-6312 [000] ..... 58015.698365: kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-6312 [000] ..... 58015.698389: kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false Key Observations: 1. Both structures use kmalloc with requested sizes between 2KB-4KB 2. Allocation alignment forces 4KB slab usage due to pre-defined sizes (64B, 128B,..., 2KB, 4KB, 8KB) 3. Memory waste per memcg instance: Base struct: 4096 - 2312 = 1784 bytes Per-node struct: 4096 - 2896 = 1200 bytes Total waste: 2984 bytes (1-node system) NUMA scaling: (1200 + 8) * nr_node_ids bytes So it is a bit wasteful.
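In essence, the fix replaces the open-coded kzalloc() calls with dedicated caches sized exactly for the structures. A condensed preview, taken from the diffs of patches 2 and 3 below:

	/* cache sized for mem_cgroup plus its flexible nodeinfo[] array */
	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
					 SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);

	/* allocations then come from the exact-sized cache */
	memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL);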
This patchset introduces dedicated kmem_caches: Patch2 - mem_cgroup kmem_cache - memcg_cachep Patch3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep The benefits of this change can be observed with the following tracing commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup # In another terminal: echo +memory > /sys/fs/cgroup/cgroup.subtree_control The output might now look like this: # mem_cgroup struct allocation sh-9827 [000] ..... 289.513598: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9827 [000] ..... 289.513602: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false This indicates that the `mem_cgroup` struct now requests 2312 bytes and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes and is allocated 2944 bytes. The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the `kmem_cache`. Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as: # mem_cgroup struct allocation sh-9269 [003] ..... 80.396366: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475 bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9269 [003] ..... 80.396411: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6 bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial for performance. Please let me know if there are any issues or if I've misunderstood anything. This patchset also moves mem_cgroup_init() ahead of cgroup_init(), because cgroup_init() allocates root_mem_cgroup while initcalls only run after cgroup_init(); if the kmem_caches were not prepared by then, every user would need to test them for NULL before use. This patch (of 3): When cgroup_init() creates root_mem_cgroup through the css_alloc callback, some critical resources might not be fully initialized, forcing later operations to perform conditional checks for resource availability. This patch moves mem_cgroup_init() to address the init order: it is invoked before cgroup_init(), so, compared to a subsys_initcall, it can be used to prepare key resources before root_mem_cgroup is allocated. Link: https://lkml.kernel.org/r/aAsRCj-niMMTtmK8@casper.infradead.org [1] Link: https://lkml.kernel.org/r/20250425031935.76411-1-link@vivo.com Link: https://lkml.kernel.org/r/20250425031935.76411-2-link@vivo.com Signed-off-by: Huan Yang Suggested-by: Shakeel Butt Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Francesco Valla Cc: guoweikang Cc: Huang Shijie Cc: KP Singh Cc: Michal Hocko Cc: Muchun Song Cc: "Paul E . 
McKenney" Cc: Petr Mladek Cc: Rasmus Villemoes Cc: Raul E Rangel Cc: Roman Gushchin Cc: "Uladzislau Rezki (Sony)" Cc: Vlastimil Babka Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 3 +++ init/main.c | 2 ++ mm/memcontrol.c | 5 ++--- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ac1b003ee5b8..9ed75f82b858 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1057,6 +1057,7 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm) return id; } +extern int mem_cgroup_init(void); #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1472,6 +1473,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm) { return 0; } + +static inline int mem_cgroup_init(void) { return 0; } #endif /* CONFIG_MEMCG */ /* diff --git a/init/main.c b/init/main.c index 7f0a2a3dbd29..782320da2e88 100644 --- a/init/main.c +++ b/init/main.c @@ -50,6 +50,7 @@ #include #include #include +#include #include #include #include @@ -1087,6 +1088,7 @@ void start_kernel(void) nsfs_init(); pidfs_init(); cpuset_init(); + mem_cgroup_init(); cgroup_init(); taskstats_init_early(); delayacct_init(); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e17e0a9ceee0..dcb07a52b4ed 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5042,14 +5042,14 @@ static int __init cgroup_memory(char *s) __setup("cgroup.memory=", cgroup_memory); /* - * subsys_initcall() for memory controller. + * Memory controller init before cgroup_init() initialize root_mem_cgroup. * * Some parts like memcg_hotplug_cpu_dead() have to be initialized from this * context because of lock dependencies (cgroup_lock -> cpu hotplug) but * basically everything that doesn't depend on a specific mem_cgroup structure * should be initialized from here. */ -static int __init mem_cgroup_init(void) +int __init mem_cgroup_init(void) { int cpu; @@ -5070,7 +5070,6 @@ static int __init mem_cgroup_init(void) return 0; } -subsys_initcall(mem_cgroup_init); #ifdef CONFIG_SWAP /** From 97e4fc4b35dc1b98f28671fda35bd37d4f401bca Mon Sep 17 00:00:00 2001 From: Huan Yang Date: Fri, 25 Apr 2025 11:19:24 +0800 Subject: [PATCH 159/285] mm/memcg: use kmem_cache when alloc memcg When tracing mem_cgroup_alloc() with kmalloc ftrace, we observe: kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false The output indicates that while allocating mem_cgroup struct (2312 bytes), the slab allocator actually provides 4096-byte chunks. This occurs because: 1. The slab allocator predefines bucket sizes from 64B to 8096B 2. The mem_cgroup allocation size (2312B) falls between the 2KB and 4KB slabs 3. The allocator rounds up to the nearest larger slab (4KB), resulting in ~1KB wasted memory per allocation This patch introduces a dedicated kmem_cache for mem_cgroup structs, achieving precise memory allocation. Post-patch ftrace verification shows: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false Each memcg alloc offer 2368bytes(include hw cacheline align), compare to 4096, avoid waste. Link: https://lkml.kernel.org/r/20250425031935.76411-3-link@vivo.com Signed-off-by: Huan Yang Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Francesco Valla Cc: guoweikang Cc: Huang Shijie Cc: KP Singh Cc: Michal Hocko Cc: Muchun Song Cc: "Paul E . 
McKenney" Cc: Petr Mladek Cc: Rasmus Villemoes Cc: Raul E Rangel Cc: Roman Gushchin Cc: "Uladzislau Rezki (Sony)" Cc: Vlastimil Babka Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/memcontrol.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dcb07a52b4ed..6d713fe10221 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -96,6 +96,8 @@ static bool cgroup_memory_nokmem __ro_after_init; /* BPF memory accounting disabled? */ static bool cgroup_memory_nobpf __ro_after_init; +static struct kmem_cache *memcg_cachep; + #ifdef CONFIG_CGROUP_WRITEBACK static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); #endif @@ -3665,7 +3667,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) int __maybe_unused i; long error; - memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL); + memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL); if (!memcg) return ERR_PTR(-ENOMEM); @@ -5051,6 +5053,7 @@ __setup("cgroup.memory=", cgroup_memory); */ int __init mem_cgroup_init(void) { + unsigned int memcg_size; int cpu; /* @@ -5068,6 +5071,10 @@ int __init mem_cgroup_init(void) INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, drain_local_stock); + memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids); + memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, + SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL); + return 0; } From 1b6a58e205ed0bbeeeca46388f0649f322b04f06 Mon Sep 17 00:00:00 2001 From: Huan Yang Date: Fri, 25 Apr 2025 11:19:25 +0800 Subject: [PATCH 160/285] mm/memcg: use kmem_cache when alloc memcg pernode info When tracing mem_cgroup_per_node allocations with kmalloc ftrace: kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false This reveals the slab allocator provides 4096B chunks for 2896B mem_cgroup_per_node due to: 1. The slab allocator predefines bucket sizes from 64B to 8096B 2. The mem_cgroup allocation size (2312B) falls between the 2KB and 4KB slabs 3. The allocator rounds up to the nearest larger slab (4KB), resulting in ~1KB wasted memory per memcg alloc - per node. This patch introduces a dedicated kmem_cache for mem_cgroup structs, achieving precise memory allocation. Post-patch ftrace verification shows: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false Each mem_cgroup_per_node alloc 2944bytes(include hw cacheline align), compare to 4096, it avoid waste. Link: https://lkml.kernel.org/r/20250425031935.76411-4-link@vivo.com Signed-off-by: Huan Yang Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Francesco Valla Cc: guoweikang Cc: Huang Shijie Cc: KP Singh Cc: Michal Hocko Cc: Muchun Song Cc: "Paul E . 
McKenney" Cc: Petr Mladek Cc: Rasmus Villemoes Cc: Raul E Rangel Cc: Roman Gushchin Cc: "Uladzislau Rezki (Sony)" Cc: Vlastimil Babka Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/memcontrol.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6d713fe10221..8ed265852423 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -97,6 +97,7 @@ static bool cgroup_memory_nokmem __ro_after_init; static bool cgroup_memory_nobpf __ro_after_init; static struct kmem_cache *memcg_cachep; +static struct kmem_cache *memcg_pn_cachep; #ifdef CONFIG_CGROUP_WRITEBACK static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); @@ -3614,7 +3615,8 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) { struct mem_cgroup_per_node *pn; - pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node); + pn = kmem_cache_alloc_node(memcg_pn_cachep, GFP_KERNEL | __GFP_ZERO, + node); if (!pn) return false; @@ -5075,6 +5077,9 @@ int __init mem_cgroup_init(void) memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL); + memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node, + SLAB_PANIC | SLAB_HWCACHE_ALIGN); + return 0; } From 8d88b0769e256c238f80615f08fc5b7aebc29439 Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Wed, 2 Apr 2025 20:56:13 +0000 Subject: [PATCH 161/285] mm/hugetlb: use separate nodemask for bootmem allocations Hugetlb boot allocation has used online nodes for allocation since commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation"). This was needed to be able to do the allocations earlier in boot, before N_MEMORY was set. This might lead to a different distribution of gigantic hugepages across NUMA nodes if there are memoryless nodes in the system. What happens is that the memoryless nodes are tried, but then the memblock allocation fails and falls back, which usually means that the node that has the highest physical address available will be used (top-down allocation). While this will end up getting the same number of hugetlb pages, they might not be be distributed the same way. The fallback for each memoryless node might not end up coming from the same node as the successful round-robin allocation from N_MEMORY nodes. While administrators that rely on having a specific number of hugepages per node should use the hugepages=N:X syntax, it's better not to change the old behavior for the plain hugepages=N case. To do this, construct a nodemask for hugetlb bootmem purposes only, containing nodes that have memory. Then use that for round-robin bootmem allocations. This saves some cycles, and the added advantage here is that hugetlb_cma can use it too, avoiding the older issue of pointless attempts to create a CMA area for memoryless nodes (which will also cause the per-node CMA area size to be too small). 
Link: https://lkml.kernel.org/r/20250402205613.3086864-1-fvdl@google.com Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation") Signed-off-by: Frank van der Linden Reviewed-by: Oscar Salvador Reviewed-by: Luiz Capitulino Cc: David Hildenbrand Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 3 +++ mm/hugetlb.c | 30 ++++++++++++++++++++++++++++-- mm/hugetlb_cma.c | 11 +++++++---- 3 files changed, 38 insertions(+), 6 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a57bed83c657..23ebf49c5d6a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -14,6 +14,7 @@ #include #include #include +#include struct ctl_table; struct user_struct; @@ -176,6 +177,8 @@ extern struct list_head huge_boot_pages[MAX_NUMNODES]; void hugetlb_bootmem_alloc(void); bool hugetlb_bootmem_allocated(void); +extern nodemask_t hugetlb_bootmem_nodes; +void hugetlb_bootmem_set_nodes(void); /* arch callbacks */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 351254ad6ef8..0057d1f1dc9a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -58,6 +58,7 @@ int hugetlb_max_hstate __read_mostly; unsigned int default_hstate_idx; struct hstate hstates[HUGE_MAX_HSTATE]; +__initdata nodemask_t hugetlb_bootmem_nodes; __initdata struct list_head huge_boot_pages[MAX_NUMNODES]; static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata; @@ -3219,7 +3220,8 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid) } /* allocate from next node when distributing huge pages */ - for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) { + for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, + &hugetlb_bootmem_nodes) { m = alloc_bootmem(h, node, false); if (!m) return 0; @@ -3683,6 +3685,15 @@ static void __init hugetlb_init_hstates(void) struct hstate *h, *h2; for_each_hstate(h) { + /* + * Always reset to first_memory_node here, even if + * next_nid_to_alloc was set before - we can't + * reference hugetlb_bootmem_nodes after init, and + * first_memory_node is right for all further allocations. 
+ */ + h->next_nid_to_alloc = first_memory_node; + h->next_nid_to_free = first_memory_node; + /* oversize hugepages were init'ed in early boot */ if (!hstate_is_gigantic(h)) hugetlb_hstate_alloc_pages(h); @@ -4995,6 +5006,20 @@ static int __init default_hugepagesz_setup(char *s) } hugetlb_early_param("default_hugepagesz", default_hugepagesz_setup); +void __init hugetlb_bootmem_set_nodes(void) +{ + int i, nid; + unsigned long start_pfn, end_pfn; + + if (!nodes_empty(hugetlb_bootmem_nodes)) + return; + + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { + if (end_pfn > start_pfn) + node_set(nid, hugetlb_bootmem_nodes); + } +} + static bool __hugetlb_bootmem_allocated __initdata; bool __init hugetlb_bootmem_allocated(void) @@ -5010,6 +5035,8 @@ void __init hugetlb_bootmem_alloc(void) if (__hugetlb_bootmem_allocated) return; + hugetlb_bootmem_set_nodes(); + for (i = 0; i < MAX_NUMNODES; i++) INIT_LIST_HEAD(&huge_boot_pages[i]); @@ -5017,7 +5044,6 @@ void __init hugetlb_bootmem_alloc(void) for_each_hstate(h) { h->next_nid_to_alloc = first_online_node; - h->next_nid_to_free = first_online_node; if (hstate_is_gigantic(h)) hugetlb_hstate_alloc_pages(h); diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c index e0f2d5c3a84c..f58ef4969e7a 100644 --- a/mm/hugetlb_cma.c +++ b/mm/hugetlb_cma.c @@ -66,7 +66,7 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact) if (node_exact) return NULL; - for_each_online_node(node) { + for_each_node_mask(node, hugetlb_bootmem_nodes) { cma = hugetlb_cma[node]; if (!cma || node == *nid) continue; @@ -153,11 +153,13 @@ void __init hugetlb_cma_reserve(int order) if (!hugetlb_cma_size) return; + hugetlb_bootmem_set_nodes(); + for (nid = 0; nid < MAX_NUMNODES; nid++) { if (hugetlb_cma_size_in_node[nid] == 0) continue; - if (!node_online(nid)) { + if (!node_isset(nid, hugetlb_bootmem_nodes)) { pr_warn("hugetlb_cma: invalid node %d specified\n", nid); hugetlb_cma_size -= hugetlb_cma_size_in_node[nid]; hugetlb_cma_size_in_node[nid] = 0; @@ -190,13 +192,14 @@ void __init hugetlb_cma_reserve(int order) * If 3 GB area is requested on a machine with 4 numa nodes, * let's allocate 1 GB on first three nodes and ignore the last one. */ - per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes); + per_node = DIV_ROUND_UP(hugetlb_cma_size, + nodes_weight(hugetlb_bootmem_nodes)); pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n", hugetlb_cma_size / SZ_1M, per_node / SZ_1M); } reserved = 0; - for_each_online_node(nid) { + for_each_node_mask(nid, hugetlb_bootmem_nodes) { int res; char name[CMA_MAX_NAME]; From c8e6002bd611c68dab892565039b60c7ad5b8d6a Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Sat, 19 Apr 2025 11:35:45 -0700 Subject: [PATCH 162/285] memcg: introduce non-blocking limit setting option MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Setting the max and high limits can trigger synchronous reclaim and/or oom-kill if the usage is higher than the given limit. This behavior is fine for newly created cgroups but it can cause issues for the node controller while setting limits for existing cgroups. In our production multi-tenant and overcommitted environment, we are seeing priority inversion when the node controller dynamically adjusts the limits of running jobs of different priorities. Based on the system situation, the node controller may reduce the limits of lower priority jobs and increase the limits of higher priority jobs. 
However, we are seeing the node controller getting stuck for long periods of time while reclaiming from lower priority jobs when setting their limits, and it also spends a lot of its own CPU. One of the workarounds we are trying is to fork a new process which sets the limit of the lower priority job, along with setting an alarm to get itself killed if it gets stuck in the reclaim for the lower priority job. However, we are finding this very unreliable and costly. Either we need a large enough time buffer for the alarm to be delivered after setting the limit, and potentially spend a lot of CPU in the reclaim, or we are unreliable in setting the limit with much shorter but cheaper (less reclaim) alarms. Let's introduce a new limit setting option which does not trigger reclaim and/or oom-kill, and lets the processes in the target cgroup trigger reclaim and/or throttling and/or oom-kill on their next charge request. This will make the node controller in multi-tenant overcommitted environments much more reliable. Explanation from Johannes on side-effects of O_NONBLOCK limit change: It's usually the allocating tasks inside the group bearing the cost of limit enforcement and reclaim. This allows a (privileged) updater from outside the group to keep that cost in there - instead of having to help, from a context that doesn't necessarily make sense. I suppose the tradeoff with that - and the reason why this was doing sync reclaim in the first place - is that, if the group is idle and not trying to allocate more, it can take indefinitely for the new limit to actually be met. It should be okay in most scenarios in practice. As the capacity is reallocated from group A to B, B will exert pressure on A once it tries to claim it and thereby shrink it down. If A is idle, that shouldn't be hard. If A is running, it's likely to fault/allocate soon-ish and then join the effort. It does leave a (malicious) corner case where A is just busy-hitting its memory to interfere with the clawback. This is comparable to reclaiming memory.low overage from the outside, though, which is an acceptable risk. Users of O_NONBLOCK just need to be aware. Link: https://lkml.kernel.org/r/20250419183545.1982187-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Greg Thelen Cc: Michal Koutný Cc: Muchun Song Cc: Tejun Heo Cc: Christian Brauner Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 14 ++++++++++++++ mm/memcontrol.c | 10 ++++++++-- 2 files changed, 22 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 1a16ce68a4d7..b34f1dd969e0 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1299,6 +1299,13 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1316,6 +1323,13 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. 
+ If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8ed265852423..d3b6f50e00d4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4269,6 +4269,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, page_counter_set_high(&memcg->memory, high); + if (of->file->f_flags & O_NONBLOCK) + goto out; + for (;;) { unsigned long nr_pages = page_counter_read(&memcg->memory); unsigned long reclaimed; @@ -4291,7 +4294,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, if (!reclaimed && !nr_retries--) break; } - +out: memcg_wb_domain_size_changed(memcg); return nbytes; } @@ -4318,6 +4321,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, xchg(&memcg->memory.max, max); + if (of->file->f_flags & O_NONBLOCK) + goto out; + for (;;) { unsigned long nr_pages = page_counter_read(&memcg->memory); @@ -4345,7 +4351,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, break; cond_resched(); } - +out: memcg_wb_domain_size_changed(memcg); return nbytes; } From c6c895cf2d323373a66d61c6acc331431de2ae5a Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 6 May 2025 16:28:33 -0700 Subject: [PATCH 163/285] memcg-introduce-non-blocking-limit-setting-option-v3 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit add more explanation in doc and commit message on O_NONBLOCK side-effects (Johannes) Link: https://lkml.kernel.org/r/20250506232833.3109790-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Johannes Weiner Cc: Greg Thelen Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Tejun Heo Cc: Yosry Ahmed Cc: Christian Brauner Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 34 ++++++++++++++++--------- 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index b34f1dd969e0..d537f3d3fed9 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1299,12 +1299,17 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. - If memory.high is opened with O_NONBLOCK then the synchronous - reclaim is bypassed. This is useful for admin processes that - need to dynamically adjust the job's memory limits without - expending their own CPU resources on memory reclamation. The - job will trigger the reclaim and/or get throttled on its - next charge request. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. 
 memory.max A read-write single value file which exists on non-root @@ -1323,12 +1328,17 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. - If memory.max is opened with O_NONBLOCK, then the synchronous - reclaim and oom-kill are bypassed. This is useful for admin - processes that need to dynamically adjust the job's memory limits - without expending their own CPU resources on memory reclamation. - The job will trigger the reclaim and/or oom-kill on its next - charge request. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. memory.reclaim A write-only nested-keyed file which exists for all cgroups. From 68a1436bde00b40327efe27926076b53088775a3 Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Mon, 21 Apr 2025 17:13:28 +0800 Subject: [PATCH 164/285] mm: add swappiness=max arg to memory.reclaim for only anon reclaim Patch series "add max arg to swappiness in memory.reclaim and lru_gen", v4. This patchset adds a max arg to swappiness in memory.reclaim and lru_gen for anon-only proactive memory reclaim. With commit <68cd9050d871> ("mm: add swappiness= arg to memory.reclaim") we can submit an additional swappiness= argument to memory.reclaim. It is very useful because we can dynamically adjust the reclamation ratio based on the anonymous folios and file folios of each cgroup. For example, when swappiness is set to 0, we only reclaim from file folios. But we cannot reclaim memory just from anon folios. This patchset introduces a new macro, SWAPPINESS_ANON_ONLY, defined as MAX_SWAPPINESS + 1, to represent the max arg semantics. It specifically indicates that reclamation should occur only from anonymous pages. Patch 1 adds the swappiness=max arg to memory.reclaim (suggested by Yosry Ahmed). Patch 2 adds more comments for cache_trim_mode, from Johannes Weiner in [1]. Patch 3 adds the max arg to lru_gen for proactive memory reclaim in MGLRU. The MGLRU already supports reclaiming exclusively from anonymous pages. This patch formalizes that behavior by introducing a max parameter to represent the corresponding semantics. Patch 4 uses SWAPPINESS_ANON_ONLY in MGLRU. Using SWAPPINESS_ANON_ONLY instead of MAX_SWAPPINESS + 1 to indicate reclaiming only from anonymous pages makes the code more readable and explicit. Here is the previous discussion: https://lore.kernel.org/all/20250314033350.1156370-1-hezhongkun.hzk@bytedance.com/ https://lore.kernel.org/all/20250312094337.2296278-1-hezhongkun.hzk@bytedance.com/ https://lore.kernel.org/all/20250318135330.3358345-1-hezhongkun.hzk@bytedance.com/ This patch (of 4): With commit <68cd9050d871> ("mm: add swappiness= arg to memory.reclaim") we can submit an additional swappiness= argument to memory.reclaim. It is very useful because we can dynamically adjust the reclamation ratio based on the anonymous folios and file folios of each cgroup. For example, when swappiness is set to 0, we only reclaim from file folios. 
However, we have also encountered a new issue: when swappiness is set to MAX_SWAPPINESS, it may still only reclaim file folios. So, we hope to add a new arg 'swappiness=max' in memory.reclaim where proactive memory reclaim only reclaims from anonymous folios when swappiness is set to max. The swappiness semantics from a user perspective remain unchanged. For example, something like this: echo "2M swappiness=max" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 'max' (a new mode) regardless of the file folios. Users have a more comprehensive view of the application's memory distribution because there are many metrics available. For example, if we find that a certain cgroup has a large number of inactive anon folios, we can reclaim only those and skip file folios, because with zram/zswap the IO tradeoff that cache_trim_mode or other file-first logic is making doesn't hold - file refaults will cause IO, whereas anon decompression will not. With this patch, the swappiness argument of memory.reclaim has a new mode 'max', which means reclaiming just from anonymous folios, both in the traditional LRU and in MGLRU. Link: https://lkml.kernel.org/r/cover.1745225696.git.hezhongkun.hzk@bytedance.com Link: https://lore.kernel.org/all/20250314141833.GA1316033@cmpxchg.org/ [1] Link: https://lkml.kernel.org/r/519e12b9b1f8c31a01e228c8b4b91a2419684f77.1745225696.git.hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Suggested-by: Yosry Ahmed Acked-by: Muchun Song Cc: Johannes Weiner Cc: Michal Hocko Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 3 +++ include/linux/swap.h | 4 ++++ mm/memcontrol.c | 5 +++++ mm/vmscan.c | 7 +++++++ 4 files changed, 19 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index d537f3d3fed9..acf855851c03 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1372,6 +1372,9 @@ The following nested keys are defined. same semantics as vm.swappiness applied to memcg reclaim with all the existing limitations and potential future extensions. + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak A read-write single value file which exists on non-root cgroups. 
diff --git a/include/linux/swap.h b/include/linux/swap.h index 4e4e27d3ce3d..bc0e1c275fc0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -414,6 +414,10 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, #define MEMCG_RECLAIM_PROACTIVE (1 << 2) #define MIN_SWAPPINESS 0 #define MAX_SWAPPINESS 200 + +/* Just reclaim from anon folios in proactive memory reclaim */ +#define SWAPPINESS_ANON_ONLY (MAX_SWAPPINESS + 1) + extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d3b6f50e00d4..4108ff00124b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4474,11 +4474,13 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, enum { MEMORY_RECLAIM_SWAPPINESS = 0, + MEMORY_RECLAIM_SWAPPINESS_MAX, MEMORY_RECLAIM_NULL, }; static const match_table_t tokens = { { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, + { MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"}, { MEMORY_RECLAIM_NULL, NULL }, }; @@ -4512,6 +4514,9 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) return -EINVAL; break; + case MEMORY_RECLAIM_SWAPPINESS_MAX: + swappiness = SWAPPINESS_ANON_ONLY; + break; default: return -EINVAL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index a4fbd52a82d4..495889f621b5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2509,6 +2509,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, goto out; } + /* Proactive reclaim initiated by userspace for anonymous memory only */ + if (swappiness == SWAPPINESS_ANON_ONLY) { + WARN_ON_ONCE(!sc->proactive); + scan_balance = SCAN_ANON; + goto out; + } + /* * Do not apply any pressure balancing cleverness when the * system is close to OOM, scan both anon and file equally From aded729f64d3eb1543abc7e6a3a20069716efb7d Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Mon, 21 Apr 2025 17:13:29 +0800 Subject: [PATCH 165/285] mm: vmscan: add more comments about cache_trim_mode Add more comments for cache_trim_mode, and the annotations provided by Johannes Weiner in [1]. Link: https://lore.kernel.org/all/20250314141833.GA1316033@cmpxchg.org/ Link: https://lkml.kernel.org/r/4baad87ba637f1e6f666e9b99b3fdcb7ab39171b.1745225696.git.hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Yosry Ahmed Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 495889f621b5..ca786575b6c2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2536,7 +2536,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, /* * If there is enough inactive page cache, we do not reclaim - * anything from the anonymous working right now. + * anything from the anonymous working right now to make sure + * a streaming file access pattern doesn't cause swapping. */ if (sc->cache_trim_mode) { scan_balance = SCAN_FILE; From b40599930f001b651cc3e1863e268dd35cbbf806 Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Mon, 21 Apr 2025 17:13:30 +0800 Subject: [PATCH 166/285] mm: add max swappiness arg to lru_gen for anonymous memory only The MGLRU already supports reclaiming only from anonymous memory via the /sys/kernel/debug/lru_gen interface. Now, memory.reclaim also supports the swappiness=max parameter to enable reclaiming solely from anonymous memory. 
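For illustration, an anon-only eviction request through the debugfs interface could then look like this (the memcg ID, node ID, generation number, and page count below are made-up values; the command format follows the multigen_lru documentation):

	echo "- 5 0 20 max 1024" > /sys/kernel/debug/lru_gen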
To unify the semantics of proactive reclaiming from anonymous folios, the max parameter is introduced. [hezhongkun.hzk@bytedance.com: use strcmp instead of strncmp, if swappiness is not set, use the default value] Link: https://lkml.kernel.org/r/20250507071057.3184240-1-hezhongkun.hzk@bytedance.com [akpm@linux-foundation.org: tweak coding style] Link: https://lkml.kernel.org/r/65181f7745d657d664d833c26d8a94cae40538b9.1745225696.git.hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Acked-by: Muchun Song Cc: Johannes Weiner Cc: Michal Hocko Cc: Yosry Ahmed Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/multigen_lru.rst | 5 +++-- mm/vmscan.c | 19 +++++++++++++++---- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/mm/vmscan.c b/mm/vmscan.c index ca786575b6c2..d1ad528186dd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -5588,24 +5588,35 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, while ((cur = strsep(&next, ",;\n"))) { int n; int end; - char cmd; + char cmd, swap_string[5]; unsigned int memcg_id; unsigned int nid; unsigned long seq; - unsigned int swappiness = -1; + unsigned int swappiness; unsigned long opt = -1; cur = skip_spaces(cur); if (!*cur) continue; - n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid, - &seq, &end, &swappiness, &end, &opt, &end); + n = sscanf(cur, "%c %u %u %lu %n %4s %n %lu %n", &cmd, &memcg_id, &nid, + &seq, &end, swap_string, &end, &opt, &end); if (n < 4 || cur[end]) { err = -EINVAL; break; } + if (n == 4) { + swappiness = -1; + } else if (!strcmp("max", swap_string)) { + /* set by userspace for anonymous memory only */ + swappiness = SWAPPINESS_ANON_ONLY; + } else { + err = kstrtouint(swap_string, 0, &swappiness); + if (err) + break; + } + err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt); if (err) break; From a73dbc851cbc5aed70477cfbfbdae7ea2d3fcea4 Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Mon, 21 Apr 2025 17:13:31 +0800 Subject: [PATCH 167/285] mm: use SWAPPINESS_ANON_ONLY in MGLRU Using SWAPPINESS_ANON_ONLY instead of MAX_SWAPPINESS + 1 to indicate reclaiming only from anonymous pages makes the code more readable and explicit. Add some comment in the SWAPPINESS_ANON_ONLY context. 
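As a quick sanity check, expanding the two helper macros from the diff below for the interesting swappiness values gives (derived directly from their definitions, with type 0 being LRU_GEN_ANON and type 1 LRU_GEN_FILE):

	/* min_type(s) = !s, max_type(s) = s < SWAPPINESS_ANON_ONLY */
	/* s == 0:                         types 1..1 -> file only     */
	/* s in 1..200:                    types 0..1 -> anon and file */
	/* s == 201 (SWAPPINESS_ANON_ONLY): types 0..0 -> anon only    */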
Link: https://lkml.kernel.org/r/529db7ae6098ee712b81e4df290622e4e64ac50c.1745225696.git.hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Yosry Ahmed Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/vmscan.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index d1ad528186dd..7d6d1ce3921e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2705,8 +2705,12 @@ static bool should_clear_pmd_young(void) READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ } +/* Get the min/max evictable type based on swappiness */ +#define min_type(swappiness) (!(swappiness)) +#define max_type(swappiness) ((swappiness) < SWAPPINESS_ANON_ONLY) + #define evictable_min_seq(min_seq, swappiness) \ - min((min_seq)[!(swappiness)], (min_seq)[(swappiness) <= MAX_SWAPPINESS]) + min((min_seq)[min_type(swappiness)], (min_seq)[max_type(swappiness)]) #define for_each_gen_type_zone(gen, type, zone) \ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ @@ -2714,7 +2718,7 @@ static bool should_clear_pmd_young(void) for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) #define for_each_evictable_type(type, swappiness) \ - for ((type) = !(swappiness); (type) <= ((swappiness) <= MAX_SWAPPINESS); (type)++) + for ((type) = min_type(swappiness); (type) <= max_type(swappiness); (type)++) #define get_memcg_gen(seq) ((seq) % MEMCG_NR_GENS) #define get_memcg_bin(bin) ((bin) % MEMCG_NR_BINS) @@ -3865,7 +3869,12 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness) int hist = lru_hist_from_seq(lrugen->min_seq[type]); int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); - if (type ? swappiness > MAX_SWAPPINESS : !swappiness) + /* For file type, skip the check if swappiness is anon only */ + if (type && (swappiness == SWAPPINESS_ANON_ONLY)) + goto done; + + /* For anon type, skip the check if swappiness is zero (file only) */ + if (!type && !swappiness) goto done; /* prevent cold/hot inversion if the type is evictable */ @@ -5531,7 +5540,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, if (swappiness < MIN_SWAPPINESS) swappiness = get_swappiness(lruvec, sc); - else if (swappiness > MAX_SWAPPINESS + 1) + else if (swappiness > SWAPPINESS_ANON_ONLY) goto done; switch (cmd) { From f04cc63dc7d0b7b5ce71477584377c99ba2ba4bf Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Mon, 21 Apr 2025 16:57:28 +0800 Subject: [PATCH 168/285] mm/rmap: rename page__anon_vma to anon_vma for consistency Patch series "mm: minor cleanups in rmap", v3. Minor cleanups in mm/rmap.c: - Rename a local variable for consistency - Fix a typo in a comment No functional changes. This patch (of 2): Rename local variable page__anon_vma in page_address_in_vma() to anon_vma. The previous naming convention of using double underscores (__) is unnecessary and inconsistent with typical kernel style, which uses single underscores to denote local variables. Also updated comments to reflect the new variable name. Functionality unchanged. 
Link: https://lkml.kernel.org/r/20250421085729.127845-1-ye.liu@linux.dev Link: https://lkml.kernel.org/r/20250421085729.127845-2-ye.liu@linux.dev Signed-off-by: Ye Liu Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Harry Yoo Cc: Liam Howlett Cc: Liu Ye Cc: Miaohe Lin Cc: Naoya Horiguchi Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/rmap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 4992005885ef..11f3c0fc769f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -789,13 +789,13 @@ unsigned long page_address_in_vma(const struct folio *folio, const struct page *page, const struct vm_area_struct *vma) { if (folio_test_anon(folio)) { - struct anon_vma *page__anon_vma = folio_anon_vma(folio); + struct anon_vma *anon_vma = folio_anon_vma(folio); /* * Note: swapoff's unuse_vma() is more efficient with this * check, and needs it to match anon_vma when KSM is active. */ - if (!vma->anon_vma || !page__anon_vma || - vma->anon_vma->root != page__anon_vma->root) + if (!vma->anon_vma || !anon_vma || + vma->anon_vma->root != anon_vma->root) return -EFAULT; } else if (!vma->vm_file) { return -EFAULT; @@ -803,7 +803,7 @@ unsigned long page_address_in_vma(const struct folio *folio, return -EFAULT; } - /* KSM folios don't reach here because of the !page__anon_vma check */ + /* KSM folios don't reach here because of the !anon_vma check */ return vma_address(vma, page_pgoff(folio, page), 1); } From 0ca954046c9397c652d568a87a3d4646bd3b73db Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Mon, 21 Apr 2025 16:57:29 +0800 Subject: [PATCH 169/285] mm/rmap: fix typo in comment in page_address_in_vma MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix a minor typo in the comment above page_address_in_vma(): "responsibililty" → "responsibility" Link: https://lkml.kernel.org/r/20250421085729.127845-3-ye.liu@linux.dev Signed-off-by: Ye Liu Reviewed-by: Lorenzo Stoakes Reviewed-by: Harry Yoo Cc: David Hildenbrand Cc: Liam Howlett Cc: Miaohe Lin Cc: Naoya Horiguchi Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index 11f3c0fc769f..fb63d9256f09 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -774,7 +774,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) * @vma: The VMA we need to know the address in. * * Calculates the user virtual address of this page in the specified VMA. - * It is the caller's responsibililty to check the page is actually + * It is the caller's responsibility to check the page is actually * within the VMA. There may not currently be a PTE pointing at this * page, but if a page fault occurs at this address, this is the page * which will be accessed. From a3365bdca2200e24d815b19382d3f05deb8567ab Mon Sep 17 00:00:00 2001 From: Cheng-Han Wu Date: Sun, 27 Apr 2025 22:50:04 +0800 Subject: [PATCH 170/285] mm: remove unused macro INIT_PASID The macro INIT_PASID was originally used by mm_init_pasid. However, since commit a6cbd44093ef ("kernel/fork: Initialize mm's PASID"), mm_init_pasid has been removed. Therefore, INIT_PASID is no longer needed and is removed. 
Link: https://lkml.kernel.org/r/20250427145004.13049-1-hank20010209@gmail.com Signed-off-by: Cheng-Han Wu Reviewed-by: Anshuman Khandual Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f9..e76bade9ebb1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -28,7 +28,6 @@ #endif #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) -#define INIT_PASID 0 struct address_space; struct mem_cgroup; From d48e8d27cd61d8485ef2c1187f3048de77af2d11 Mon Sep 17 00:00:00 2001 From: Siddarth G Date: Sun, 27 Apr 2025 15:56:39 +0530 Subject: [PATCH 171/285] selftests/mm: use long for dwRegionSize Change the type of 'dwRegionSize' in wp_init() and wp_free() from int to long to match callers that pass long or unsigned long long values. wp_addr_range function is left unchanged because it passes 'dwRegionSize' parameter directly to pagemap_ioctl, which expects an int. This patch does not fix any actual known issues. It aligns parameter types with their actual usage and avoids any potential future issues. Link: https://lkml.kernel.org/r/20250427102639.39978-1-siddarthsgml@gmail.com Signed-off-by: Siddarth G Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/pagemap_ioctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index fe5ae8b25ff6..b07acc86f4f0 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -112,7 +112,7 @@ int init_uffd(void) return 0; } -int wp_init(void *lpBaseAddress, int dwRegionSize) +int wp_init(void *lpBaseAddress, long dwRegionSize) { struct uffdio_register uffdio_register; struct uffdio_writeprotect wp; @@ -136,7 +136,7 @@ int wp_init(void *lpBaseAddress, int dwRegionSize) return 0; } -int wp_free(void *lpBaseAddress, int dwRegionSize) +int wp_free(void *lpBaseAddress, long dwRegionSize) { struct uffdio_register uffdio_register; From b94bff767f77bd726e61797bbc1cf4b8cff264ae Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Sun, 27 Apr 2025 18:04:40 +0800 Subject: [PATCH 172/285] mm/io-mapping: precompute remap protection flags for clarity Patch series "mm: small cleanups for io-mapping, debug_page_alloc and numa". This series includes three small cleanups to mm/: - io-mapping: simplify remap protection flag calculation - debug_page_alloc: improve error message by printing invalid input - numa: remove unnecessary variable for clarity No functional changes. This patch (of 3): In io_mapping_map_user(), precompute the page protection flags in a local variable before calling remap_pfn_range_notrack(). No functional change. 
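For orientation, the masking in the hunk below takes only the caching-attribute bits from the io_mapping's pgprot and everything else (permissions and so on) from the VMA's pgprot. A schematic, non-kernel sketch of that combine step; EX_CACHE_MASK is an invented stand-in for the arch-specific _PAGE_CACHE_MASK:

    /* Illustration only: EX_CACHE_MASK is made up, not a real bit layout. */
    #define EX_CACHE_MASK 0x18UL

    static unsigned long mix_prot(unsigned long iomap_prot, unsigned long vma_prot)
    {
        return (iomap_prot & EX_CACHE_MASK) |  /* caching from the io_mapping */
               (vma_prot & ~EX_CACHE_MASK);    /* the rest from the VMA */
    }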
Link: https://lkml.kernel.org/r/20250427100442.958352-1-ye.liu@linux.dev Link: https://lkml.kernel.org/r/20250427100442.958352-2-ye.liu@linux.dev Signed-off-by: Ye Liu Reviewed-by: David Hildenbrand Reviewed-by: Anshuman Khandual Reviewed-by: Mike Rapoport (Microsoft) Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/io-mapping.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/io-mapping.c b/mm/io-mapping.c index 01b362799930..f44a6a134712 100644 --- a/mm/io-mapping.c +++ b/mm/io-mapping.c @@ -21,9 +21,10 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma, if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags)) return -EINVAL; + pgprot_t remap_prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | + (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)); + /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ - return remap_pfn_range_notrack(vma, addr, pfn, size, - __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | - (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK))); + return remap_pfn_range_notrack(vma, addr, pfn, size, remap_prot); } EXPORT_SYMBOL_GPL(io_mapping_map_user); From 4048774ea5af1721e5925b7cc0a74f1efad05c4d Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Sun, 27 Apr 2025 18:04:41 +0800 Subject: [PATCH 173/285] mm/debug_page_alloc: improve error message for invalid guardpage minorder When an invalid debug_guardpage_minorder value is provided, include the user input in the error message. This helps users and developers diagnose configuration issues more easily. No functional change. Link: https://lkml.kernel.org/r/20250427100442.958352-3-ye.liu@linux.dev Signed-off-by: Ye Liu Acked-by: David Hildenbrand Reviewed-by: Anshuman Khandual Acked-by: Mike Rapoport (Microsoft) Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/debug_page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c index d46acf989dde..6a26eca546c3 100644 --- a/mm/debug_page_alloc.c +++ b/mm/debug_page_alloc.c @@ -23,7 +23,7 @@ static int __init debug_guardpage_minorder_setup(char *buf) unsigned long res; if (kstrtoul(buf, 10, &res) < 0 || res > MAX_PAGE_ORDER / 2) { - pr_err("Bad debug_guardpage_minorder value\n"); + pr_err("Bad debug_guardpage_minorder value: %s\n", buf); return 0; } _debug_guardpage_minorder = res; From a4b79af6c74c581f3f559bde8ae580a0545cfc96 Mon Sep 17 00:00:00 2001 From: Ye Liu Date: Sun, 27 Apr 2025 18:04:42 +0800 Subject: [PATCH 174/285] mm/numa: remove unnecessary local variable in alloc_node_data() The temporary local variable 'nd' is redundant. Directly assign the virtual address to node_data[nid] to simplify the code. No functional change. Link: https://lkml.kernel.org/r/20250427100442.958352-4-ye.liu@linux.dev Signed-off-by: Ye Liu Acked-by: David Hildenbrand Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Anshuman Khandual Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Rik van Riel Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/numa.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/numa.c b/mm/numa.c index f1787d7713a6..7d5e06fe5bd4 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -13,7 +13,6 @@ void __init alloc_node_data(int nid) { const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES); u64 nd_pa; - void *nd; int tnid; /* Allocate node data. 
Try node-local memory and then any node. */ @@ -21,7 +20,6 @@ void __init alloc_node_data(int nid) if (!nd_pa) panic("Cannot allocate %zu bytes for node %d data\n", nd_size, nid); - nd = __va(nd_pa); /* report and initialize */ pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid, @@ -30,7 +28,7 @@ void __init alloc_node_data(int nid) if (tnid != nid) pr_info(" NODE_DATA(%d) on node %d\n", nid, tnid); - node_data[nid] = nd; + node_data[nid] = __va(nd_pa); memset(NODE_DATA(nid), 0, sizeof(pg_data_t)); } From 50dbe531291abfa4513c1283d66fad420c1fd299 Mon Sep 17 00:00:00 2001 From: Fan Ni Date: Thu, 24 Apr 2025 17:16:51 -0700 Subject: [PATCH 175/285] khugepaged: pass folio instead of head page to trace events The trace functions trace_mm_collapse_huge_page_isolate() and trace_mm_khugepaged_scan_pmd() each have a single user, which always passes in the head page of a folio. Refactor both functions to take a folio directly. Link: https://lkml.kernel.org/r/20250425002425.533698-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni Reviewed-by: Nico Pache Reviewed-by: Davidlohr Bueso Reviewed-by: Baolin Wang Reviewed-by: Yang Shi Acked-by: David Hildenbrand Reviewed-by: Matthew Wilcox (Oracle) Cc: Adam Manzanares Cc: Luis Chamberalin Cc: Mariano Pache Cc: "Masami Hiramatsu (Google)" Cc: Steven Rostedt Signed-off-by: Andrew Morton --- include/trace/events/huge_memory.h | 12 ++++++------ mm/khugepaged.c | 6 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 9d5c00b0285c..2305df6cb485 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -55,10 +55,10 @@ SCAN_STATUS TRACE_EVENT(mm_khugepaged_scan_pmd, - TP_PROTO(struct mm_struct *mm, struct page *page, bool writable, + TP_PROTO(struct mm_struct *mm, struct folio *folio, bool writable, int referenced, int none_or_zero, int status, int unmapped), - TP_ARGS(mm, page, writable, referenced, none_or_zero, status, unmapped), + TP_ARGS(mm, folio, writable, referenced, none_or_zero, status, unmapped), TP_STRUCT__entry( __field(struct mm_struct *, mm) @@ -72,7 +72,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, TP_fast_assign( __entry->mm = mm; - __entry->pfn = page ? page_to_pfn(page) : -1; + __entry->pfn = folio ? folio_pfn(folio) : -1; __entry->writable = writable; __entry->referenced = referenced; __entry->none_or_zero = none_or_zero; @@ -116,10 +116,10 @@ TRACE_EVENT(mm_collapse_huge_page, TRACE_EVENT(mm_collapse_huge_page_isolate, - TP_PROTO(struct page *page, int none_or_zero, + TP_PROTO(struct folio *folio, int none_or_zero, int referenced, bool writable, int status), - TP_ARGS(page, none_or_zero, referenced, writable, status), + TP_ARGS(folio, none_or_zero, referenced, writable, status), TP_STRUCT__entry( __field(unsigned long, pfn) @@ -130,7 +130,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate, ), TP_fast_assign( - __entry->pfn = page ? page_to_pfn(page) : -1; + __entry->pfn = folio ? 
folio_pfn(folio) : -1; __entry->none_or_zero = none_or_zero; __entry->referenced = referenced; __entry->writable = writable; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5cf204ab6af0..b04b6a770afe 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -696,13 +696,13 @@ next: result = SCAN_LACK_REFERENCED_PAGE; } else { result = SCAN_SUCCEED; - trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero, + trace_mm_collapse_huge_page_isolate(folio, none_or_zero, referenced, writable, result); return result; } out: release_pte_pages(pte, _pte, compound_pagelist); - trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero, + trace_mm_collapse_huge_page_isolate(folio, none_or_zero, referenced, writable, result); return result; } @@ -1435,7 +1435,7 @@ out_unmap: *mmap_locked = false; } out: - trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced, + trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced, none_or_zero, result, unmapped); return result; } From 4c78cc596bb8d39532f059e0198eeabf370c50f5 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Fri, 9 May 2025 00:46:19 -0700 Subject: [PATCH 176/285] memblock: add MEMBLOCK_RSRV_KERN flag Patch series "kexec: introduce Kexec HandOver (KHO)", v8. Kexec today considers itself purely a boot loader: When we enter the new kernel, any state the previous kernel left behind is irrelevant and the new kernel reinitializes the system. However, there are use cases where this mode of operation is not what we actually want. In virtualization hosts for example, we want to use kexec to update the host kernel while virtual machine memory stays untouched. When we add device assignment to the mix, we also need to ensure that IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we need to do the same for the PCI subsystem. If we want to kexec while an SEV-SNP enabled virtual machine is running, we need to preserve the VM context pages and physical memory. See "pkernfs: Persisting guest memory and kernel/device state safely across kexec" Linux Plumbers Conference 2023 presentation for details: https://lpc.events/event/17/contributions/1485/ To start us on the journey to support all the use cases above, this patch implements basic infrastructure to allow hand over of kernel state across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use memblock's reserve_mem. With this patchset applied, memory that was reserved using "reserve_mem" command line options remains intact after kexec and it is guaranteed to reside at the same physical address. == Alternatives == There are alternative approaches to (parts of) the problems above: * Memory Pools [1] - preallocated persistent memory region + allocator * PRMEM [2] - resizable persistent memory regions with fixed metadata pointer on the kernel command line + allocator * Pkernfs [3] - preallocated file system for in-kernel data with fixed address location on the kernel command line * PKRAM [4] - handover of user space pages using a fixed metadata page specified via command line All of the approaches above fundamentally have the same problem: They require the administrator to explicitly carve out a physical memory location because they have no mechanism outside of the kernel command line to pass data (including memory reservations) between kexec'ing kernels. KHO provides that base foundation. We will determine later whether we still need any of the approaches above for fast bulk memory handover of for example IOMMU page tables. 
But IMHO they would all be users of KHO, with KHO providing the foundational primitive to pass metadata and bulk memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How they pass it is architecture-specific. The file's format is a Flattened Device Tree (fdt) which has a generator and parser already included in Linux. KHO is enabled on the kernel command line by `kho=on`. When the root user enables KHO through /sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every KHO user to register preserved memory regions, which contain drivers' states.

When the actual kexec happens, the fdt is part of the image set that we boot into. In addition, we keep "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed not to contain any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and marks all handed-over memory as in use. When drivers that support KHO initialize, they introspect the fdt, restore preserved memory regions, and retrieve their states stored in the preserved memory.

== Limitations ==

Currently KHO is only implemented for file-based kexec. The kernel interfaces in the patch set are already in place to support user space kexec as well, but it is not yet implemented in kexec-tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line parameter. KHO will automatically create scratch regions. If you want to set the scratch size explicitly, you can use the "kho_scratch=" command line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB per-NUMA-node scratch regions on boot.

Make sure to have a reserved memory range requested with the "reserve_mem" command line option, for example, "reserve_mem=64m:4k:n1". Then, before you invoke file-based "kexec -l", finalize the KHO FDT:

# echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`:

# dtc /sys/kernel/debug/kho/out/fdt
# dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on Ubuntu via `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel:

# kexec -l Image --initrd=initrd -s
# kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem contents at the same physical address as the first kernel. You can also review the FDT passed from the old kernel:

# dtc /sys/kernel/debug/kho/in/fdt
# dtc /sys/kernel/debug/kho/in/sub_fdts/memblock

This patch (of 17):

To denote areas that were reserved for kernel use either directly with memblock_reserve_kern() or via memblock allocations.
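As a usage sketch of the distinction the new flag draws (hypothetical early-boot code: the addresses and sizes are invented; the two reservation calls are the API added below):

    #include <linux/memblock.h>
    #include <linux/sizes.h>

    static void __init example_reservations(void)
    {
        /* a carve-out the kernel merely stays away from: plain
         * reservation, MEMBLOCK_RSRV_KERN is not set */
        memblock_reserve(0x80000000, SZ_16M);

        /* memory the kernel itself owns: MEMBLOCK_RSRV_KERN is set, so
         * memblock_reserved_kern_size() will count it when KHO sizes
         * its scratch areas later in this series */
        memblock_reserve_kern(0x81000000, SZ_16M);
    }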
Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/ Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/ Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/ Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/ Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Alexander Graf Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Cc: Dave Hansen Cc: Jason Gunthorpe Signed-off-by: Andrew Morton --- include/linux/memblock.h | 19 ++++++++- mm/memblock.c | 40 +++++++++++++++---- tools/testing/memblock/tests/alloc_api.c | 22 +++++----- .../memblock/tests/alloc_helpers_api.c | 4 +- tools/testing/memblock/tests/alloc_nid_api.c | 20 +++++----- 5 files changed, 73 insertions(+), 32 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index ef5a1ecc6e59..6c00fbc08513 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -42,6 +42,9 @@ extern unsigned long long max_possible_pfn; * kernel resource tree. * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are * not initialized (only for reserved regions). + * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, + * either explictitly with memblock_reserve_kern() or via memblock + * allocation APIs. All memblock allocations set this flag. 
*/ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ @@ -50,6 +53,7 @@ enum memblock_flags { MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */ MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */ MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */ + MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */ }; /** @@ -116,7 +120,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid, int memblock_add(phys_addr_t base, phys_addr_t size); int memblock_remove(phys_addr_t base, phys_addr_t size); int memblock_phys_free(phys_addr_t base, phys_addr_t size); -int memblock_reserve(phys_addr_t base, phys_addr_t size); +int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid, + enum memblock_flags flags); + +static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size) +{ + return __memblock_reserve(base, size, NUMA_NO_NODE, 0); +} + +static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size) +{ + return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN); +} + #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP int memblock_physmem_add(phys_addr_t base, phys_addr_t size); #endif @@ -476,6 +492,7 @@ static inline __init_memblock bool memblock_bottom_up(void) phys_addr_t memblock_phys_mem_size(void); phys_addr_t memblock_reserved_size(void); +phys_addr_t memblock_reserved_kern_size(phys_addr_t limit, int nid); unsigned long memblock_estimated_nr_free_pages(void); phys_addr_t memblock_start_of_DRAM(void); phys_addr_t memblock_end_of_DRAM(void); diff --git a/mm/memblock.c b/mm/memblock.c index 0e9ebb8aa7fe..ac377cd61029 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -499,7 +499,7 @@ static int __init_memblock memblock_double_array(struct memblock_type *type, * needn't do it */ if (!use_slab) - BUG_ON(memblock_reserve(addr, new_alloc_size)); + BUG_ON(memblock_reserve_kern(addr, new_alloc_size)); /* Update slab flag */ *in_slab = use_slab; @@ -649,7 +649,7 @@ repeat: #ifdef CONFIG_NUMA WARN_ON(nid != memblock_get_region_node(rgn)); #endif - WARN_ON(flags != rgn->flags); + WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags); nr_new++; if (insert) { if (start_rgn == -1) @@ -909,14 +909,15 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size) return memblock_remove_range(&memblock.reserved, base, size); } -int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) +int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size, + int nid, enum memblock_flags flags) { phys_addr_t end = base + size - 1; - memblock_dbg("%s: [%pa-%pa] %pS\n", __func__, - &base, &end, (void *)_RET_IP_); + memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__, + &base, &end, nid, flags, (void *)_RET_IP_); - return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0); + return memblock_add_range(&memblock.reserved, base, size, nid, flags); } #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -1467,14 +1468,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, again: found = memblock_find_in_range_node(size, align, start, end, nid, flags); - if (found && !memblock_reserve(found, size)) + if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN)) goto done; if (numa_valid_node(nid) && !exact_nid) { found = memblock_find_in_range_node(size, align, start, end, NUMA_NO_NODE, flags); - if (found && !memblock_reserve(found, size)) + if (found && !memblock_reserve_kern(found, size)) goto done; } @@ -1759,6 +1760,28 @@ 
phys_addr_t __init_memblock memblock_reserved_size(void) return memblock.reserved.total_size; } +phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int nid) +{ + struct memblock_region *r; + phys_addr_t total = 0; + + for_each_reserved_mem_region(r) { + phys_addr_t size = r->size; + + if (r->base > limit) + break; + + if (r->base + r->size > limit) + size = limit - r->base; + + if (nid == memblock_get_region_node(r) || !numa_valid_node(nid)) + if (r->flags & MEMBLOCK_RSRV_KERN) + total += size; + } + + return total; +} + /** * memblock_estimated_nr_free_pages - return estimated number of free pages * from memblock point of view @@ -2458,6 +2481,7 @@ static const char * const flagname[] = { [ilog2(MEMBLOCK_NOMAP)] = "NOMAP", [ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG", [ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT", + [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN", }; static int memblock_debug_show(struct seq_file *m, void *private) diff --git a/tools/testing/memblock/tests/alloc_api.c b/tools/testing/memblock/tests/alloc_api.c index 68f1a75cd72c..c55f67dd367d 100644 --- a/tools/testing/memblock/tests/alloc_api.c +++ b/tools/testing/memblock/tests/alloc_api.c @@ -134,7 +134,7 @@ static int alloc_top_down_before_check(void) PREFIX_PUSH(); setup_memblock(); - memblock_reserve(memblock_end_of_DRAM() - total_size, r1_size); + memblock_reserve_kern(memblock_end_of_DRAM() - total_size, r1_size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -182,7 +182,7 @@ static int alloc_top_down_after_check(void) total_size = r1.size + r2_size; - memblock_reserve(r1.base, r1.size); + memblock_reserve_kern(r1.base, r1.size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -231,8 +231,8 @@ static int alloc_top_down_second_fit_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES); @@ -285,8 +285,8 @@ static int alloc_in_between_generic_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES); @@ -422,7 +422,7 @@ static int alloc_limited_space_generic_check(void) setup_memblock(); /* Simulate almost-full memory */ - memblock_reserve(memblock_start_of_DRAM(), reserved_size); + memblock_reserve_kern(memblock_start_of_DRAM(), reserved_size); allocated_ptr = run_memblock_alloc(available_size, SMP_CACHE_BYTES); @@ -608,7 +608,7 @@ static int alloc_bottom_up_before_check(void) PREFIX_PUSH(); setup_memblock(); - memblock_reserve(memblock_start_of_DRAM() + r1_size, r2_size); + memblock_reserve_kern(memblock_start_of_DRAM() + r1_size, r2_size); allocated_ptr = run_memblock_alloc(r1_size, SMP_CACHE_BYTES); @@ -655,7 +655,7 @@ static int alloc_bottom_up_after_check(void) total_size = r1.size + r2_size; - memblock_reserve(r1.base, r1.size); + memblock_reserve_kern(r1.base, r1.size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -705,8 +705,8 @@ static int alloc_bottom_up_second_fit_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = 
run_memblock_alloc(r3_size, SMP_CACHE_BYTES); diff --git a/tools/testing/memblock/tests/alloc_helpers_api.c b/tools/testing/memblock/tests/alloc_helpers_api.c index 3ef9486da8a0..e5362cfd2ff3 100644 --- a/tools/testing/memblock/tests/alloc_helpers_api.c +++ b/tools/testing/memblock/tests/alloc_helpers_api.c @@ -163,7 +163,7 @@ static int alloc_from_top_down_no_space_above_check(void) min_addr = memblock_end_of_DRAM() - SMP_CACHE_BYTES * 2; /* No space above this address */ - memblock_reserve(min_addr, r2_size); + memblock_reserve_kern(min_addr, r2_size); allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr); @@ -199,7 +199,7 @@ static int alloc_from_top_down_min_addr_cap_check(void) start_addr = (phys_addr_t)memblock_start_of_DRAM(); min_addr = start_addr - SMP_CACHE_BYTES * 3; - memblock_reserve(start_addr + r1_size, MEM_SIZE - r1_size); + memblock_reserve_kern(start_addr + r1_size, MEM_SIZE - r1_size); allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr); diff --git a/tools/testing/memblock/tests/alloc_nid_api.c b/tools/testing/memblock/tests/alloc_nid_api.c index 49bb416d34ff..562e4701b0e0 100644 --- a/tools/testing/memblock/tests/alloc_nid_api.c +++ b/tools/testing/memblock/tests/alloc_nid_api.c @@ -324,7 +324,7 @@ static int alloc_nid_min_reserved_generic_check(void) min_addr = max_addr - r2_size; reserved_base = min_addr - r1_size; - memblock_reserve(reserved_base, r1_size); + memblock_reserve_kern(reserved_base, r1_size); allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -374,7 +374,7 @@ static int alloc_nid_max_reserved_generic_check(void) max_addr = memblock_end_of_DRAM() - r1_size; min_addr = max_addr - r2_size; - memblock_reserve(max_addr, r1_size); + memblock_reserve_kern(max_addr, r1_size); allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -436,8 +436,8 @@ static int alloc_nid_top_down_reserved_with_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -499,8 +499,8 @@ static int alloc_nid_reserved_full_merge_generic_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -563,8 +563,8 @@ static int alloc_nid_top_down_reserved_no_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -909,8 +909,8 @@ static int alloc_nid_bottom_up_reserved_with_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, From d59f43b5748092557d34244e29a618221a250501 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:20 -0700 Subject: [PATCH 
177/285] memblock: add support for scratch memory With KHO (Kexec HandOver), we need a way to ensure that the new kernel does not allocate memory on top of any memory regions that the previous kernel was handing over. But to know where those are, we need to include them in the memblock.reserved array, which may not be big enough to hold all ranges that need to be persisted across kexec. To resize the array, we need to allocate memory. That brings us into a catch-22 situation. The solution to that is to limit memblock allocations to the scratch regions: regions that are safe to operate in when there is memory that should remain intact across kexec. KHO provides several "scratch regions" as part of its metadata. These scratch regions are contiguous memory blocks that are known not to contain any memory that should be persisted across kexec. These regions should be large enough to accommodate all memblock allocations done by the kexec'ed kernel. We introduce a new memblock_set_kho_scratch_only() function that allows KHO to indicate that any memblock allocation must happen from the scratch regions. Later, we may want to perform another KHO kexec. For that, we reuse the same scratch regions. To ensure that no data destined for handover eventually gets allocated inside a scratch region, we flip the semantics of the scratch regions with memblock_clear_kho_scratch_only(): after that call, no allocations may happen from scratch memblock regions. We will lift that restriction in the next patch. Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/memblock.h | 20 +++++++++++++ mm/Kconfig | 4 +++ mm/memblock.c | 61 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 85 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 6c00fbc08513..993937a6b962 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -45,6 +45,11 @@ extern unsigned long long max_possible_pfn; * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, * either explictitly with memblock_reserve_kern() or via memblock * allocation APIs. All memblock allocations set this flag. + * @MEMBLOCK_KHO_SCRATCH: memory region that kexec can pass to the next + * kernel in handover mode. During early boot, we do not know about all + * memory reservations yet, so we get scratch memory from the previous + * kernel that we know is good to use. It is the only memory that + * allocations may happen from in this phase.
*/ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ @@ -54,6 +59,7 @@ enum memblock_flags { MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */ MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */ + MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */ }; /** @@ -148,6 +154,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); int memblock_mark_nomap(phys_addr_t base, phys_addr_t size); int memblock_clear_nomap(phys_addr_t base, phys_addr_t size); int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size); +int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size); +int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size); void memblock_free(void *ptr, size_t size); void reset_all_zones_managed_pages(void); @@ -291,6 +299,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m) return m->flags & MEMBLOCK_DRIVER_MANAGED; } +static inline bool memblock_is_kho_scratch(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_KHO_SCRATCH; +} + int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn, unsigned long *end_pfn); void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, @@ -619,5 +632,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { } static inline void memtest_report_meminfo(struct seq_file *m) { } #endif +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +void memblock_set_kho_scratch_only(void); +void memblock_clear_kho_scratch_only(void); +#else +static inline void memblock_set_kho_scratch_only(void) { } +static inline void memblock_clear_kho_scratch_only(void) { } +#endif #endif /* _LINUX_MEMBLOCK_H */ diff --git a/mm/Kconfig b/mm/Kconfig index e113f713b493..60ea9eba4814 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -469,6 +469,10 @@ config HAVE_GUP_FAST depends on MMU bool +# Enable memblock support for scratch memory which is needed for kexec handover +config MEMBLOCK_KHO_SCRATCH + bool + # Don't discard allocated memory used to track "memory" and "reserved" memblocks # after early boot, so it can still be used to test for validity of memory. # Also, memblocks are updated with memory hot(un)plug. diff --git a/mm/memblock.c b/mm/memblock.c index ac377cd61029..58cb82d444b1 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -107,6 +107,13 @@ unsigned long min_low_pfn; unsigned long max_pfn; unsigned long long max_possible_pfn; +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +/* When set to true, only allocate from MEMBLOCK_KHO_SCRATCH ranges */ +static bool kho_scratch_only; +#else +#define kho_scratch_only false +#endif + static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock; static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock; #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -166,6 +173,10 @@ bool __init_memblock memblock_has_mirror(void) static enum memblock_flags __init_memblock choose_memblock_flags(void) { + /* skip non-scratch memory for kho early boot allocations */ + if (kho_scratch_only) + return MEMBLOCK_KHO_SCRATCH; + return system_has_some_mirror ? 
MEMBLOCK_MIRROR : MEMBLOCK_NONE; } @@ -932,6 +943,18 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size) } #endif +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +__init void memblock_set_kho_scratch_only(void) +{ + kho_scratch_only = true; +} + +__init void memblock_clear_kho_scratch_only(void) +{ + kho_scratch_only = false; +} +#endif + /** * memblock_setclr_flag - set or clear flag for a memory region * @type: memblock type to set/clear flag for @@ -1057,6 +1080,36 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t MEMBLOCK_RSRV_NOINIT); } +/** + * memblock_mark_kho_scratch - Mark a memory region as MEMBLOCK_KHO_SCRATCH. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH will be considered + * for allocations during early boot with kexec handover. + * + * Return: 0 on success, -errno on failure. + */ +__init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 1, + MEMBLOCK_KHO_SCRATCH); +} + +/** + * memblock_clear_kho_scratch - Clear MEMBLOCK_KHO_SCRATCH flag for a + * specified region. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Return: 0 on success, -errno on failure. + */ +__init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 0, + MEMBLOCK_KHO_SCRATCH); +} + static bool should_skip_region(struct memblock_type *type, struct memblock_region *m, int nid, int flags) @@ -1088,6 +1141,13 @@ static bool should_skip_region(struct memblock_type *type, if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) return true; + /* + * In early alloc during kexec handover, we can only consider + * MEMBLOCK_KHO_SCRATCH regions for the allocations + */ + if ((flags & MEMBLOCK_KHO_SCRATCH) && !memblock_is_kho_scratch(m)) + return true; + return false; } @@ -2482,6 +2542,7 @@ static const char * const flagname[] = { [ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG", [ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT", [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN", + [ilog2(MEMBLOCK_KHO_SCRATCH)] = "KHO_SCRATCH", }; static int memblock_debug_show(struct seq_file *m, void *private) From b8a8f96a6dce527ad316184ff1e20f238ed413d8 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Fri, 9 May 2025 00:46:21 -0700 Subject: [PATCH 178/285] memblock: introduce memmap_init_kho_scratch() With deferred initialization of struct page it will be necessary to initialize memory map for KHO scratch regions early. Add memmap_init_kho_scratch() method that will allow such initialization in upcoming patches. Link: https://lkml.kernel.org/r/20250509074635.3187114-4-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Changyuan Lyu Cc: Alexander Graf Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/memblock.h | 2 ++ mm/internal.h | 2 ++ mm/memblock.c | 22 ++++++++++++++++++++++ mm/mm_init.c | 11 ++++++++--- 4 files changed, 34 insertions(+), 3 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 993937a6b962..bb19a2534224 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -635,9 +635,11 @@ static inline void memtest_report_meminfo(struct seq_file *m) { } #ifdef CONFIG_MEMBLOCK_KHO_SCRATCH void memblock_set_kho_scratch_only(void); void memblock_clear_kho_scratch_only(void); +void memmap_init_kho_scratch_pages(void); #else static inline void memblock_set_kho_scratch_only(void) { } static inline void memblock_clear_kho_scratch_only(void) { } +static inline void memmap_init_kho_scratch_pages(void) {} #endif #endif /* _LINUX_MEMBLOCK_H */ diff --git a/mm/internal.h b/mm/internal.h index 780481a8be0e..cf7c0e9ef7ec 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1119,6 +1119,8 @@ DECLARE_STATIC_KEY_TRUE(deferred_pages); bool __init deferred_grow_zone(struct zone *zone, unsigned int order); #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ +void init_deferred_page(unsigned long pfn, int nid); + enum mminit_level { MMINIT_WARNING, MMINIT_VERIFY, diff --git a/mm/memblock.c b/mm/memblock.c index 58cb82d444b1..ec30d850e195 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -953,6 +953,28 @@ __init void memblock_clear_kho_scratch_only(void) { kho_scratch_only = false; } + +__init void memmap_init_kho_scratch_pages(void) +{ + phys_addr_t start, end; + unsigned long pfn; + int nid; + u64 i; + + if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) + return; + + /* + * Initialize struct pages for free scratch memory. + * The struct pages for reserved scratch memory will be set up in + * reserve_bootmem_region() + */ + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, + MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) { + for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++) + init_deferred_page(pfn, nid); + } +} #endif /** diff --git a/mm/mm_init.c b/mm/mm_init.c index c275ae561b6f..62d7f551b295 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -743,7 +743,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) return false; } -static void __meminit init_deferred_page(unsigned long pfn, int nid) +static void __meminit __init_deferred_page(unsigned long pfn, int nid) { if (early_page_initialised(pfn, nid)) return; @@ -763,11 +763,16 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn) return false; } -static inline void init_deferred_page(unsigned long pfn, int nid) +static inline void __init_deferred_page(unsigned long pfn, int nid) { } #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ +void __meminit init_deferred_page(unsigned long pfn, int nid) +{ + __init_deferred_page(pfn, nid); +} + /* * Initialised pages do not have PageReserved set. 
This function is * called for each range allocated by the bootmem allocator and @@ -784,7 +789,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start, if (pfn_valid(start_pfn)) { struct page *page = pfn_to_page(start_pfn); - init_deferred_page(start_pfn, nid); + __init_deferred_page(start_pfn, nid); /* * no need for atomic set_bit because the struct From 3dc92c311498c4d307cfdd0c6c3ac9355b50f683 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:22 -0700 Subject: [PATCH 179/285] kexec: add Kexec HandOver (KHO) generation helpers Add the infrastructure to generate Kexec HandOver metadata. Kexec HandOver is a mechanism that allows Linux to preserve state - arbitrary properties as well as memory locations - across kexec. It does so using 2 concepts: 1) KHO FDT - Every KHO kexec carries a KHO specific flattened device tree blob that describes preserved memory regions. Device drivers can register to KHO to serialize and preserve their states before kexec. 2) Scratch Regions - CMA regions that we allocate in the first kernel. CMA gives us the guarantee that no handover pages land in those regions, because handover pages must be at a static physical memory location. We use these regions as the place to load future kexec images so that they won't collide with any handover data. Link: https://lkml.kernel.org/r/20250509074635.3187114-5-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Pratyush Yadav Signed-off-by: Pratyush Yadav Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- MAINTAINERS | 9 + include/linux/kexec_handover.h | 59 ++++ kernel/Makefile | 1 + kernel/kexec_handover.c | 557 +++++++++++++++++++++++++++++++++ mm/mm_init.c | 8 + 5 files changed, 634 insertions(+) create mode 100644 include/linux/kexec_handover.h create mode 100644 kernel/kexec_handover.c diff --git a/MAINTAINERS b/MAINTAINERS index 2f80c618d325..943b23fc3442 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13139,6 +13139,15 @@ F: include/linux/kexec.h F: include/uapi/linux/kexec.h F: kernel/kexec* +KEXEC HANDOVER (KHO) +M: Alexander Graf +M: Mike Rapoport +M: Changyuan Lyu +L: kexec@lists.infradead.org +S: Maintained +F: include/linux/kexec_handover.h +F: kernel/kexec_handover.c + KEYS-ENCRYPTED M: Mimi Zohar L: linux-integrity@vger.kernel.org diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h new file mode 100644 index 000000000000..2e19004776f6 --- /dev/null +++ b/include/linux/kexec_handover.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef LINUX_KEXEC_HANDOVER_H +#define LINUX_KEXEC_HANDOVER_H + +#include +#include + +struct kho_scratch { + phys_addr_t addr; + phys_addr_t size; +}; + +/* KHO Notifier index */ +enum kho_event { + KEXEC_KHO_FINALIZE = 0, + KEXEC_KHO_ABORT = 1, +}; + +struct notifier_block; + +struct kho_serialization; + +#ifdef CONFIG_KEXEC_HANDOVER +bool kho_is_enabled(void); + +int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt); + +int register_kho_notifier(struct notifier_block *nb); +int unregister_kho_notifier(struct notifier_block *nb); + +void kho_memory_init(void); +#else +static inline bool kho_is_enabled(void) +{ + return false; +} + +static inline int kho_add_subtree(struct kho_serialization *ser, + const char *name, void *fdt) +{ + return -EOPNOTSUPP; +} + +static inline int register_kho_notifier(struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} + +static inline int unregister_kho_notifier(struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} + +static inline void kho_memory_init(void) +{ +} +#endif /* CONFIG_KEXEC_HANDOVER */ + +#endif /* LINUX_KEXEC_HANDOVER_H */ diff --git a/kernel/Makefile b/kernel/Makefile index 434929de17ef..97c09847db42 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -80,6 +80,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_KEXEC_FILE) += kexec_file.o obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o +obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o obj-$(CONFIG_COMPAT) += compat.o obj-$(CONFIG_CGROUPS) += cgroup/ diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c new file mode 100644 index 000000000000..e541d3d5003d --- /dev/null +++ b/kernel/kexec_handover.c @@ -0,0 +1,557 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kexec_handover.c - kexec handover metadata processing + * Copyright (C) 2023 Alexander Graf + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Google LLC, Changyuan Lyu + */ + +#define pr_fmt(fmt) "KHO: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +/* + * KHO is tightly coupled with mm init and needs 
access to some of mm + * internal APIs. + */ +#include "../mm/internal.h" + +#define KHO_FDT_COMPATIBLE "kho-v1" +#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map" +#define PROP_SUB_FDT "fdt" + +static bool kho_enable __ro_after_init; + +bool kho_is_enabled(void) +{ + return kho_enable; +} +EXPORT_SYMBOL_GPL(kho_is_enabled); + +static int __init kho_parse_enable(char *p) +{ + return kstrtobool(p, &kho_enable); +} +early_param("kho", kho_parse_enable); + +struct kho_serialization { + struct page *fdt; + struct list_head fdt_list; + struct dentry *sub_fdt_dir; +}; + +/* + * With KHO enabled, memory can become fragmented because KHO regions may + * be anywhere in physical address space. The scratch regions give us a + * safe zones that we will never see KHO allocations from. This is where we + * can later safely load our new kexec images into and then use the scratch + * area for early allocations that happen before page allocator is + * initialized. + */ +static struct kho_scratch *kho_scratch; +static unsigned int kho_scratch_cnt; + +/* + * The scratch areas are scaled by default as percent of memory allocated from + * memblock. A user can override the scale with command line parameter: + * + * kho_scratch=N% + * + * It is also possible to explicitly define size for a lowmem, a global and + * per-node scratch areas: + * + * kho_scratch=l[KMG],n[KMG],m[KMG] + * + * The explicit size definition takes precedence over scale definition. + */ +static unsigned int scratch_scale __initdata = 200; +static phys_addr_t scratch_size_global __initdata; +static phys_addr_t scratch_size_pernode __initdata; +static phys_addr_t scratch_size_lowmem __initdata; + +static int __init kho_parse_scratch_size(char *p) +{ + size_t len; + unsigned long sizes[3]; + int i; + + if (!p) + return -EINVAL; + + len = strlen(p); + if (!len) + return -EINVAL; + + /* parse nn% */ + if (p[len - 1] == '%') { + /* unsigned int max is 4,294,967,295, 10 chars */ + char s_scale[11] = {}; + int ret = 0; + + if (len > ARRAY_SIZE(s_scale)) + return -EINVAL; + + memcpy(s_scale, p, len - 1); + ret = kstrtouint(s_scale, 10, &scratch_scale); + if (!ret) + pr_notice("scratch scale is %d%%\n", scratch_scale); + return ret; + } + + /* parse ll[KMG],mm[KMG],nn[KMG] */ + for (i = 0; i < ARRAY_SIZE(sizes); i++) { + char *endp = p; + + if (i > 0) { + if (*p != ',') + return -EINVAL; + p += 1; + } + + sizes[i] = memparse(p, &endp); + if (!sizes[i] || endp == p) + return -EINVAL; + p = endp; + } + + scratch_size_lowmem = sizes[0]; + scratch_size_global = sizes[1]; + scratch_size_pernode = sizes[2]; + scratch_scale = 0; + + pr_notice("scratch areas: lowmem: %lluMiB global: %lluMiB pernode: %lldMiB\n", + (u64)(scratch_size_lowmem >> 20), + (u64)(scratch_size_global >> 20), + (u64)(scratch_size_pernode >> 20)); + + return 0; +} +early_param("kho_scratch", kho_parse_scratch_size); + +static void __init scratch_size_update(void) +{ + phys_addr_t size; + + if (!scratch_scale) + return; + + size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, + NUMA_NO_NODE); + size = size * scratch_scale / 100; + scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES); + + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, + NUMA_NO_NODE); + size = size * scratch_scale / 100 - scratch_size_lowmem; + scratch_size_global = round_up(size, CMA_MIN_ALIGNMENT_BYTES); +} + +static phys_addr_t __init scratch_size_node(int nid) +{ + phys_addr_t size; + + if (scratch_scale) { + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, + nid); + size 
= size * scratch_scale / 100; + } else { + size = scratch_size_pernode; + } + + return round_up(size, CMA_MIN_ALIGNMENT_BYTES); +} + +/** + * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec + * + * With KHO we can preserve arbitrary pages in the system. To ensure we still + * have a large contiguous region of memory when we search the physical address + * space for target memory, let's make sure we always have a large CMA region + * active. This CMA region will only be used for movable pages which are not a + * problem for us during KHO because we can just move them somewhere else. + */ +static void __init kho_reserve_scratch(void) +{ + phys_addr_t addr, size; + int nid, i = 0; + + if (!kho_enable) + return; + + scratch_size_update(); + + /* FIXME: deal with node hot-plug/remove */ + kho_scratch_cnt = num_online_nodes() + 2; + size = kho_scratch_cnt * sizeof(*kho_scratch); + kho_scratch = memblock_alloc(size, PAGE_SIZE); + if (!kho_scratch) + goto err_disable_kho; + + /* + * reserve scratch area in low memory for lowmem allocations in the + * next kernel + */ + size = scratch_size_lowmem; + addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0, + ARCH_LOW_ADDRESS_LIMIT); + if (!addr) + goto err_free_scratch_desc; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + + /* reserve large contiguous area for allocations without nid */ + size = scratch_size_global; + addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES); + if (!addr) + goto err_free_scratch_areas; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + + for_each_online_node(nid) { + size = scratch_size_node(nid); + addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES, + 0, MEMBLOCK_ALLOC_ACCESSIBLE, + nid, true); + if (!addr) + goto err_free_scratch_areas; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + } + + return; + +err_free_scratch_areas: + for (i--; i >= 0; i--) + memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size); +err_free_scratch_desc: + memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch)); +err_disable_kho: + kho_enable = false; +} + +struct fdt_debugfs { + struct list_head list; + struct debugfs_blob_wrapper wrapper; + struct dentry *file; +}; + +static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, + const char *name, const void *fdt) +{ + struct fdt_debugfs *f; + struct dentry *file; + + f = kmalloc(sizeof(*f), GFP_KERNEL); + if (!f) + return -ENOMEM; + + f->wrapper.data = (void *)fdt; + f->wrapper.size = fdt_totalsize(fdt); + + file = debugfs_create_blob(name, 0400, dir, &f->wrapper); + if (IS_ERR(file)) { + kfree(f); + return PTR_ERR(file); + } + + f->file = file; + list_add(&f->list, list); + + return 0; +} + +/** + * kho_add_subtree - record the physical address of a sub FDT in KHO root tree. + * @ser: serialization control object passed by KHO notifiers. + * @name: name of the sub tree. + * @fdt: the sub tree blob. + * + * Creates a new child node named @name in KHO root FDT and records + * the physical address of @fdt. The pages of @fdt must also be preserved + * by KHO for the new kernel to retrieve it after kexec. + * + * A debugfs blob entry is also created at + * ``/sys/kernel/debug/kho/out/sub_fdts/@name``. 
+ * + * Return: 0 on success, error code on failure + */ +int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt) +{ + int err = 0; + u64 phys = (u64)virt_to_phys(fdt); + void *root = page_to_virt(ser->fdt); + + err |= fdt_begin_node(root, name); + err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys)); + err |= fdt_end_node(root); + + if (err) + return err; + + return kho_debugfs_fdt_add(&ser->fdt_list, ser->sub_fdt_dir, name, fdt); +} +EXPORT_SYMBOL_GPL(kho_add_subtree); + +struct kho_out { + struct blocking_notifier_head chain_head; + + struct dentry *dir; + + struct mutex lock; /* protects KHO FDT finalization */ + + struct kho_serialization ser; + bool finalized; +}; + +static struct kho_out kho_out = { + .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head), + .lock = __MUTEX_INITIALIZER(kho_out.lock), + .ser = { + .fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list), + }, + .finalized = false, +}; + +int register_kho_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&kho_out.chain_head, nb); +} +EXPORT_SYMBOL_GPL(register_kho_notifier); + +int unregister_kho_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&kho_out.chain_head, nb); +} +EXPORT_SYMBOL_GPL(unregister_kho_notifier); + +/* Handling for debug/kho/out */ + +static struct dentry *debugfs_root; + +static int kho_out_update_debugfs_fdt(void) +{ + int err = 0; + struct fdt_debugfs *ff, *tmp; + + if (kho_out.finalized) { + err = kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir, + "fdt", page_to_virt(kho_out.ser.fdt)); + } else { + list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) { + debugfs_remove(ff->file); + list_del(&ff->list); + kfree(ff); + } + } + + return err; +} + +static int kho_abort(void) +{ + int err; + + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT, + NULL); + err = notifier_to_errno(err); + + if (err) + pr_err("Failed to abort KHO finalization: %d\n", err); + + return err; +} + +static int kho_finalize(void) +{ + int err = 0; + void *fdt = page_to_virt(kho_out.ser.fdt); + + err |= fdt_create(fdt, PAGE_SIZE); + err |= fdt_finish_reservemap(fdt); + err |= fdt_begin_node(fdt, ""); + err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE); + if (err) + goto abort; + + err = blocking_notifier_call_chain(&kho_out.chain_head, + KEXEC_KHO_FINALIZE, &kho_out.ser); + err = notifier_to_errno(err); + if (err) + goto abort; + + err |= fdt_end_node(fdt); + err |= fdt_finish(fdt); + +abort: + if (err) { + pr_err("Failed to convert KHO state tree: %d\n", err); + kho_abort(); + } + + return err; +} + +static int kho_out_finalize_get(void *data, u64 *val) +{ + mutex_lock(&kho_out.lock); + *val = kho_out.finalized; + mutex_unlock(&kho_out.lock); + + return 0; +} + +static int kho_out_finalize_set(void *data, u64 _val) +{ + int ret = 0; + bool val = !!_val; + + mutex_lock(&kho_out.lock); + + if (val == kho_out.finalized) { + if (kho_out.finalized) + ret = -EEXIST; + else + ret = -ENOENT; + goto unlock; + } + + if (val) + ret = kho_finalize(); + else + ret = kho_abort(); + + if (ret) + goto unlock; + + kho_out.finalized = val; + ret = kho_out_update_debugfs_fdt(); + +unlock: + mutex_unlock(&kho_out.lock); + return ret; +} + +DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get, + kho_out_finalize_set, "%llu\n"); + +static int scratch_phys_show(struct seq_file *m, void *v) +{ + for (int i = 0; i < kho_scratch_cnt; i++) + seq_printf(m, "0x%llx\n", kho_scratch[i].addr); + + 
return 0; +} +DEFINE_SHOW_ATTRIBUTE(scratch_phys); + +static int scratch_len_show(struct seq_file *m, void *v) +{ + for (int i = 0; i < kho_scratch_cnt; i++) + seq_printf(m, "0x%llx\n", kho_scratch[i].size); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(scratch_len); + +static __init int kho_out_debugfs_init(void) +{ + struct dentry *dir, *f, *sub_fdt_dir; + + dir = debugfs_create_dir("out", debugfs_root); + if (IS_ERR(dir)) + return -ENOMEM; + + sub_fdt_dir = debugfs_create_dir("sub_fdts", dir); + if (IS_ERR(sub_fdt_dir)) + goto err_rmdir; + + f = debugfs_create_file("scratch_phys", 0400, dir, NULL, + &scratch_phys_fops); + if (IS_ERR(f)) + goto err_rmdir; + + f = debugfs_create_file("scratch_len", 0400, dir, NULL, + &scratch_len_fops); + if (IS_ERR(f)) + goto err_rmdir; + + f = debugfs_create_file("finalize", 0600, dir, NULL, + &fops_kho_out_finalize); + if (IS_ERR(f)) + goto err_rmdir; + + kho_out.dir = dir; + kho_out.ser.sub_fdt_dir = sub_fdt_dir; + return 0; + +err_rmdir: + debugfs_remove_recursive(dir); + return -ENOENT; +} + +static __init int kho_init(void) +{ + int err = 0; + + if (!kho_enable) + return 0; + + kho_out.ser.fdt = alloc_page(GFP_KERNEL); + if (!kho_out.ser.fdt) { + err = -ENOMEM; + goto err_free_scratch; + } + + debugfs_root = debugfs_create_dir("kho", NULL); + if (IS_ERR(debugfs_root)) { + err = -ENOENT; + goto err_free_fdt; + } + + err = kho_out_debugfs_init(); + if (err) + goto err_free_fdt; + + for (int i = 0; i < kho_scratch_cnt; i++) { + unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr); + unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; + unsigned long pfn; + + for (pfn = base_pfn; pfn < base_pfn + count; + pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + } + + return 0; + +err_free_fdt: + put_page(kho_out.ser.fdt); + kho_out.ser.fdt = NULL; +err_free_scratch: + for (int i = 0; i < kho_scratch_cnt; i++) { + void *start = __va(kho_scratch[i].addr); + void *end = start + kho_scratch[i].size; + + free_reserved_area(start, end, -1, ""); + } + kho_enable = false; + return err; +} +late_initcall(kho_init); + +void __init kho_memory_init(void) +{ + kho_reserve_scratch(); +} diff --git a/mm/mm_init.c b/mm/mm_init.c index 62d7f551b295..b35006d9d49d 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include "internal.h" #include "slab.h" @@ -2770,6 +2771,13 @@ void __init mm_core_init(void) report_meminit(); kmsan_init_shadow(); stack_depot_early_init(); + + /* + * KHO memory setup must happen while memblock is still active, but + * as close as possible to buddy initialization + */ + kho_memory_init(); + memblock_free_all(); mem_init(); kmem_cache_init(); From c609c144b0e8dbc19712ff8c8a0929be38afe58d Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:23 -0700 Subject: [PATCH 180/285] kexec: add KHO parsing support When we have a KHO kexec, we get an FDT blob and scratch region to populate the state of the system. Provide helper functions that allow architecture code to easily handle memory reservations based on them and give device drivers visibility into the KHO FDT and memory reservations so they can recover their own state. Include a fix from Arnd Bergmann https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/. 
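For illustration only (this sketch is not part of the patch; the "example" node name and the "example-v1" compatible string are made up), a driver that registered its state with kho_add_subtree() before kexec could recover it in the new kernel roughly like so:

	static int example_restore(void)
	{
		phys_addr_t phys;
		const void *fdt;
		int err;

		/* Look up the sub FDT registered under the name "example" */
		err = kho_retrieve_subtree("example", &phys);
		if (err)
			return err;

		/* The preserved blob is ordinary RAM covered by the direct map */
		fdt = phys_to_virt(phys);

		/* Check that the blob really is ours before parsing it */
		if (fdt_node_check_compatible(fdt, 0, "example-v1"))
			return -EINVAL;

		/* ... read properties and rebuild driver state ... */
		return 0;
	}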
Link: https://lkml.kernel.org/r/20250509074635.3187114-6-changyuanl@google.com Signed-off-by: Alexander Graf Signed-off-by: Arnd Bergmann Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 14 ++ kernel/kexec_handover.c | 233 ++++++++++++++++++++++++++++++++- mm/memblock.c | 1 + 3 files changed, 247 insertions(+), 1 deletion(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 2e19004776f6..02dcfc8c427e 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -24,11 +24,15 @@ struct kho_serialization; bool kho_is_enabled(void); int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt); +int kho_retrieve_subtree(const char *name, phys_addr_t *phys); int register_kho_notifier(struct notifier_block *nb); int unregister_kho_notifier(struct notifier_block *nb); void kho_memory_init(void); + +void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, + u64 scratch_len); #else static inline bool kho_is_enabled(void) { @@ -41,6 +45,11 @@ static inline int kho_add_subtree(struct kho_serialization *ser, return -EOPNOTSUPP; } +static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +{ + return -EOPNOTSUPP; +} + static inline int register_kho_notifier(struct notifier_block *nb) { return -EOPNOTSUPP; @@ -54,6 +63,11 @@ static inline int unregister_kho_notifier(struct notifier_block *nb) static inline void kho_memory_init(void) { } + +static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, + phys_addr_t scratch_phys, u64 scratch_len) +{ +} #endif /* CONFIG_KEXEC_HANDOVER */ #endif /* LINUX_KEXEC_HANDOVER_H */ diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index e541d3d5003d..59f3cf9557f5 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -17,6 +17,9 @@ #include #include #include + +#include + /* * KHO is tightly coupled with mm init and needs access to some of mm * internal APIs. @@ -501,9 +504,112 @@ err_rmdir: return -ENOENT; } +struct kho_in { + struct dentry *dir; + phys_addr_t fdt_phys; + phys_addr_t scratch_phys; + struct list_head fdt_list; +}; + +static struct kho_in kho_in = { + .fdt_list = LIST_HEAD_INIT(kho_in.fdt_list), +}; + +static const void *kho_get_fdt(void) +{ + return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL; +} + +/** + * kho_retrieve_subtree - retrieve a preserved sub FDT by its name. + * @name: the name of the sub FDT passed to kho_add_subtree(). + * @phys: if found, the physical address of the sub FDT is stored in @phys. + * + * Retrieve a preserved sub FDT named @name and store its physical + * address in @phys. 
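+ * The returned address points at preserved memory, so the caller can map + * it with phys_to_virt() and parse the blob with libfdt.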
+ * + * Return: 0 on success, error code on failure + */ +int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +{ + const void *fdt = kho_get_fdt(); + const u64 *val; + int offset, len; + + if (!fdt) + return -ENOENT; + + if (!phys) + return -EINVAL; + + offset = fdt_subnode_offset(fdt, 0, name); + if (offset < 0) + return -ENOENT; + + val = fdt_getprop(fdt, offset, PROP_SUB_FDT, &len); + if (!val || len != sizeof(*val)) + return -EINVAL; + + *phys = (phys_addr_t)*val; + + return 0; +} +EXPORT_SYMBOL_GPL(kho_retrieve_subtree); + +/* Handling for debugfs/kho/in */ + +static __init int kho_in_debugfs_init(const void *fdt) +{ + struct dentry *sub_fdt_dir; + int err, child; + + kho_in.dir = debugfs_create_dir("in", debugfs_root); + if (IS_ERR(kho_in.dir)) + return PTR_ERR(kho_in.dir); + + sub_fdt_dir = debugfs_create_dir("sub_fdts", kho_in.dir); + if (IS_ERR(sub_fdt_dir)) { + err = PTR_ERR(sub_fdt_dir); + goto err_rmdir; + } + + err = kho_debugfs_fdt_add(&kho_in.fdt_list, kho_in.dir, "fdt", fdt); + if (err) + goto err_rmdir; + + fdt_for_each_subnode(child, fdt, 0) { + int len = 0; + const char *name = fdt_get_name(fdt, child, NULL); + const u64 *fdt_phys; + + fdt_phys = fdt_getprop(fdt, child, "fdt", &len); + if (!fdt_phys) + continue; + if (len != sizeof(*fdt_phys)) { + pr_warn("node `%s`'s prop `fdt` has invalid length: %d\n", + name, len); + continue; + } + err = kho_debugfs_fdt_add(&kho_in.fdt_list, sub_fdt_dir, name, + phys_to_virt(*fdt_phys)); + if (err) { + pr_warn("failed to add fdt `%s` to debugfs: %d\n", name, + err); + continue; + } + } + + return 0; + +err_rmdir: + debugfs_remove_recursive(kho_in.dir); + return err; +} + static __init int kho_init(void) { int err = 0; + const void *fdt = kho_get_fdt(); if (!kho_enable) return 0; @@ -524,6 +630,20 @@ static __init int kho_init(void) if (err) goto err_free_fdt; + if (fdt) { + err = kho_in_debugfs_init(fdt); + /* + * Failure to create /sys/kernel/debug/kho/in does not prevent + * reviving state from KHO and setting up KHO for the next + * kexec. + */ + if (err) + pr_err("failed exposing handover FDT in debugfs: %d\n", + err); + + return 0; + } + for (int i = 0; i < kho_scratch_cnt; i++) { unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr); unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; @@ -551,7 +671,118 @@ err_free_scratch: } late_initcall(kho_init); +static void __init kho_release_scratch(void) +{ + phys_addr_t start, end; + u64 i; + + memmap_init_kho_scratch_pages(); + + /* + * Mark scratch mem as CMA before we return it. That way we + * ensure that no kernel allocations happen on it. That means + * we can reuse it as scratch memory again later. 
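+ * Pages in MIGRATE_CMA pageblocks are only handed out for movable + * allocations, which can be migrated away again, so nothing unmovable + * can end up pinning the scratch area.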
+ */ + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, + MEMBLOCK_KHO_SCRATCH, &start, &end, NULL) { + ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start)); + ulong end_pfn = pageblock_align(PFN_UP(end)); + ulong pfn; + + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) + set_pageblock_migratetype(pfn_to_page(pfn), + MIGRATE_CMA); + } +} + void __init kho_memory_init(void) { - kho_reserve_scratch(); + if (kho_in.scratch_phys) { + kho_scratch = phys_to_virt(kho_in.scratch_phys); + kho_release_scratch(); + } else { + kho_reserve_scratch(); + } +} + +void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, + phys_addr_t scratch_phys, u64 scratch_len) +{ + void *fdt = NULL; + struct kho_scratch *scratch = NULL; + int err = 0; + unsigned int scratch_cnt = scratch_len / sizeof(*kho_scratch); + + /* Validate the input FDT */ + fdt = early_memremap(fdt_phys, fdt_len); + if (!fdt) { + pr_warn("setup: failed to memremap FDT (0x%llx)\n", fdt_phys); + err = -EFAULT; + goto out; + } + err = fdt_check_header(fdt); + if (err) { + pr_warn("setup: handover FDT (0x%llx) is invalid: %d\n", + fdt_phys, err); + err = -EINVAL; + goto out; + } + err = fdt_node_check_compatible(fdt, 0, KHO_FDT_COMPATIBLE); + if (err) { + pr_warn("setup: handover FDT (0x%llx) is incompatible with '%s': %d\n", + fdt_phys, KHO_FDT_COMPATIBLE, err); + err = -EINVAL; + goto out; + } + + scratch = early_memremap(scratch_phys, scratch_len); + if (!scratch) { + pr_warn("setup: failed to memremap scratch (phys=0x%llx, len=%lld)\n", + scratch_phys, scratch_len); + err = -EFAULT; + goto out; + } + + /* + * We pass safe contiguous blocks of memory to use for early boot + * purposes from the previous kernel so that we can resize the + * memblock array as needed. + */ + for (int i = 0; i < scratch_cnt; i++) { + struct kho_scratch *area = &scratch[i]; + u64 size = area->size; + + memblock_add(area->addr, size); + err = memblock_mark_kho_scratch(area->addr, size); + if (WARN_ON(err)) { + pr_warn("failed to mark the scratch region 0x%pa+0x%pa: %d", + &area->addr, &size, err); + goto out; + } + pr_debug("Marked 0x%pa+0x%pa as scratch", &area->addr, &size); + } + + memblock_reserve(scratch_phys, scratch_len); + + /* + * Now that we have a viable region of scratch memory, let's tell + * the memblock allocator to only use that for any allocations. + * That way we ensure that nothing scribbles over in-use data while + * we initialize the page tables which we will need to ingest all + * memory reservations from the previous kernel. + */ + memblock_set_kho_scratch_only(); + + kho_in.fdt_phys = fdt_phys; + kho_in.scratch_phys = scratch_phys; + kho_scratch_cnt = scratch_cnt; + pr_info("found kexec handover data.
Will skip init for some devices\n"); + +out: + if (fdt) + early_memunmap(fdt, fdt_len); + if (scratch) + early_memunmap(scratch, scratch_len); + if (err) + pr_warn("disabling KHO revival: %d\n", err); } diff --git a/mm/memblock.c b/mm/memblock.c index ec30d850e195..8895b95ffb5b 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2394,6 +2394,7 @@ void __init memblock_free_all(void) free_unused_memmap(); reset_all_zones_managed_pages(); + memblock_clear_kho_scratch_only(); pages = free_low_memory_core_early(); totalram_pages_add(pages); } From fc33e4b44b2717feba2f6f07ce7943a96499c9ec Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Fri, 9 May 2025 00:46:24 -0700 Subject: [PATCH 181/285] kexec: enable KHO support for memory preservation Introduce APIs allowing KHO users to preserve memory across kexec and get access to that memory after boot of the kexeced kernel kho_preserve_folio() - record a folio to be preserved over kexec kho_restore_folio() - recreates the folio from the preserved memory kho_preserve_phys() - record physically contiguous range to be preserved over kexec. The memory preservations are tracked by two levels of xarrays to manage chunks of per-order 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a 1TB x86 system would fit inside a single 512 byte bitmap. For order 0 allocations each bitmap will cover 16M of address space. Thus, for 16G of memory at most 512K of bitmap memory will be needed for order 0. At serialization time all bitmaps are recorded in a linked list of pages for the next kernel to process and the physical address of the list is recorded in KHO FDT. The next kernel then processes that list, reserves the memory ranges and later, when a user requests a folio or a physical range, KHO restores corresponding memory map entries. Link: https://lkml.kernel.org/r/20250509074635.3187114-7-changyuanl@google.com Suggested-by: Jason Gunthorpe Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Alexander Graf Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 36 +++ kernel/kexec_handover.c | 411 +++++++++++++++++++++++++++++++++ 2 files changed, 447 insertions(+) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 02dcfc8c427e..348844cffb13 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -16,13 +16,34 @@ enum kho_event { KEXEC_KHO_ABORT = 1, }; +struct folio; struct notifier_block; +#define DECLARE_KHOSER_PTR(name, type) \ + union { \ + phys_addr_t phys; \ + type ptr; \ + } name +#define KHOSER_STORE_PTR(dest, val) \ + ({ \ + typeof(val) v = val; \ + typecheck(typeof((dest).ptr), v); \ + (dest).phys = virt_to_phys(v); \ + }) +#define KHOSER_LOAD_PTR(src) \ + ({ \ + typeof(src) s = src; \ + (typeof((s).ptr))((s).phys ? 
phys_to_virt((s).phys) : NULL); \ + }) + struct kho_serialization; #ifdef CONFIG_KEXEC_HANDOVER bool kho_is_enabled(void); +int kho_preserve_folio(struct folio *folio); +int kho_preserve_phys(phys_addr_t phys, size_t size); +struct folio *kho_restore_folio(phys_addr_t phys); int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt); int kho_retrieve_subtree(const char *name, phys_addr_t *phys); @@ -39,6 +60,21 @@ static inline bool kho_is_enabled(void) return false; } +static inline int kho_preserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + return -EOPNOTSUPP; +} + +static inline struct folio *kho_restore_folio(phys_addr_t phys) +{ + return NULL; +} + static inline int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt) { diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 59f3cf9557f5..9cc818cefd15 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -9,6 +9,7 @@ #define pr_fmt(fmt) "KHO: " fmt #include +#include #include #include #include @@ -44,12 +45,307 @@ static int __init kho_parse_enable(char *p) } early_param("kho", kho_parse_enable); +/* + * Keep track of memory that is to be preserved across KHO. + * + * The serializing side uses two levels of xarrays to manage chunks of per-order + * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a + * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations + * each bitmap will cover 16M of address space. Thus, for 16G of memory at most + * 512K of bitmap memory will be needed for order 0. + * + * This approach is fully incremental, as the serialization progresses folios + * can continue be aggregated to the tracker. The final step, immediately prior + * to kexec would serialize the xarray information into a linked list for the + * successor kernel to parse. + */ + +#define PRESERVE_BITS (512 * 8) + +struct kho_mem_phys_bits { + DECLARE_BITMAP(preserve, PRESERVE_BITS); +}; + +struct kho_mem_phys { + /* + * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized + * to order. 
+ */ + struct xarray phys_bits; +}; + +struct kho_mem_track { + /* Points to kho_mem_phys, each order gets its own bitmap tree */ + struct xarray orders; +}; + +struct khoser_mem_chunk; + struct kho_serialization { struct page *fdt; struct list_head fdt_list; struct dentry *sub_fdt_dir; + struct kho_mem_track track; + /* First chunk of serialized preserved memory map */ + struct khoser_mem_chunk *preserved_mem_map; }; +static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) +{ + void *elm, *res; + + elm = xa_load(xa, index); + if (elm) + return elm; + + elm = kzalloc(sz, GFP_KERNEL); + if (!elm) + return ERR_PTR(-ENOMEM); + + res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); + if (xa_is_err(res)) + res = ERR_PTR(xa_err(res)); + + if (res) { + kfree(elm); + return res; + } + + return elm; +} + +static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn, + unsigned long end_pfn) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa; + + while (pfn < end_pfn) { + const unsigned int order = + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); + const unsigned long pfn_high = pfn >> order; + + physxa = xa_load(&track->orders, order); + if (!physxa) + continue; + + bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS); + if (!bits) + continue; + + clear_bit(pfn_high % PRESERVE_BITS, bits->preserve); + + pfn += 1 << order; + } +} + +static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, + unsigned int order) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa; + const unsigned long pfn_high = pfn >> order; + + might_sleep(); + + physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa)); + if (IS_ERR(physxa)) + return PTR_ERR(physxa); + + bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS, + sizeof(*bits)); + if (IS_ERR(bits)) + return PTR_ERR(bits); + + set_bit(pfn_high % PRESERVE_BITS, bits->preserve); + + return 0; +} + +/* almost as free_reserved_page(), just don't free the page */ +static void kho_restore_page(struct page *page) +{ + ClearPageReserved(page); + init_page_count(page); + adjust_managed_page_count(page, 1); +} + +/** + * kho_restore_folio - recreates the folio from the preserved memory. + * @phys: physical address of the folio. + * + * Return: pointer to the struct folio on success, NULL on failure. + */ +struct folio *kho_restore_folio(phys_addr_t phys) +{ + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); + unsigned long order; + + if (!page) + return NULL; + + order = page->private; + if (order) { + if (order > MAX_PAGE_ORDER) + return NULL; + + prep_compound_page(page, order); + } else { + kho_restore_page(page); + } + + return page_folio(page); +} +EXPORT_SYMBOL_GPL(kho_restore_folio); + +/* Serialize and deserialize struct kho_mem_phys across kexec + * + * Record all the bitmaps in a linked list of pages for the next kernel to + * process. Each chunk holds bitmaps of the same order and each block of bitmaps + * starts at a given physical address. This allows the bitmaps to be sparse. The + * xarray is used to store them in a tree while building up the data structure, + * but the KHO successor kernel only needs to process them once in order. + * + * All of this memory is normal kmalloc() memory and is not marked for + * preservation. The successor kernel will remain isolated to the scratch space + * until it completes processing this list. Once processed all the memory + * storing these ranges will be marked as free. 
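+ * The chunks themselves are page-sized and linked via physical addresses + * (KHOSER_STORE_PTR()/KHOSER_LOAD_PTR()), since physical addresses remain + * meaningful across kexec while virtual addresses do not.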
+ */ + +struct khoser_mem_bitmap_ptr { + phys_addr_t phys_start; + DECLARE_KHOSER_PTR(bitmap, struct kho_mem_phys_bits *); +}; + +struct khoser_mem_chunk_hdr { + DECLARE_KHOSER_PTR(next, struct khoser_mem_chunk *); + unsigned int order; + unsigned int num_elms; +}; + +#define KHOSER_BITMAP_SIZE \ + ((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \ + sizeof(struct khoser_mem_bitmap_ptr)) + +struct khoser_mem_chunk { + struct khoser_mem_chunk_hdr hdr; + struct khoser_mem_bitmap_ptr bitmaps[KHOSER_BITMAP_SIZE]; +}; + +static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE); + +static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, + unsigned long order) +{ + struct khoser_mem_chunk *chunk; + + chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!chunk) + return NULL; + chunk->hdr.order = order; + if (cur_chunk) + KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk); + return chunk; +} + +static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) +{ + struct khoser_mem_chunk *chunk = first_chunk; + + while (chunk) { + struct khoser_mem_chunk *tmp = chunk; + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + kfree(tmp); + } +} + +static int kho_mem_serialize(struct kho_serialization *ser) +{ + struct khoser_mem_chunk *first_chunk = NULL; + struct khoser_mem_chunk *chunk = NULL; + struct kho_mem_phys *physxa; + unsigned long order; + + xa_for_each(&ser->track.orders, order, physxa) { + struct kho_mem_phys_bits *bits; + unsigned long phys; + + chunk = new_chunk(chunk, order); + if (!chunk) + goto err_free; + + if (!first_chunk) + first_chunk = chunk; + + xa_for_each(&physxa->phys_bits, phys, bits) { + struct khoser_mem_bitmap_ptr *elm; + + if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { + chunk = new_chunk(chunk, order); + if (!chunk) + goto err_free; + } + + elm = &chunk->bitmaps[chunk->hdr.num_elms]; + chunk->hdr.num_elms++; + elm->phys_start = (phys * PRESERVE_BITS) + << (order + PAGE_SHIFT); + KHOSER_STORE_PTR(elm->bitmap, bits); + } + } + + ser->preserved_mem_map = first_chunk; + + return 0; + +err_free: + kho_mem_ser_free(first_chunk); + return -ENOMEM; +} + +static void deserialize_bitmap(unsigned int order, + struct khoser_mem_bitmap_ptr *elm) +{ + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); + unsigned long bit; + + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { + int sz = 1 << (order + PAGE_SHIFT); + phys_addr_t phys = + elm->phys_start + (bit << (order + PAGE_SHIFT)); + struct page *page = phys_to_page(phys); + + memblock_reserve(phys, sz); + memblock_reserved_mark_noinit(phys, sz); + page->private = order; + } +} + +static void __init kho_mem_deserialize(const void *fdt) +{ + struct khoser_mem_chunk *chunk; + const phys_addr_t *mem; + int len; + + mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); + + if (!mem || len != sizeof(*mem)) { + pr_err("failed to get preserved memory bitmaps\n"); + return; + } + + chunk = *mem ? phys_to_virt(*mem) : NULL; + while (chunk) { + unsigned int i; + + for (i = 0; i != chunk->hdr.num_elms; i++) + deserialize_bitmap(chunk->hdr.order, + &chunk->bitmaps[i]); + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + } +} + /* * With KHO enabled, memory can become fragmented because KHO regions may * be anywhere in physical address space. 
The scratch regions give us a @@ -324,6 +620,9 @@ static struct kho_out kho_out = { .lock = __MUTEX_INITIALIZER(kho_out.lock), .ser = { .fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list), + .track = { + .orders = XARRAY_INIT(kho_out.ser.track.orders, 0), + }, }, .finalized = false, }; @@ -340,6 +639,73 @@ int unregister_kho_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_kho_notifier); +/** + * kho_preserve_folio - preserve a folio across kexec. + * @folio: folio to preserve. + * + * Instructs KHO to preserve the whole folio across kexec. The order + * will be preserved as well. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_folio(struct folio *folio) +{ + const unsigned long pfn = folio_pfn(folio); + const unsigned int order = folio_order(folio); + struct kho_mem_track *track = &kho_out.ser.track; + + if (kho_out.finalized) + return -EBUSY; + + return __kho_preserve_order(track, pfn, order); +} +EXPORT_SYMBOL_GPL(kho_preserve_folio); + +/** + * kho_preserve_phys - preserve a physically contiguous range across kexec. + * @phys: physical address of the range. + * @size: size of the range. + * + * Instructs KHO to preserve the memory range from @phys to @phys + @size + * across kexec. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + unsigned long pfn = PHYS_PFN(phys); + unsigned long failed_pfn = 0; + const unsigned long start_pfn = pfn; + const unsigned long end_pfn = PHYS_PFN(phys + size); + int err = 0; + struct kho_mem_track *track = &kho_out.ser.track; + + if (kho_out.finalized) + return -EBUSY; + + if (!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size)) + return -EINVAL; + + while (pfn < end_pfn) { + const unsigned int order = + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); + + err = __kho_preserve_order(track, pfn, order); + if (err) { + failed_pfn = pfn; + break; + } + + pfn += 1 << order; + } + + if (err) + __kho_unpreserve(track, start_pfn, failed_pfn); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_phys); + /* Handling for debug/kho/out */ static struct dentry *debugfs_root; @@ -366,6 +732,25 @@ static int kho_out_update_debugfs_fdt(void) static int kho_abort(void) { int err; + unsigned long order; + struct kho_mem_phys *physxa; + + xa_for_each(&kho_out.ser.track.orders, order, physxa) { + struct kho_mem_phys_bits *bits; + unsigned long phys; + + xa_for_each(&physxa->phys_bits, phys, bits) + kfree(bits); + + xa_destroy(&physxa->phys_bits); + kfree(physxa); + } + xa_destroy(&kho_out.ser.track.orders); + + if (kho_out.ser.preserved_mem_map) { + kho_mem_ser_free(kho_out.ser.preserved_mem_map); + kho_out.ser.preserved_mem_map = NULL; + } err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT, NULL); @@ -380,12 +765,25 @@ static int kho_abort(void) static int kho_finalize(void) { int err = 0; + u64 *preserved_mem_map; void *fdt = page_to_virt(kho_out.ser.fdt); err |= fdt_create(fdt, PAGE_SIZE); err |= fdt_finish_reservemap(fdt); err |= fdt_begin_node(fdt, ""); err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE); + /** + * Reserve the preserved-memory-map property in the root FDT, so + * that all property definitions will precede subnodes created by + * KHO callers. 
+ */ + err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP, + sizeof(*preserved_mem_map), + (void **)&preserved_mem_map); + if (err) + goto abort; + + err = kho_preserve_folio(page_folio(kho_out.ser.fdt)); if (err) goto abort; @@ -395,6 +793,12 @@ static int kho_finalize(void) if (err) goto abort; + err = kho_mem_serialize(&kho_out.ser); + if (err) + goto abort; + + *preserved_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map); + err |= fdt_end_node(fdt); err |= fdt_finish(fdt); @@ -697,9 +1101,16 @@ static void __init kho_release_scratch(void) void __init kho_memory_init(void) { + struct folio *folio; + if (kho_in.scratch_phys) { kho_scratch = phys_to_virt(kho_in.scratch_phys); kho_release_scratch(); + + kho_mem_deserialize(kho_get_fdt()); + folio = kho_restore_folio(kho_in.fdt_phys); + if (!folio) + pr_warn("failed to restore folio for KHO fdt\n"); } else { kho_reserve_scratch(); } From 3bdecc3c93f9f68d11ed54971dde169b6ead9d78 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:25 -0700 Subject: [PATCH 182/285] kexec: add KHO support to kexec file loads Kexec has 2 modes: A user space driven mode and a kernel driven mode. For the kernel driven mode, kernel code determines the physical addresses of all target buffers that the payload gets copied into. With KHO, we can only safely copy payloads into the "scratch area". Teach the kexec file loader about it, so it only allocates for that area. In addition, enlighten it with support to ask the KHO subsystem for its respective payloads to copy into target memory. Also teach the KHO subsystem how to fill the images for file loads. Link: https://lkml.kernel.org/r/20250509074635.3187114-8-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/kexec.h | 5 +++ kernel/kexec_file.c | 13 ++++++++ kernel/kexec_handover.c | 67 +++++++++++++++++++++++++++++++++++++++++ kernel/kexec_internal.h | 16 ++++++++++ 4 files changed, 101 insertions(+) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index c8971861521a..075255de8154 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -371,6 +371,11 @@ struct kimage { size_t ima_buffer_size; #endif + struct { + struct kexec_segment *scratch; + phys_addr_t fdt; + } kho; + /* Core ELF header buffer */ void *elf_headers; unsigned long elf_headers_sz; diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index fba686487e3b..77758c533122 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -253,6 +253,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd, /* IMA needs to pass the measurement list to the next kernel. 
*/ ima_add_kexec_buffer(image); + /* If KHO is active, add its images to the list */ + ret = kho_fill_kimage(image); + if (ret) + goto out; + /* Call image load handler */ ldata = kexec_image_load_default(image); @@ -648,6 +653,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) return 0; + /* + * If KHO is active, only use KHO scratch memory. All other memory + * could potentially be handed over. + */ + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); + if (ret <= 0) + return ret; + if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); else diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 9cc818cefd15..69b953551677 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -26,6 +26,7 @@ * internal APIs. */ #include "../mm/internal.h" +#include "kexec_internal.h" #define KHO_FDT_COMPATIBLE "kho-v1" #define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map" @@ -1197,3 +1198,69 @@ out: if (err) pr_warn("disabling KHO revival: %d\n", err); } + +/* Helper functions for kexec_file_load */ + +int kho_fill_kimage(struct kimage *image) +{ + ssize_t scratch_size; + int err = 0; + struct kexec_buf scratch; + + if (!kho_enable) + return 0; + + image->kho.fdt = page_to_phys(kho_out.ser.fdt); + + scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt; + scratch = (struct kexec_buf){ + .image = image, + .buffer = kho_scratch, + .bufsz = scratch_size, + .mem = KEXEC_BUF_MEM_UNKNOWN, + .memsz = scratch_size, + .buf_align = SZ_64K, /* Makes it easier to map */ + .buf_max = ULONG_MAX, + .top_down = true, + }; + err = kexec_add_buffer(&scratch); + if (err) + return err; + image->kho.scratch = &image->segment[image->nr_segments - 1]; + + return 0; +} + +static int kho_walk_scratch(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + int ret = 0; + int i; + + for (i = 0; i < kho_scratch_cnt; i++) { + struct resource res = { + .start = kho_scratch[i].addr, + .end = kho_scratch[i].addr + kho_scratch[i].size - 1, + }; + + /* Try to fit the kimage into our KHO scratch region */ + ret = func(&res, kbuf); + if (ret) + break; + } + + return ret; +} + +int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + int ret; + + if (!kho_enable || kbuf->image->type == KEXEC_TYPE_CRASH) + return 1; + + ret = kho_walk_scratch(kbuf, func); + + return ret == 1 ? 0 : -EADDRNOTAVAIL; +} diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h index d35d9792402d..30a733a55a67 100644 --- a/kernel/kexec_internal.h +++ b/kernel/kexec_internal.h @@ -39,4 +39,20 @@ extern size_t kexec_purgatory_size; #else /* CONFIG_KEXEC_FILE */ static inline void kimage_file_post_load_cleanup(struct kimage *image) { } #endif /* CONFIG_KEXEC_FILE */ + +struct kexec_buf; + +#ifdef CONFIG_KEXEC_HANDOVER +int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)); +int kho_fill_kimage(struct kimage *image); +#else +static inline int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + return 1; +} + +static inline int kho_fill_kimage(struct kimage *image) { return 0; } +#endif /* CONFIG_KEXEC_HANDOVER */ #endif /* LINUX_KEXEC_INTERNAL_H */ From 4e1d010e3bda2e0e4147e26490dbb1989ef65fc1 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:26 -0700 Subject: [PATCH 183/285] kexec: add config option for KHO We have all generic code in place now to support Kexec with KHO. 
This patch adds a config option that depends on architecture support to enable KHO support. Link: https://lkml.kernel.org/r/20250509074635.3187114-9-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- kernel/Kconfig.kexec | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec index 4d111f871951..4fa212909d69 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -95,6 +95,20 @@ config KEXEC_JUMP Jump between original kernel and kexeced kernel and invoke code in physical address mode via KEXEC +config KEXEC_HANDOVER + bool "kexec handover" + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE + select MEMBLOCK_KHO_SCRATCH + select KEXEC_FILE + select DEBUG_FS + select LIBFDT + select CMA + help + Allow kexec to hand over state across kernels by generating and + passing additional metadata to the target kernel. This is useful + to keep data or state alive across the kexec. For this to work, + both source and target kernels need to have this option enabled. + config CRASH_DUMP bool "kernel crash dumps" default ARCH_DEFAULT_CRASH_DUMP From 274cdcb1c004c455451b1ca6fb5576f474f9eba0 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:27 -0700 Subject: [PATCH 184/285] arm64: add KHO support We now have all bits in place to support KHO kexecs. Add awareness of KHO in the kexec file as well as boot path for arm64 and adds the respective kconfig option to the architecture so that it can use KHO successfully. Changes to the "chosen" node have been sent to https://github.com/devicetree-org/dt-schema/pull/158. Link: https://lkml.kernel.org/r/20250509074635.3187114-10-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/arm64/Kconfig | 3 +++ drivers/of/fdt.c | 34 ++++++++++++++++++++++++++++++++++ drivers/of/kexec.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 79 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a182295e6f08..34c79f4fee3f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1602,6 +1602,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG def_bool y +config ARCH_SUPPORTS_KEXEC_HANDOVER + def_bool y + config ARCH_SUPPORTS_CRASH_DUMP def_bool y diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index aedd0e2dcd89..0edd639898a6 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -25,6 +25,7 @@ #include #include #include +#include #include /* for COMMAND_LINE_SIZE */ #include @@ -875,6 +876,36 @@ void __init early_init_dt_check_for_usable_mem_range(void) memblock_add(rgn[i].base, rgn[i].size); } +/** + * early_init_dt_check_kho - Decode info required for kexec handover from DT + */ +static void __init early_init_dt_check_kho(void) +{ + unsigned long node = chosen_node_offset; + u64 fdt_start, fdt_size, scratch_start, scratch_size; + const __be32 *p; + int l; + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0) + return; + + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l); + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32)) + return; + + fdt_start = dt_mem_next_cell(dt_root_addr_cells, &p); + fdt_size = dt_mem_next_cell(dt_root_addr_cells, &p); + + p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l); + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32)) + return; + + scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p); + scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p); + + kho_populate(fdt_start, fdt_size, scratch_start, scratch_size); +} + #ifdef CONFIG_SERIAL_EARLYCON int __init early_init_dt_scan_chosen_stdout(void) @@ -1169,6 +1200,9 @@ void __init early_init_dt_scan_nodes(void) /* Handle linux,usable-memory-range property */ early_init_dt_check_for_usable_mem_range(); + + /* Handle kexec handover */ + early_init_dt_check_kho(); } bool __init early_init_dt_scan(void *dt_virt, phys_addr_t dt_phys) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 5b924597a4de..1ee2d31816ae 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -264,6 +264,43 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt, } #endif /* CONFIG_IMA_KEXEC */ +static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node) +{ + int ret = 0; +#ifdef CONFIG_KEXEC_HANDOVER + phys_addr_t fdt_mem = 0; + phys_addr_t fdt_len = 0; + phys_addr_t scratch_mem = 0; + phys_addr_t scratch_len = 0; + + ret = fdt_delprop(fdt, chosen_node, "linux,kho-fdt"); + if (ret && ret != -FDT_ERR_NOTFOUND) + return ret; + ret = fdt_delprop(fdt, chosen_node, "linux,kho-scratch"); + if (ret && ret != -FDT_ERR_NOTFOUND) + return ret; + + if (!image->kho.fdt || !image->kho.scratch) + return 0; + + fdt_mem = image->kho.fdt; + fdt_len = PAGE_SIZE; + scratch_mem = image->kho.scratch->mem; + scratch_len = image->kho.scratch->bufsz; + + pr_debug("Adding kho metadata to DT"); + + ret 
= fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-fdt", + fdt_mem, fdt_len); + if (ret) + return ret; + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch", + scratch_mem, scratch_len); + +#endif /* CONFIG_KEXEC_HANDOVER */ + return ret; +} + /* * of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree * @@ -414,6 +451,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image, #endif } + /* Add kho metadata if this is a KHO image */ + ret = kho_add_chosen(image, fdt, chosen_node); + if (ret) + goto out; + /* add bootargs */ if (cmdline) { ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline); From 96383f1fb876c87763c163f3e7656b105cd8b643 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Fri, 9 May 2025 00:46:28 -0700 Subject: [PATCH 185/285] x86/setup: use memblock_reserve_kern for memory used by kernel memblock_reserve() does not distinguish memory used by firmware from memory used by kernel. The distinction is nice to have for accounting of early memory allocations and reservations, but it is essential for kexec handover (kho) to know how much memory kernel consumes during boot. Use memblock_reserve_kern() to reserve kernel memory, such as kernel image, initrd and setup data. Link: https://lkml.kernel.org/r/20250509074635.3187114-11-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Changyuan Lyu Acked-by: Dave Hansen Cc: Alexander Graf Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/kernel/setup.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 9d2a13b37833..766176c4f5ee 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -286,8 +286,8 @@ static void __init cleanup_highmap(void) static void __init reserve_brk(void) { if (_brk_end > _brk_start) - memblock_reserve(__pa_symbol(_brk_start), - _brk_end - _brk_start); + memblock_reserve_kern(__pa_symbol(_brk_start), + _brk_end - _brk_start); /* Mark brk area as locked down and no longer taking any new allocations */ @@ -360,7 +360,7 @@ static void __init early_reserve_initrd(void) !ramdisk_image || !ramdisk_size) return; /* No initrd provided by bootloader */ - memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image); + memblock_reserve_kern(ramdisk_image, ramdisk_end - ramdisk_image); } static void __init reserve_initrd(void) @@ -413,7 +413,7 @@ static void __init add_early_ima_buffer(u64 phys_addr) } if (data->size) { - memblock_reserve(data->addr, data->size); + memblock_reserve_kern(data->addr, data->size); ima_kexec_buffer_phys = data->addr; ima_kexec_buffer_size = data->size; } @@ -553,7 +553,7 @@ static void __init memblock_x86_reserve_range_setup_data(void) len = sizeof(*data); pa_next = data->next; - memblock_reserve(pa_data, sizeof(*data) + data->len); + memblock_reserve_kern(pa_data, sizeof(*data) + data->len); if (data->type == SETUP_INDIRECT) { len += data->len; @@ -567,7 +567,7 @@ static void __init 
memblock_x86_reserve_range_setup_data(void) indirect = (struct setup_indirect *)data->data; if (indirect->type != SETUP_INDIRECT) - memblock_reserve(indirect->addr, indirect->len); + memblock_reserve_kern(indirect->addr, indirect->len); } pa_data = pa_next; @@ -770,8 +770,8 @@ static void __init early_reserve_memory(void) * __end_of_kernel_reserve symbol must be explicitly reserved with a * separate memblock_reserve() or they will be discarded. */ - memblock_reserve(__pa_symbol(_text), - (unsigned long)__end_of_kernel_reserve - (unsigned long)_text); + memblock_reserve_kern(__pa_symbol(_text), + (unsigned long)__end_of_kernel_reserve - (unsigned long)_text); /* * The first 4Kb of memory is a BIOS owned area, but generally it is From 65a5d7278545b5cac3ca0a5b6a1e9a4ea1554181 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:29 -0700 Subject: [PATCH 186/285] x86/kexec: add support for passing kexec handover (KHO) data kexec handover (KHO) creates a metadata that the kernels pass between each other during kexec. This metadata is stored in memory and kexec image contains a (physical) pointer to that memory. In addition, KHO keeps "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed to not have any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and sets all handed over memory as in use. When subsystems that support KHO initialize, they introspect the KHO metadata, restore preserved memory regions, and retrieve their state stored in the preserved memory. Enlighten x86 kexec-file and boot path about the KHO metadata and make sure it gets passed along to the next kernel. Link: https://lkml.kernel.org/r/20250509074635.3187114-12-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Acked-by: Dave Hansen Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: David Woodhouse Cc: Eric Biederman Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/setup.h | 2 ++ arch/x86/include/uapi/asm/setup_data.h | 13 ++++++++- arch/x86/kernel/kexec-bzimage64.c | 37 ++++++++++++++++++++++++++ arch/x86/kernel/setup.c | 26 ++++++++++++++++++ 4 files changed, 77 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h index ad9212df0ec0..3b37571911f4 100644 --- a/arch/x86/include/asm/setup.h +++ b/arch/x86/include/asm/setup.h @@ -67,6 +67,8 @@ extern void x86_ce4100_early_setup(void); static inline void x86_ce4100_early_setup(void) { } #endif +#include + #ifndef _SETUP #include diff --git a/arch/x86/include/uapi/asm/setup_data.h b/arch/x86/include/uapi/asm/setup_data.h index 50c45ead4e7c..2671c4e1b3a0 100644 --- a/arch/x86/include/uapi/asm/setup_data.h +++ b/arch/x86/include/uapi/asm/setup_data.h @@ -13,7 +13,8 @@ #define SETUP_CC_BLOB 7 #define SETUP_IMA 8 #define SETUP_RNG_SEED 9 -#define SETUP_ENUM_MAX SETUP_RNG_SEED +#define SETUP_KEXEC_KHO 10 +#define SETUP_ENUM_MAX SETUP_KEXEC_KHO #define SETUP_INDIRECT (1<<31) #define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT) @@ -78,6 +79,16 @@ struct ima_setup_data { __u64 size; } __attribute__((packed)); +/* + * Locations of kexec handover metadata + */ +struct kho_data { + __u64 fdt_addr; + __u64 fdt_size; + __u64 scratch_addr; + __u64 scratch_size; +} __attribute__((packed)); + #endif /* __ASSEMBLER__ */ #endif /* _UAPI_ASM_X86_SETUP_DATA_H */ diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index 68530fad05f7..dad174e3bed0 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -233,6 +233,32 @@ setup_ima_state(const struct kimage *image, struct boot_params *params, #endif /* CONFIG_IMA_KEXEC */ } +static void setup_kho(const struct kimage *image, struct boot_params *params, + unsigned long params_load_addr, + unsigned int setup_data_offset) +{ + struct setup_data *sd = (void *)params + setup_data_offset; + struct kho_data *kho = (void *)sd + sizeof(*sd); + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + return; + + sd->type = SETUP_KEXEC_KHO; + sd->len = sizeof(struct kho_data); + + /* Only add if we have all KHO images in place */ + if (!image->kho.fdt || !image->kho.scratch) + return; + + /* Add setup data */ + kho->fdt_addr = image->kho.fdt; + kho->fdt_size = PAGE_SIZE; + kho->scratch_addr = image->kho.scratch->mem; + kho->scratch_size = image->kho.scratch->bufsz; + sd->next = params->hdr.setup_data; + params->hdr.setup_data = params_load_addr + setup_data_offset; +} + static int setup_boot_parameters(struct kimage *image, struct boot_params *params, unsigned long params_load_addr, @@ -312,6 +338,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params, sizeof(struct ima_setup_data); } + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) { + /* Setup space to store preservation metadata */ + setup_kho(image, params, params_load_addr, setup_data_offset); + setup_data_offset += sizeof(struct setup_data) + + sizeof(struct kho_data); + } + /* Setup RNG seed */ setup_rng_seed(params, params_load_addr, setup_data_offset); @@ -479,6 +512,10 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel, kbuf.bufsz += sizeof(struct setup_data) + sizeof(struct ima_setup_data); + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + kbuf.bufsz += sizeof(struct setup_data) + + sizeof(struct kho_data); + params = kzalloc(kbuf.bufsz, GFP_KERNEL); if (!params) return ERR_PTR(-ENOMEM); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 766176c4f5ee..664cd21b8532 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -451,6 +451,29 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size) } #endif +static void __init add_kho(u64 phys_addr, u32 data_len) +{ + struct kho_data *kho; + u64 addr = phys_addr + sizeof(struct setup_data); + u64 size = data_len - sizeof(struct setup_data); + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) { + pr_warn("Passed KHO data, but CONFIG_KEXEC_HANDOVER not set. Ignoring.\n"); + return; + } + + kho = early_memremap(addr, size); + if (!kho) { + pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n", + addr, size); + return; + } + + kho_populate(kho->fdt_addr, kho->fdt_size, kho->scratch_addr, kho->scratch_size); + + early_memunmap(kho, size); +} + static void __init parse_setup_data(void) { struct setup_data *data; @@ -479,6 +502,9 @@ static void __init parse_setup_data(void) case SETUP_IMA: add_early_ima_buffer(pa_data); break; + case SETUP_KEXEC_KHO: + add_kho(pa_data, data_len); + break; case SETUP_RNG_SEED: data = early_memremap(pa_data, data_len); add_bootloader_randomness(data->data, data->len); From a2daf83e10378ff4ef61f75da710cac9b84e3eaa Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:30 -0700 Subject: [PATCH 187/285] x86/e820: temporarily enable KHO scratch for memory below 1M KHO kernels are special and use only scratch memory for memblock allocations, but memory below 1M is ignored by kernel after early boot and cannot be naturally marked as scratch. To allow allocation of the real-mode trampoline and a few (if any) other very early allocations from below 1M forcibly mark the memory below 1M as scratch. After real mode trampoline is allocated, clear that scratch marking. Link: https://lkml.kernel.org/r/20250509074635.3187114-13-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Acked-by: Dave Hansen Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/kernel/e820.c | 18 ++++++++++++++++++ arch/x86/realmode/init.c | 2 ++ 2 files changed, 20 insertions(+) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 9920122018a0..c3acbd26408b 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) memblock_add(entry->addr, entry->size); } + /* + * At this point memblock is only allowed to allocate from memory + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set + * up in init_mem_mapping(). 
+ * + * KHO kernels are special and use only scratch memory for memblock + * allocations, but memory below 1M is ignored by kernel after early + * boot and cannot be naturally marked as scratch. + * + * To allow allocation of the real-mode trampoline and a few (if any) + * other very early allocations from below 1M forcibly mark the memory + * below 1M as scratch. + * + * After real mode trampoline is allocated, we clear that scratch + * marking. + */ + memblock_mark_kho_scratch(0, SZ_1M); + /* * 32-bit systems are limited to 4BG of memory even with HIGHMEM and * to even less without it. diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index f9bc444a3064..9b9f4534086d 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) * setup_arch(). */ memblock_reserve(0, SZ_1M); + + memblock_clear_kho_scratch(0, SZ_1M); } static void __init sme_sev_setup_real_mode(struct trampoline_header *th) From a8ebb70447f840ecf3157ec7d6e1393616df0c1e Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:31 -0700 Subject: [PATCH 188/285] x86/boot: make sure KASLR does not step over KHO preserved memory During kexec handover (KHO) memory contains data that should be preserved and this data would be consumed by kexec'ed kernel. To make sure that the preserved memory is not overwritten, KHO uses "scratch regions" to bootstrap kexec'ed kernel. These regions are guaranteed to not have any memory that KHO would preserve and are used as the only memory the kernel sees during the early boot. The scratch regions are passed in the setup_data by the first kernel with other KHO parameters. If the setup_data contains the KHO parameters, limit randomization to scratch areas only to make sure preserved memory won't get overwritten. Since all the pointers in setup_data are represented by u64, they require double casting (first to unsigned long and then to the actual pointer type) to compile on 32-bits. This looks goofy out of context, but it is unfortunately the way that this is handled across the tree. There are at least a dozen instances of casting like this. Link: https://lkml.kernel.org/r/20250509074635.3187114-14-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/boot/compressed/kaslr.c | 50 +++++++++++++++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index f03d59ea6e40..3b0948ad449f 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -760,6 +760,49 @@ static void process_e820_entries(unsigned long minimum, } } +/* + * If KHO is active, only process its scratch areas to ensure we are not + * stepping onto preserved memory. 
+ */ +static bool process_kho_entries(unsigned long minimum, unsigned long image_size) +{ + struct kho_scratch *kho_scratch; + struct setup_data *ptr; + struct kho_data *kho; + int i, nr_areas = 0; + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + return false; + + ptr = (struct setup_data *)(unsigned long)boot_params_ptr->hdr.setup_data; + while (ptr) { + if (ptr->type == SETUP_KEXEC_KHO) { + kho = (struct kho_data *)(unsigned long)ptr->data; + kho_scratch = (void *)(unsigned long)kho->scratch_addr; + nr_areas = kho->scratch_size / sizeof(*kho_scratch); + break; + } + + ptr = (struct setup_data *)(unsigned long)ptr->next; + } + + if (!nr_areas) + return false; + + for (i = 0; i < nr_areas; i++) { + struct kho_scratch *area = &kho_scratch[i]; + struct mem_vector region = { + .start = area->addr, + .size = area->size, + }; + + if (process_mem_region(®ion, minimum, image_size)) + break; + } + + return true; +} + static unsigned long find_random_phys_addr(unsigned long minimum, unsigned long image_size) { @@ -775,7 +818,12 @@ static unsigned long find_random_phys_addr(unsigned long minimum, return 0; } - if (!process_efi_entries(minimum, image_size)) + /* + * During kexec handover only process KHO scratch areas that are known + * not to contain any data that must be preserved. + */ + if (!process_kho_entries(minimum, image_size) && + !process_efi_entries(minimum, image_size)) process_e820_entries(minimum, image_size); phys_addr = slots_fetch_random(); From 2b082d6f6200a386ef6229f4319c0d95c120a840 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:32 -0700 Subject: [PATCH 189/285] x86/Kconfig: enable kexec handover for 64 bits Add ARCH_SUPPORTS_KEXEC_HANDOVER for 64 bits to allow enabling of KEXEC_HANDOVER configuration option. Link: https://lkml.kernel.org/r/20250509074635.3187114-15-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5873c9e39919..055204dc211d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2029,6 +2029,9 @@ config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG config ARCH_SUPPORTS_KEXEC_JUMP def_bool y +config ARCH_SUPPORTS_KEXEC_HANDOVER + def_bool X86_64 + config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) From f99230780211a4534b50204c6613852b54ae0e15 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:33 -0700 Subject: [PATCH 190/285] memblock: add KHO support for reserve_mem Linux has recently gained support for "reserve_mem": A mechanism to allocate a region of memory early enough in boot that we can cross our fingers and hope it stays at the same location during most boots, so we can store for example ftrace buffers into it. 
Thanks to KASLR, we can never be really sure that "reserve_mem" allocations are static across kexec. Let's teach it KHO awareness so that it serializes its reservations on kexec exit and deserializes them again on boot, preserving the exact same mapping across kexec. This is an example user for KHO in the KHO patch set to ensure we have at least one (not very controversial) user in the tree before extending KHO's use to more subsystems. Link: https://lkml.kernel.org/r/20250509074635.3187114-16-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/memblock.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 193 insertions(+) diff --git a/mm/memblock.c b/mm/memblock.c index 8895b95ffb5b..154f1d73b61f 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -18,6 +18,11 @@ #include #include +#ifdef CONFIG_KEXEC_HANDOVER +#include +#include +#endif /* CONFIG_KEXEC_HANDOVER */ + #include #include @@ -2492,6 +2497,189 @@ int reserve_mem_release_by_name(const char *name) return 1; } +#ifdef CONFIG_KEXEC_HANDOVER +#define MEMBLOCK_KHO_FDT "memblock" +#define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1" +#define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1" +static struct page *kho_fdt; + +static int reserve_mem_kho_finalize(struct kho_serialization *ser) +{ + int err = 0, i; + + for (i = 0; i < reserved_mem_count; i++) { + struct reserve_mem_table *map = &reserved_mem_table[i]; + + err |= kho_preserve_phys(map->start, map->size); + } + + err |= kho_preserve_folio(page_folio(kho_fdt)); + err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt)); + + return notifier_from_errno(err); +} + +static int reserve_mem_kho_notifier(struct notifier_block *self, + unsigned long cmd, void *v) +{ + switch (cmd) { + case KEXEC_KHO_FINALIZE: + return reserve_mem_kho_finalize((struct kho_serialization *)v); + case KEXEC_KHO_ABORT: + return NOTIFY_DONE; + default: + return NOTIFY_BAD; + } +} + +static struct notifier_block reserve_mem_kho_nb = { + .notifier_call = reserve_mem_kho_notifier, +}; + +static int __init prepare_kho_fdt(void) +{ + int err = 0, i; + void *fdt; + + kho_fdt = alloc_page(GFP_KERNEL); + if (!kho_fdt) + return -ENOMEM; + + fdt = page_to_virt(kho_fdt); + + err |= fdt_create(fdt, PAGE_SIZE); + err |= fdt_finish_reservemap(fdt); + + err |= fdt_begin_node(fdt, ""); + err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE); + for (i = 0; i < reserved_mem_count; i++) { + struct reserve_mem_table *map = &reserved_mem_table[i]; + + err |= fdt_begin_node(fdt, map->name); + err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE); + err |= fdt_property(fdt, "start", &map->start, sizeof(map->start)); + err |= fdt_property(fdt, "size", &map->size, sizeof(map->size)); + err |= fdt_end_node(fdt); + } + err |= fdt_end_node(fdt); + + err 
|= fdt_finish(fdt); + + if (err) { + pr_err("failed to prepare memblock FDT for KHO: %d\n", err); + put_page(kho_fdt); + kho_fdt = NULL; + } + + return err; +} + +static int __init reserve_mem_init(void) +{ + int err; + + if (!kho_is_enabled() || !reserved_mem_count) + return 0; + + err = prepare_kho_fdt(); + if (err) + return err; + + err = register_kho_notifier(&reserve_mem_kho_nb); + if (err) { + put_page(kho_fdt); + kho_fdt = NULL; + } + + return err; +} +late_initcall(reserve_mem_init); + +static void *__init reserve_mem_kho_retrieve_fdt(void) +{ + phys_addr_t fdt_phys; + static void *fdt; + int err; + + if (fdt) + return fdt; + + err = kho_retrieve_subtree(MEMBLOCK_KHO_FDT, &fdt_phys); + if (err) { + if (err != -ENOENT) + pr_warn("failed to retrieve FDT '%s' from KHO: %d\n", + MEMBLOCK_KHO_FDT, err); + return NULL; + } + + fdt = phys_to_virt(fdt_phys); + + err = fdt_node_check_compatible(fdt, 0, MEMBLOCK_KHO_NODE_COMPATIBLE); + if (err) { + pr_warn("FDT '%s' is incompatible with '%s': %d\n", + MEMBLOCK_KHO_FDT, MEMBLOCK_KHO_NODE_COMPATIBLE, err); + fdt = NULL; + } + + return fdt; +} + +static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size, + phys_addr_t align) +{ + int err, len_start, len_size, offset; + const phys_addr_t *p_start, *p_size; + const void *fdt; + + fdt = reserve_mem_kho_retrieve_fdt(); + if (!fdt) + return false; + + offset = fdt_subnode_offset(fdt, 0, name); + if (offset < 0) { + pr_warn("FDT '%s' has no child '%s': %d\n", + MEMBLOCK_KHO_FDT, name, offset); + return false; + } + err = fdt_node_check_compatible(fdt, offset, RESERVE_MEM_KHO_NODE_COMPATIBLE); + if (err) { + pr_warn("Node '%s' is incompatible with '%s': %d\n", + name, RESERVE_MEM_KHO_NODE_COMPATIBLE, err); + return false; + } + + p_start = fdt_getprop(fdt, offset, "start", &len_start); + p_size = fdt_getprop(fdt, offset, "size", &len_size); + if (!p_start || len_start != sizeof(*p_start) || !p_size || + len_size != sizeof(*p_size)) { + return false; + } + + if (*p_start & (align - 1)) { + pr_warn("KHO reserve-mem '%s' has wrong alignment (0x%lx, 0x%lx)\n", + name, (long)align, (long)*p_start); + return false; + } + + if (*p_size != size) { + pr_warn("KHO reserve-mem '%s' has wrong size (0x%lx != 0x%lx)\n", + name, (long)*p_size, (long)size); + return false; + } + + reserved_mem_add(*p_start, size, name); + pr_info("Revived memory reservation '%s' from KHO\n", name); + + return true; +} +#else +static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size, + phys_addr_t align) +{ + return false; +} +#endif /* CONFIG_KEXEC_HANDOVER */ + /* * Parse reserve_mem=nn:align:name */ @@ -2547,6 +2735,11 @@ static int __init reserve_mem(char *p) if (reserve_mem_find_by_name(name, &start, &tmp)) return -EBUSY; + /* Pick previous allocations up from KHO if available */ + if (reserve_mem_kho_revive(name, size, align)) + return 1; + + /* TODO: Allocation must be outside of scratch region */ start = memblock_phys_alloc(size, align); if (!start) return -ENOMEM; From 3498209ff64ea72e7c15f96274427250f9ad9c97 Mon Sep 17 00:00:00 2001 From: Alexander Graf Date: Fri, 9 May 2025 00:46:34 -0700 Subject: [PATCH 191/285] Documentation: add documentation for KHO With KHO in place, let's add documentation that describes what it is and how to use it. 
Link: https://lkml.kernel.org/r/20250509074635.3187114-17-changyuanl@google.com Signed-off-by: Alexander Graf Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Changyuan Lyu Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- .../admin-guide/kernel-parameters.txt | 25 ++++ Documentation/admin-guide/mm/index.rst | 1 + Documentation/admin-guide/mm/kho.rst | 115 ++++++++++++++++++ Documentation/core-api/index.rst | 1 + Documentation/core-api/kho/bindings/kho.yaml | 43 +++++++ .../core-api/kho/bindings/sub-fdt.yaml | 27 ++++ Documentation/core-api/kho/concepts.rst | 74 +++++++++++ Documentation/core-api/kho/fdt.rst | 80 ++++++++++++ Documentation/core-api/kho/index.rst | 13 ++ MAINTAINERS | 2 + 10 files changed, 381 insertions(+) create mode 100644 Documentation/admin-guide/mm/kho.rst create mode 100644 Documentation/core-api/kho/bindings/kho.yaml create mode 100644 Documentation/core-api/kho/bindings/sub-fdt.yaml create mode 100644 Documentation/core-api/kho/concepts.rst create mode 100644 Documentation/core-api/kho/fdt.rst create mode 100644 Documentation/core-api/kho/index.rst diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index d9fd26b95b34..54cb1d46e41f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2725,6 +2725,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..2d2f6c222308 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -42,3 +42,4 @@ the Linux memory management. 
transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts `. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase ` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. 
Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index e9789bd381d8..7a4ca18ca6e2 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -115,6 +115,7 @@ more memory-management documentation in Documentation/mm/index.rst. pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml new file mode 100644 index 000000000000..11e8ab7b219d --- /dev/null +++ b/Documentation/core-api/kho/bindings/kho.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Kexec HandOver (KHO) root tree + +maintainers: + - Mike Rapoport + - Changyuan Lyu + +description: | + System memory preserved by KHO across kexec. + +properties: + compatible: + enum: + - kho-v1 + + preserved-memory-map: + description: | + physical address (u64) of an in-memory structure describing all preserved + folios and memory ranges. + +patternProperties: + "$[0-9a-f_]+^": + $ref: sub-fdt.yaml# + description: physical address of a KHO user's own FDT. + +required: + - compatible + - preserved-memory-map + +additionalProperties: false + +examples: + - | + kho { + compatible = "kho-v1"; + preserved-memory-map = <0xf0be16 0x1000000>; + + memblock { + fdt = <0x80cc16 0x1000000>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml new file mode 100644 index 000000000000..b9a3d2d24850 --- /dev/null +++ b/Documentation/core-api/kho/bindings/sub-fdt.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: KHO users' FDT address + +maintainers: + - Mike Rapoport + - Changyuan Lyu + +description: | + Physical address of an FDT blob registered by a KHO user. + +properties: + fdt: + description: | + physical address (u64) of an FDT blob. + +required: + - fdt + +additionalProperties: false + +examples: + - | + memblock { + fdt = <0x80cc16 0x1000000>; + }; diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst new file mode 100644 index 000000000000..36d5c05cfb30 --- /dev/null +++ b/Documentation/core-api/kho/concepts.rst @@ -0,0 +1,74 @@ +.. 
SPDX-License-Identifier: GPL-2.0-or-later +.. _kho-concepts: + +======================= +Kexec Handover Concepts +======================= + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +It introduces multiple concepts: + +KHO FDT +======= + +Every KHO kexec carries a KHO specific flattened device tree (FDT) blob +that describes preserved memory regions. These regions contain either +serialized subsystem states, or in-memory data that shall not be touched +across kexec. After KHO, subsystems can retrieve and restore preserved +memory regions from KHO FDT. + +KHO only uses the FDT container format and libfdt library, but does not +adhere to the same property semantics that normal device trees do: Properties +are passed in native endianness and standardized properties like ``regs`` and +``ranges`` do not exist, hence there are no ``#...-cells`` properties. + +KHO is still under development. The FDT schema is unstable and would change +in the future. + +Scratch Regions +=============== + +To boot into kexec, we need to have a physically contiguous memory range that +contains no handed over memory. Kexec then places the target kernel and initrd +into that region. The new kernel exclusively uses this region for memory +allocations before during boot up to the initialization of the page allocator. + +We guarantee that we always have such regions through the scratch regions: On +first boot KHO allocates several physically contiguous memory regions. Since +after kexec these regions will be used by early memory allocations, there is a +scratch region per NUMA node plus a scratch region to satisfy allocations +requests that do not require particular NUMA node assignment. +By default, size of the scratch region is calculated based on amount of memory +allocated during boot. The ``kho_scratch`` kernel command line option may be +used to explicitly define size of the scratch regions. +The scratch regions are declared as CMA when page allocator is initialized so +that their memory can be used during system lifetime. CMA gives us the +guarantee that no handover pages land in that region, because handover pages +must be at a static physical memory location and CMA enforces that only +movable pages can be located inside. + +After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and +instead reuse the exact same region that was originally allocated. This allows +us to recursively execute any amount of KHO kexecs. Because we used this region +for boot memory allocations and as target memory for kexec blobs, some parts +of that memory region may be reserved. These reservations are irrelevant for +the next KHO, because kexec can overwrite even the original kernel. + +.. _kho-finalization-phase: + +KHO finalization phase +====================== + +To enable user space based kexec file loader, the kernel needs to be able to +provide the FDT that describes the current kernel's state before +performing the actual kexec. The process of generating that FDT is +called serialization. When the FDT is generated, some properties +of the system may become immutable because they are already written down +in the FDT. That state is called the KHO finalization phase. + +Public API +========== +.. 
kernel-doc:: kernel/kexec_handover.c + :export: diff --git a/Documentation/core-api/kho/fdt.rst b/Documentation/core-api/kho/fdt.rst new file mode 100644 index 000000000000..62505285d60d --- /dev/null +++ b/Documentation/core-api/kho/fdt.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======= +KHO FDT +======= + +KHO uses the flattened device tree (FDT) container format and libfdt +library to create and parse the data that is passed between the +kernels. The properties in KHO FDT are stored in native format. +It includes the physical address of an in-memory structure describing +all preserved memory regions, as well as physical addresses of KHO users' +own FDTs. Interpreting those sub FDTs is the responsibility of KHO users. + +KHO nodes and properties +======================== + +Property ``preserved-memory-map`` +--------------------------------- + +KHO saves a special property named ``preserved-memory-map`` under the root node. +This node contains the physical address of an in-memory structure for KHO to +preserve memory regions across kexec. + +Property ``compatible`` +----------------------- + +The ``compatible`` property determines compatibility between the kernel +that created the KHO FDT and the kernel that attempts to load it. +If the kernel that loads the KHO FDT is not compatible with it, the entire +KHO process will be bypassed. + +Property ``fdt`` +---------------- + +Generally, a KHO user serialize its state into its own FDT and instructs +KHO to preserve the underlying memory, such that after kexec, the new kernel +can recover its state from the preserved FDT. + +A KHO user thus can create a node in KHO root tree and save the physical address +of its own FDT in that node's property ``fdt`` . + +Examples +======== + +The following example demonstrates KHO FDT that preserves two memory +regions created with ``reserve_mem`` kernel command line parameter:: + + /dts-v1/; + + / { + compatible = "kho-v1"; + + preserved-memory-map = <0x40be16 0x1000000>; + + memblock { + fdt = <0x1517 0x1000000>; + }; + }; + +where the ``memblock`` node contains an FDT that is requested by the +subsystem memblock for preservation. The FDT contains the following +serialized data:: + + /dts-v1/; + + / { + compatible = "memblock-v1"; + + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + + n2 { + compatible = "reserve-mem-v1"; + start = <0xc067 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst new file mode 100644 index 000000000000..0c63b0c5c143 --- /dev/null +++ b/Documentation/core-api/kho/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======================== +Kexec Handover Subsystem +======================== + +.. toctree:: + :maxdepth: 1 + + concepts + fdt + +.. 
only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 943b23fc3442..584274a5426a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13145,6 +13145,8 @@ M: Mike Rapoport M: Changyuan Lyu L: kexec@lists.infradead.org S: Maintained +F: Documentation/admin-guide/mm/kho.rst +F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: kernel/kexec_handover.c From a3d2e34dce2041cf6994919430e75e5eafb99bcd Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Fri, 9 May 2025 00:46:35 -0700 Subject: [PATCH 192/285] Documentation: KHO: add memblock bindings We introduced KHO into Linux: A framework that allows Linux to pass metadata and memory across kexec from Linux to Linux. KHO reuses fdt as file format and shares a lot of the same properties of firmware-to- Linux boot formats: It needs a stable, documented ABI that allows for forward and backward compatibility as well as versioning. As first user of KHO, we introduced memblock which can now preserve memory ranges reserved with reserve_mem command line options contents across kexec, so you can use the post-kexec kernel to read traces from the pre-kexec kernel. This patch adds memblock schemas similar to "device" device tree ones to a new kho bindings directory. This allows us to force contributors to document the data that moves across KHO kexecs and catch breaking change during review. Link: https://lkml.kernel.org/r/20250509074635.3187114-18-changyuanl@google.com Co-developed-by: Alexander Graf Signed-off-by: Alexander Graf Signed-off-by: Mike Rapoport (Microsoft) Signed-off-by: Changyuan Lyu Cc: Andy Lutomirski Cc: Anthony Yznaga Cc: Arnd Bergmann Cc: Ashish Kalra Cc: Ben Herrenschmidt Cc: Borislav Betkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Woodhouse Cc: Eric Biederman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Gowans Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Krzysztof Kozlowski Cc: Marc Rutland Cc: Paolo Bonzini Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Pratyush Yadav Cc: Rob Herring Cc: Saravana Kannan Cc: Stanislav Kinsburskii Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Thomas Lendacky Cc: Will Deacon Signed-off-by: Andrew Morton --- .../kho/bindings/memblock/memblock.yaml | 39 ++++++++++++++++++ .../kho/bindings/memblock/reserve-mem.yaml | 40 +++++++++++++++++++ MAINTAINERS | 1 + 3 files changed, 80 insertions(+) create mode 100644 Documentation/core-api/kho/bindings/memblock/memblock.yaml create mode 100644 Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml new file mode 100644 index 000000000000..d388c28eb91d --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/memblock.yaml @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory + +maintainers: + - Mike Rapoport + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + The post-KHO kernel can then consume these reservations and they are + guaranteed to have the same physical address. 
+ +properties: + compatible: + enum: + - reserve-mem-v1 + +patternProperties: + "$[0-9a-f_]+^": + $ref: reserve-mem.yaml# + description: reserved memory regions + +required: + - compatible + +additionalProperties: false + +examples: + - | + memblock { + compatible = "memblock-v1"; + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml new file mode 100644 index 000000000000..10282d3d1bcd --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory regions + +maintainers: + - Mike Rapoport + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + This object describes each such region. + +properties: + compatible: + enum: + - reserve-mem-v1 + + start: + description: | + physical address (u64) of the reserved memory region. + + size: + description: | + size (u64) of the reserved memory region. + +required: + - compatible + - start + - size + +additionalProperties: false + +examples: + - | + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; diff --git a/MAINTAINERS b/MAINTAINERS index 584274a5426a..eb5a8c791f01 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15448,6 +15448,7 @@ M: Mike Rapoport L: linux-mm@kvack.org S: Maintained F: Documentation/core-api/boot-time-mm.rst +F: Documentation/core-api/kho/bindings/memblock/* F: include/linux/memblock.h F: mm/memblock.c F: mm/mm_init.c From f88ce2c84a341f44a7d00bc10868714bc4751f7e Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:37 +0100 Subject: [PATCH 193/285] mm: introduce for_each_valid_pfn() and use it from reserve_bootmem_region() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm: Introduce for_each_valid_pfn()", v4. There are cases where a naïve loop over a PFN range, calling pfn_valid() on each one, is horribly inefficient. Ruihan Li reported the case where memmap_init() iterates all the way from zero to a potentially large value of ARCH_PFN_OFFSET, and we at Amazon found the reserve_bootmem_region() one as it affects hypervisor live update. Others are more cosmetic. By introducing a for_each_valid_pfn() helper it can optimise away a lot of pointless calls to pfn_valid(), skipping immediately to the next valid PFN and also skipping *all* checks within a valid (sub)region according to the granularity of the memory model in use. This patch (of 7) Especially since commit 9092d4f7a1f8 ("memblock: update initialization of reserved pages"), the reserve_bootmem_region() function can spend a significant amount of time iterating over every 4KiB PFN in a range, calling pfn_valid() on each one, and ultimately doing absolutely nothing. On a platform used for virtualization, with large NOMAP regions that eventually get used for guest RAM, this leads to a significant increase in steal time experienced during kexec for a live update. Introduce for_each_valid_pfn() and use it from reserve_bootmem_region(). 
This implementation is precisely the same naïve loop that the functio used to have, but subsequent commits will provide optimised versions for FLATMEM and SPARSEMEM, and this version will remain for those architectures which provide their own pfn_valid() implementation, until/unless they also provide a matching for_each_valid_pfn(). Link: https://lkml.kernel.org/r/20250423133821.789413-1-dwmw2@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-2-dwmw2@infradead.org Signed-off-by: David Woodhouse Reviewed-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 10 ++++++++++ mm/mm_init.c | 23 ++++++++++------------- 2 files changed, 20 insertions(+), 13 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6ccec1bf2896..230a29c2ed1a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2177,6 +2177,16 @@ void sparse_init(void); #define subsection_map_init(_pfn, _nr_pages) do {} while (0) #endif /* CONFIG_SPARSEMEM */ +/* + * Fallback case for when the architecture provides its own pfn_valid() but + * not a corresponding for_each_valid_pfn(). + */ +#ifndef for_each_valid_pfn +#define for_each_valid_pfn(_pfn, _start_pfn, _end_pfn) \ + for ((_pfn) = (_start_pfn); (_pfn) < (_end_pfn); (_pfn)++) \ + if (pfn_valid(_pfn)) +#endif + #endif /* !__GENERATING_BOUNDS.H */ #endif /* !__ASSEMBLY__ */ #endif /* _LINUX_MMZONE_H */ diff --git a/mm/mm_init.c b/mm/mm_init.c index b35006d9d49d..7191703a5820 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -783,22 +783,19 @@ void __meminit init_deferred_page(unsigned long pfn, int nid) void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end, int nid) { - unsigned long start_pfn = PFN_DOWN(start); - unsigned long end_pfn = PFN_UP(end); + unsigned long pfn; - for (; start_pfn < end_pfn; start_pfn++) { - if (pfn_valid(start_pfn)) { - struct page *page = pfn_to_page(start_pfn); + for_each_valid_pfn(pfn, PFN_DOWN(start), PFN_UP(end)) { + struct page *page = pfn_to_page(pfn); - __init_deferred_page(start_pfn, nid); + __init_deferred_page(pfn, nid); - /* - * no need for atomic set_bit because the struct - * page is not visible yet so nobody should - * access it yet. - */ - __SetPageReserved(page); - } + /* + * no need for atomic set_bit because the struct + * page is not visible yet so nobody should + * access it yet. + */ + __SetPageReserved(page); } } From 928930c2e0a8e1d9252aedb1fa9be83c3669dfd7 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:38 +0100 Subject: [PATCH 194/285] mm: implement for_each_valid_pfn() for CONFIG_FLATMEM In the FLATMEM case, the default pfn_valid() just checks that the PFN is within the range [ ARCH_PFN_OFFSET .. ARCH_PFN_OFFSET + max_mapnr ). The for_each_valid_pfn() function can therefore be a simple for() loop using those as min/max respectively. 
Link: https://lkml.kernel.org/r/20250423133821.789413-3-dwmw2@infradead.org Signed-off-by: David Woodhouse Reviewed-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- include/asm-generic/memory_model.h | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h index a3b5029aebbd..74d0077cc5fa 100644 --- a/include/asm-generic/memory_model.h +++ b/include/asm-generic/memory_model.h @@ -30,7 +30,15 @@ static inline int pfn_valid(unsigned long pfn) return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr; } #define pfn_valid pfn_valid -#endif + +#ifndef for_each_valid_pfn +#define for_each_valid_pfn(pfn, start_pfn, end_pfn) \ + for ((pfn) = max_t(unsigned long, (start_pfn), ARCH_PFN_OFFSET); \ + (pfn) < min_t(unsigned long, (end_pfn), \ + ARCH_PFN_OFFSET + max_mapnr); \ + (pfn)++) +#endif /* for_each_valid_pfn */ +#endif /* valid_pfn */ #elif defined(CONFIG_SPARSEMEM_VMEMMAP) From 037926316c9ded1194927ee56b862e3a1450aaf3 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:39 +0100 Subject: [PATCH 195/285] mm: implement for_each_valid_pfn() for CONFIG_SPARSEMEM Implement for_each_valid_pfn() based on two helper functions. The first_valid_pfn() function largely mirrors pfn_valid(), calling into a pfn_section_first_valid() helper which is trivial for the !VMEMMAP case, and in the VMEMMAP case will skip to the next subsection as needed. Since next_valid_pfn() knows that its argument *is* a valid PFN, it doesn't need to do any checking at all while iterating over the low bits within a (sub)section mask; the whole (sub)section is either present or not. Note that the VMEMMAP version of pfn_section_first_valid() may return a value *higher* than end_pfn when skipping to the next subsection, and first_valid_pfn() happily returns that higher value. This is fine. [dwmw2@infradead.org: fix next_valid_pfn() for sparsemem] Link: https://lkml.kernel.org/r/c15100fcf6781a60b852c4dbb43bdc98a678fcf0.camel@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-4-dwmw2@infradead.org Signed-off-by: David Woodhouse Reviewed-by: Mike Rapoport (Microsoft) Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: David Hildenbrand Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 78 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 78 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 230a29c2ed1a..b19a98c20de8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2075,11 +2075,37 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) return usage ? 
test_bit(idx, usage->subsection_map) : 0; } + +static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn) +{ + struct mem_section_usage *usage = READ_ONCE(ms->usage); + int idx = subsection_map_index(*pfn); + unsigned long bit; + + if (!usage) + return false; + + if (test_bit(idx, usage->subsection_map)) + return true; + + /* Find the next subsection that exists */ + bit = find_next_bit(usage->subsection_map, SUBSECTIONS_PER_SECTION, idx); + if (bit == SUBSECTIONS_PER_SECTION) + return false; + + *pfn = (*pfn & PAGE_SECTION_MASK) + (bit * PAGES_PER_SUBSECTION); + return true; +} #else static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) { return 1; } + +static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn) +{ + return true; +} #endif void sparse_init_early_section(int nid, struct page *map, unsigned long pnum, @@ -2128,6 +2154,58 @@ static inline int pfn_valid(unsigned long pfn) return ret; } + +/* Returns end_pfn or higher if no valid PFN remaining in range */ +static inline unsigned long first_valid_pfn(unsigned long pfn, unsigned long end_pfn) +{ + unsigned long nr = pfn_to_section_nr(pfn); + + rcu_read_lock_sched(); + + while (nr <= __highest_present_section_nr && pfn < end_pfn) { + struct mem_section *ms = __pfn_to_section(pfn); + + if (valid_section(ms) && + (early_section(ms) || pfn_section_first_valid(ms, &pfn))) { + rcu_read_unlock_sched(); + return pfn; + } + + /* Nothing left in this section? Skip to next section */ + nr++; + pfn = section_nr_to_pfn(nr); + } + + rcu_read_unlock_sched(); + return end_pfn; +} + +static inline unsigned long next_valid_pfn(unsigned long pfn, unsigned long end_pfn) +{ + pfn++; + + if (pfn >= end_pfn) + return end_pfn; + + /* + * Either every PFN within the section (or subsection for VMEMMAP) is + * valid, or none of them are. So there's no point repeating the check + * for every PFN; only call first_valid_pfn() again when crossing a + * (sub)section boundary (i.e. !(pfn & ~PAGE_{SUB,}SECTION_MASK)). + */ + if (pfn & ~(IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) ? 
+ PAGE_SUBSECTION_MASK : PAGE_SECTION_MASK)) + return pfn; + + return first_valid_pfn(pfn, end_pfn); +} + + +#define for_each_valid_pfn(_pfn, _start_pfn, _end_pfn) \ + for ((_pfn) = first_valid_pfn((_start_pfn), (_end_pfn)); \ + (_pfn) < (_end_pfn); \ + (_pfn) = next_valid_pfn((_pfn), (_end_pfn))) + #endif static inline int pfn_in_present_section(unsigned long pfn) From 312eca8a14c5f58de62692517afd4738099e6fac Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:40 +0100 Subject: [PATCH 196/285] mm, PM: use for_each_valid_pfn() in kernel/power/snapshot.c Link: https://lkml.kernel.org/r/20250423133821.789413-5-dwmw2@infradead.org Signed-off-by: David Woodhouse Acked-by: Mike Rapoport (Microsoft) Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: David Hildenbrand Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- kernel/power/snapshot.c | 44 ++++++++++++++++++++--------------------- 1 file changed, 21 insertions(+), 23 deletions(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 4e6e24e8b854..2af36cfe35cd 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1094,16 +1094,15 @@ static void mark_nosave_pages(struct memory_bitmap *bm) ((unsigned long long) region->end_pfn << PAGE_SHIFT) - 1); - for (pfn = region->start_pfn; pfn < region->end_pfn; pfn++) - if (pfn_valid(pfn)) { - /* - * It is safe to ignore the result of - * mem_bm_set_bit_check() here, since we won't - * touch the PFNs for which the error is - * returned anyway. - */ - mem_bm_set_bit_check(bm, pfn); - } + for_each_valid_pfn(pfn, region->start_pfn, region->end_pfn) { + /* + * It is safe to ignore the result of + * mem_bm_set_bit_check() here, since we won't + * touch the PFNs for which the error is + * returned anyway. + */ + mem_bm_set_bit_check(bm, pfn); + } } } @@ -1255,22 +1254,21 @@ static void mark_free_pages(struct zone *zone) spin_lock_irqsave(&zone->lock, flags); max_zone_pfn = zone_end_pfn(zone); - for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) - if (pfn_valid(pfn)) { - page = pfn_to_page(pfn); + for_each_valid_pfn(pfn, zone->zone_start_pfn, max_zone_pfn) { + page = pfn_to_page(pfn); - if (!--page_count) { - touch_nmi_watchdog(); - page_count = WD_PAGE_COUNT; - } - - if (page_zone(page) != zone) - continue; - - if (!swsusp_page_is_forbidden(page)) - swsusp_unset_page_free(page); + if (!--page_count) { + touch_nmi_watchdog(); + page_count = WD_PAGE_COUNT; } + if (page_zone(page) != zone) + continue; + + if (!swsusp_page_is_forbidden(page)) + swsusp_unset_page_free(page); + } + for_each_migratetype_order(order, t) { list_for_each_entry(page, &zone->free_area[order].free_list[t], buddy_list) { From 49d8d78f8c6f5ef58e8dccde0d447132b7b4594a Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:41 +0100 Subject: [PATCH 197/285] mm, x86: use for_each_valid_pfn() from __ioremap_check_ram() Instead of calling pfn_valid() separately for every single PFN in the range, use for_each_valid_pfn() and only look at the ones which are. 
Link: https://lkml.kernel.org/r/20250423133821.789413-6-dwmw2@infradead.org Signed-off-by: David Woodhouse Acked-by: Mike Rapoport (Microsoft) Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: David Hildenbrand Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- arch/x86/mm/ioremap.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 331e101bf801..12c8180ca1ba 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -71,7 +71,7 @@ int ioremap_change_attr(unsigned long vaddr, unsigned long size, static unsigned int __ioremap_check_ram(struct resource *res) { unsigned long start_pfn, stop_pfn; - unsigned long i; + unsigned long pfn; if ((res->flags & IORESOURCE_SYSTEM_RAM) != IORESOURCE_SYSTEM_RAM) return 0; @@ -79,9 +79,8 @@ static unsigned int __ioremap_check_ram(struct resource *res) start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT; stop_pfn = (res->end + 1) >> PAGE_SHIFT; if (stop_pfn > start_pfn) { - for (i = 0; i < (stop_pfn - start_pfn); ++i) - if (pfn_valid(start_pfn + i) && - !PageReserved(pfn_to_page(start_pfn + i))) + for_each_valid_pfn(pfn, start_pfn, stop_pfn) + if (!PageReserved(pfn_to_page(pfn))) return IORES_MAP_SYSTEM_RAM; } From 6f544e41d9d552c6c5fbd751badd33dc3eae4a6c Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:42 +0100 Subject: [PATCH 198/285] mm: use for_each_valid_pfn() in memory_hotplug Link: https://lkml.kernel.org/r/20250423133821.789413-7-dwmw2@infradead.org Signed-off-by: David Woodhouse Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Marc Rutland Cc: Marc Zyngier Cc: Ruihan Li Cc: Will Deacon Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 8305483de38b..b1caedbade5b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1756,12 +1756,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end, { unsigned long pfn; - for (pfn = start; pfn < end; pfn++) { + for_each_valid_pfn(pfn, start, end) { struct page *page; struct folio *folio; - if (!pfn_valid(pfn)) - continue; page = pfn_to_page(pfn); if (PageLRU(page)) goto found; @@ -1805,11 +1803,9 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for_each_valid_pfn(pfn, start_pfn, end_pfn) { struct page *page; - if (!pfn_valid(pfn)) - continue; page = pfn_to_page(pfn); folio = page_folio(page); From 31cf0dd94509eb61e7242e217aea9604621f6b6d Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 23 Apr 2025 14:33:43 +0100 Subject: [PATCH 199/285] mm/mm_init: use for_each_valid_pfn() in init_unavailable_range() Currently, memmap_init initializes pfn_hole with 0 instead of ARCH_PFN_OFFSET. Then init_unavailable_range will start iterating each page from the page at address zero to the first available page, but it won't do anything for pages below ARCH_PFN_OFFSET because pfn_valid won't pass. 
If ARCH_PFN_OFFSET is very large (e.g., something like 2^64-2GiB if the kernel is used as a library and loaded at a very high address), the pointless iteration for pages below ARCH_PFN_OFFSET will take a very long time, and the kernel will look stuck at boot time. Use for_each_valid_pfn() to skip the pointless iterations. Link: https://lkml.kernel.org/r/20250423133821.789413-8-dwmw2@infradead.org Signed-off-by: David Woodhouse Reported-by: Ruihan Li Suggested-by: Mike Rapoport Reviewed-by: Mike Rapoport (Microsoft) Tested-by: Ruihan Li Tested-by: Lorenzo Stoakes Cc: Anshuman Khandual Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: David Hildenbrand Cc: Marc Rutland Cc: Marc Zyngier Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/mm_init.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index 7191703a5820..1c5444e188f8 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -851,11 +851,7 @@ static void __init init_unavailable_range(unsigned long spfn, unsigned long pfn; u64 pgcnt = 0; - for (pfn = spfn; pfn < epfn; pfn++) { - if (!pfn_valid(pageblock_start_pfn(pfn))) { - pfn = pageblock_end_pfn(pfn) - 1; - continue; - } + for_each_valid_pfn(pfn, spfn, epfn) { __init_single_page(pfn_to_page(pfn), pfn, zone, node); __SetPageReserved(pfn_to_page(pfn)); pgcnt++; From 551c643fb29a221e8fcd00ff680a364a73deb2f3 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Mon, 21 Apr 2025 18:16:28 +0100 Subject: [PATCH 200/285] mm: workingset: simplify lockdep check in update_node container_of(node->array, ..., i_pages) just to access i_pages again is an incredibly roundabout way of accessing node->array itself. Simplify it. Link: https://lkml.kernel.org/r/20250421-workingset-simplify-v1-1-de5c40051e0e@suse.de Signed-off-by: Pedro Falcato Reviewed-by: Matthew Wilcox (Oracle) Acked-by: Johannes Weiner Signed-off-by: Andrew Morton --- mm/workingset.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index 4841ae8af411..6e7f4cb1b9a7 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -612,7 +612,6 @@ struct list_lru shadow_nodes; void workingset_update_node(struct xa_node *node) { - struct address_space *mapping; struct page *page = virt_to_page(node); /* @@ -623,8 +622,7 @@ void workingset_update_node(struct xa_node *node) * already where they should be. The list_empty() test is safe * as node->private_list is protected by the i_pages lock. */ - mapping = container_of(node->array, struct address_space, i_pages); - lockdep_assert_held(&mapping->i_pages.xa_lock); + lockdep_assert_held(&node->array->xa_lock); if (node->count && node->count == node->nr_values) { if (list_empty(&node->private_list)) { From ee43f26b49e9ddad0f06c149085343613a9d73a4 Mon Sep 17 00:00:00 2001 From: Su Hui Date: Mon, 21 Apr 2025 14:24:24 +0800 Subject: [PATCH 201/285] mm/damon/sysfs-schemes: use kmalloc_array() and size_add() It's safer to use kmalloc_array() and size_add() because it can prevent possible overflow problem. 
Link: https://lkml.kernel.org/r/20250421062423.740605-1-suhui@nfschina.com Signed-off-by: Su Hui Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 729fe5f1ef30..c2b8a9cb44ec 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -465,7 +465,8 @@ static ssize_t memcg_path_store(struct kobject *kobj, { struct damon_sysfs_scheme_filter *filter = container_of(kobj, struct damon_sysfs_scheme_filter, kobj); - char *path = kmalloc(sizeof(*path) * (count + 1), GFP_KERNEL); + char *path = kmalloc_array(size_add(count, 1), sizeof(*path), + GFP_KERNEL); if (!path) return -ENOMEM; @@ -2064,7 +2065,7 @@ static int damon_sysfs_memcg_path_to_id(char *memcg_path, unsigned short *id) if (!memcg_path) return -EINVAL; - path = kmalloc(sizeof(*path) * PATH_MAX, GFP_KERNEL); + path = kmalloc_array(PATH_MAX, sizeof(*path), GFP_KERNEL); if (!path) return -ENOMEM; From 4428a35f91f0f0e31d874038b3091e1c5a461f34 Mon Sep 17 00:00:00 2001 From: Lance Yang Date: Thu, 24 Apr 2025 23:56:06 +0800 Subject: [PATCH 202/285] mm/rmap: inline folio_test_large_maybe_mapped_shared() into callers To prevent the function from being used when CONFIG_MM_ID is disabled, we intend to inline it into its few callers, which also would help maintain the expected code placement. Link: https://lkml.kernel.org/r/20250424155606.57488-1-lance.yang@linux.dev Signed-off-by: Lance Yang Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Mingzhe Yang Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 +- include/linux/page-flags.h | 4 ---- include/linux/rmap.h | 2 +- mm/memory.c | 4 ++-- 4 files changed, 4 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 9b701cfbef22..21dd110b6655 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2111,7 +2111,7 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio) */ if (mapcount <= 1) return false; - return folio_test_large_maybe_mapped_shared(folio); + return test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids); } #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d3909cb1e576..37b11f15dbd9 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1230,10 +1230,6 @@ static inline int folio_has_private(const struct folio *folio) return !!(folio->flags & PAGE_FLAGS_PRIVATE); } -static inline bool folio_test_large_maybe_mapped_shared(const struct folio *folio) -{ - return test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids); -} #undef PF_ANY #undef PF_HEAD #undef PF_NO_TAIL diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6b82b618846e..c4f4903b1088 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -223,7 +223,7 @@ static inline void __folio_large_mapcount_sanity_checks(const struct folio *foli VM_WARN_ON_ONCE(folio_mm_id(folio, 1) != MM_ID_DUMMY && folio->_mm_id_mapcount[1] < 0); VM_WARN_ON_ONCE(!folio_mapped(folio) && - folio_test_large_maybe_mapped_shared(folio)); + test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)); } static __always_inline void folio_set_large_mapcount(struct folio *folio, diff --git a/mm/memory.c b/mm/memory.c index be124dadec9e..68c1d962d0ad 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3768,7 +3768,7 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio, * If all folio references are from mappings, and all 
mappings are in * the page tables of this MM, then this folio is exclusive to this MM. */ - if (folio_test_large_maybe_mapped_shared(folio)) + if (test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)) return false; VM_WARN_ON_ONCE(folio_test_ksm(folio)); @@ -3791,7 +3791,7 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio, folio_lock_large_mapcount(folio); VM_WARN_ON_ONCE_FOLIO(folio_large_mapcount(folio) > folio_ref_count(folio), folio); - if (folio_test_large_maybe_mapped_shared(folio)) + if (test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)) goto unlock; if (folio_large_mapcount(folio) != folio_ref_count(folio)) goto unlock; From f60b6634cd88a749fdcd9edfeb2079c23aa05b66 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Thu, 24 Apr 2025 17:57:29 -0400 Subject: [PATCH 203/285] mm/selftests: add a test to verify mmap_changing race with -EAGAIN Add an unit test to verify the recent mmap_changing ABI breakage. Note that I used some tricks here and there to make the test simple, e.g. I abused UFFDIO_MOVE on top of shmem with the fact that I know what I want to test will be even earlier than the vma type check. Rich comments were added to explain trivial details. Before that fix, -EAGAIN would have been written to the copy field most of the time but not always; the test should be able to reliably trigger the outlier case. After the fix, it's written always, the test verifies that making sure corresponding field (e.g. copy.copy for UFFDIO_COPY) is updated. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250424215729.194656-3-peterx@redhat.com Signed-off-by: Peter Xu Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-unit-tests.c | 202 +++++++++++++++++++ 1 file changed, 202 insertions(+) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index e8fd9011c2a3..c73fd5d455c8 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -1231,6 +1231,182 @@ static void uffd_move_pmd_split_test(uffd_test_args_t *targs) uffd_move_pmd_handle_fault); } +static bool +uffdio_verify_results(const char *name, int ret, int error, long result) +{ + /* + * Should always return -1 with errno=EAGAIN, with corresponding + * result field updated in ioctl() args to be -EAGAIN too + * (e.g. copy.copy field for UFFDIO_COPY). + */ + if (ret != -1) { + uffd_test_fail("%s should have returned -1", name); + return false; + } + + if (error != EAGAIN) { + uffd_test_fail("%s should have errno==EAGAIN", name); + return false; + } + + if (result != -EAGAIN) { + uffd_test_fail("%s should have been updated for -EAGAIN", + name); + return false; + } + + return true; +} + +/* + * This defines a function to test one ioctl. Note that here "field" can + * be 1 or anything not -EAGAIN. With that initial value set, we can + * verify later that it should be updated by kernel (when -EAGAIN + * returned), by checking whether it is also updated to -EAGAIN. 
+ */ +#define DEFINE_MMAP_CHANGING_TEST(name, ioctl_name, field) \ + static bool uffdio_mmap_changing_test_##name(int fd) \ + { \ + int ret; \ + struct uffdio_##name args = { \ + .field = 1, \ + }; \ + ret = ioctl(fd, ioctl_name, &args); \ + return uffdio_verify_results(#ioctl_name, ret, errno, args.field); \ + } + +DEFINE_MMAP_CHANGING_TEST(zeropage, UFFDIO_ZEROPAGE, zeropage) +DEFINE_MMAP_CHANGING_TEST(copy, UFFDIO_COPY, copy) +DEFINE_MMAP_CHANGING_TEST(move, UFFDIO_MOVE, move) +DEFINE_MMAP_CHANGING_TEST(poison, UFFDIO_POISON, updated) +DEFINE_MMAP_CHANGING_TEST(continue, UFFDIO_CONTINUE, mapped) + +typedef enum { + /* We actually do not care about any state except UNINTERRUPTIBLE.. */ + THR_STATE_UNKNOWN = 0, + THR_STATE_UNINTERRUPTIBLE, +} thread_state; + +static void sleep_short(void) +{ + usleep(1000); +} + +static thread_state thread_state_get(pid_t tid) +{ + const char *header = "State:\t"; + char tmp[256], *p, c; + FILE *fp; + + snprintf(tmp, sizeof(tmp), "/proc/%d/status", tid); + fp = fopen(tmp, "r"); + + if (!fp) + return THR_STATE_UNKNOWN; + + while (fgets(tmp, sizeof(tmp), fp)) { + p = strstr(tmp, header); + if (p) { + /* For example, "State:\tD (disk sleep)" */ + c = *(p + sizeof(header) - 1); + return c == 'D' ? + THR_STATE_UNINTERRUPTIBLE : THR_STATE_UNKNOWN; + } + } + + return THR_STATE_UNKNOWN; +} + +static void thread_state_until(pid_t tid, thread_state state) +{ + thread_state s; + + do { + s = thread_state_get(tid); + sleep_short(); + } while (s != state); +} + +static void *uffd_mmap_changing_thread(void *opaque) +{ + volatile pid_t *pid = opaque; + int ret; + + /* Unfortunately, it's only fetch-able from the thread itself.. */ + assert(*pid == 0); + *pid = syscall(SYS_gettid); + + /* Inject an event, this will hang solid until the event read */ + ret = madvise(area_dst, page_size, MADV_REMOVE); + if (ret) + err("madvise(MADV_REMOVE) failed"); + + return NULL; +} + +static void uffd_consume_message(int fd) +{ + struct uffd_msg msg = { 0 }; + + while (uffd_read_msg(fd, &msg)); +} + +static void uffd_mmap_changing_test(uffd_test_args_t *targs) +{ + /* + * This stores the real PID (which can be different from how tid is + * defined..) for the child thread, 0 means not initialized. + */ + pid_t pid = 0; + pthread_t tid; + int ret; + + if (uffd_register(uffd, area_dst, nr_pages * page_size, + true, false, false)) + err("uffd_register() failed"); + + /* Create a thread to generate the racy event */ + ret = pthread_create(&tid, NULL, uffd_mmap_changing_thread, &pid); + if (ret) + err("pthread_create() failed"); + + /* + * Wait until the thread setup the pid. Use volatile to make sure + * it reads from RAM not regs. + */ + while (!(volatile pid_t)pid) + sleep_short(); + + /* Wait until the thread hangs at REMOVE event */ + thread_state_until(pid, THR_STATE_UNINTERRUPTIBLE); + + if (!uffdio_mmap_changing_test_copy(uffd)) + return; + + if (!uffdio_mmap_changing_test_zeropage(uffd)) + return; + + if (!uffdio_mmap_changing_test_move(uffd)) + return; + + if (!uffdio_mmap_changing_test_poison(uffd)) + return; + + if (!uffdio_mmap_changing_test_continue(uffd)) + return; + + /* + * All succeeded above! Recycle everything. Start by reading the + * event so as to kick the thread roll again.. 
+ */ + uffd_consume_message(uffd); + + ret = pthread_join(tid, NULL); + assert(ret == 0); + + uffd_test_pass(); +} + static int prevent_hugepages(const char **errmsg) { /* This should be done before source area is populated */ @@ -1470,6 +1646,32 @@ uffd_test_case_t uffd_tests[] = { .mem_targets = MEM_ALL, .uffd_feature_required = UFFD_FEATURE_POISON, }, + { + .name = "mmap-changing", + .uffd_fn = uffd_mmap_changing_test, + /* + * There's no point running this test over all mem types as + * they share the same code paths. + * + * Choose shmem for simplicity, because (1) shmem supports + * MINOR mode to cover UFFDIO_CONTINUE, and (2) shmem is + * almost always available (unlike hugetlb). Here we + * abused SHMEM for UFFDIO_MOVE, but the test we want to + * cover doesn't yet need the correct memory type.. + */ + .mem_targets = MEM_SHMEM, + /* + * Any UFFD_FEATURE_EVENT_* should work to trigger the + * race logically, but choose the simplest (REMOVE). + * + * Meanwhile, since we'll cover quite a few new ioctl()s + * (CONTINUE, POISON, MOVE), skip this test for old kernels + * by choosing all of them. + */ + .uffd_feature_required = UFFD_FEATURE_EVENT_REMOVE | + UFFD_FEATURE_MOVE | UFFD_FEATURE_POISON | + UFFD_FEATURE_MINOR_SHMEM, + }, }; static void usage(const char *prog) From 1f6c6ac03db4a8fface0ad34ba12e5ffb4d51327 Mon Sep 17 00:00:00 2001 From: Libo Chen Date: Wed, 23 Apr 2025 19:45:22 -0700 Subject: [PATCH 204/285] sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems", v5. This patch (of 2): When the memory of the current task is pinned to one NUMA node by cgroup, there is no point in continuing the rest of VMA scanning and hinting page faults as they will just be overhead. With this change, there will be no more unnecessary PTE updates or page faults in this scenario. We have seen up to a 6x improvement on a typical Java workload running on VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel platform, we have seen a 20% improvement in a microbench that creates a 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB pages in a fixed number of loops. Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com Signed-off-by: Libo Chen Tested-by: Chen Yu Tested-by: Srikanth Aithal Tested-by: Venkat Rao Bagalkote Cc: "Chen, Tim C" Cc: Chris Hyser Cc: Daniel Jordan Cc: Ingo Molnar Cc: Juri Lelli Cc: K Prateek Nayak Cc: Lorenzo Stoakes Cc: Madadi Vineeth Reddy Cc: Mel Gorman Cc: Michal Koutný Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Steven Rostedt Cc: Tejun Heo Cc: Vincent Guittot Signed-off-by: Andrew Morton --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0fb9bf995a47..b3b715e8a7cb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work) if (p->flags & PF_EXITING) return; + /* + * Memory is pinned to only one NUMA node via cpuset.mems, naturally + * no page can be migrated. 
+ */ + if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) + return; + if (!mm->numa_next_scan) { mm->numa_next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); From 3fc567e4c0b71d6c59ba26c5d6e54cf3c490dd3a Mon Sep 17 00:00:00 2001 From: Libo Chen Date: Wed, 23 Apr 2025 19:45:23 -0700 Subject: [PATCH 205/285] sched/numa: add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Unlike sched_skip_vma_numa tracepoint which tracks skipped VMAs, this tracks the task subjected to cpuset.mems pinning and prints out its allowed memory node mask. Link: https://lkml.kernel.org/r/20250424024523.2298272-3-libo.chen@oracle.com Signed-off-by: Libo Chen Cc: "Chen, Tim C" Cc: Chen Yu Cc: Chris Hyser Cc: Daniel Jordan Cc: Ingo Molnar Cc: Juri Lelli Cc: K Prateek Nayak Cc: Lorenzo Stoakes Cc: Madadi Vineeth Reddy Cc: Mel Gorman Cc: Michal Koutný Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Srikanth Aithal Cc: Steven Rostedt Cc: Tejun Heo Cc: Venkat Rao Bagalkote Cc: Vincent Guittot Signed-off-by: Andrew Morton --- include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 4 +++- 2 files changed, 36 insertions(+), 1 deletion(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 8994e97d86c1..ff3990318aec 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -745,6 +745,39 @@ TRACE_EVENT(sched_skip_vma_numa, __entry->vm_end, __print_symbolic(__entry->reason, NUMAB_SKIP_REASON)) ); + +TRACE_EVENT(sched_skip_cpuset_numa, + + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr), + + TP_ARGS(tsk, mem_allowed_ptr), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( pid_t, tgid ) + __field( pid_t, ngid ) + __array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES)) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = task_pid_nr(tsk); + __entry->tgid = task_tgid_nr(tsk); + __entry->ngid = task_numa_group_id(tsk); + BUILD_BUG_ON(sizeof(nodemask_t) != \ + BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long)); + memcpy(__entry->mem_allowed, mem_allowed_ptr->bits, + sizeof(__entry->mem_allowed)); + ), + + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl", + __entry->comm, + __entry->pid, + __entry->tgid, + __entry->ngid, + MAX_NUMNODES, __entry->mem_allowed) +); #endif /* CONFIG_NUMA_BALANCING */ /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b3b715e8a7cb..cef163c174bd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work) * Memory is pinned to only one NUMA node via cpuset.mems, naturally * no page can be migrated. */ - if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) + if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) { + trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed); return; + } if (!mm->numa_next_scan) { mm->numa_next_scan = now + From 60309008e1e2b2d4bff3c2475b0c74faf395f787 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 28 Apr 2025 10:27:54 +0300 Subject: [PATCH 206/285] util_macros.h: make the header more resilient Add missing header inclusions. 
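For context, a "resilient" header is self-contained: a translation unit that includes util_macros.h first, before anything else, should compile without relying on definitions pulled in transitively by other headers. A minimal sketch of how that property can be checked, assuming a hypothetical standalone test file (not part of this patch):

/* header_test.c: include the header under test first and alone.
 * If util_macros.h forgets a header its own macros depend on,
 * this translation unit fails to compile. */
#include <linux/util_macros.h>

void util_macros_header_test(void) { }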
Link: https://lkml.kernel.org/r/20250428072754.3265274-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton --- include/linux/util_macros.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/util_macros.h b/include/linux/util_macros.h index 3b570b765b75..76ca2b83c13e 100644 --- a/include/linux/util_macros.h +++ b/include/linux/util_macros.h @@ -2,7 +2,10 @@ #ifndef _LINUX_HELPER_MACROS_H_ #define _LINUX_HELPER_MACROS_H_ +#include #include +#include +#include /** * for_each_if - helper for handling conditionals in various for_each macros From 86ebd50224c0734d965843260d0dc057a9431c61 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Wed, 30 Apr 2025 10:01:51 +0000 Subject: [PATCH 207/285] mm: add folio_expected_ref_count() for reference count calculation Patch series "JFS: Implement migrate_folio for jfs_metapage_aops", v5. This patchset addresses a warning that occurs during memory compaction due to JFS's missing migrate_folio operation. The warning was introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystems don't implement migrate_folio. Syzbot reported the following [1]: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Modules linked in: CPU: 1 UID: 0 PID: 5861 Comm: syz-executor280 Not tainted 6.15.0-rc1-next-20250411-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 RIP: 0010:fallback_migrate_folio mm/migrate.c:953 [inline] RIP: 0010:move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 To fix this issue, this series implements metapage_migrate_folio() for JFS, which handles both single and multiple metapages per page configurations. While most filesystems leverage existing migration implementations like filemap_migrate_folio(), buffer_migrate_folio_norefs() or buffer_migrate_folio() (which internally used folio_expected_refs()), JFS's metapage architecture requires special handling of its private data during migration. To support this, this series introduces folio_expected_ref_count(), which calculates external references to a folio from page/swap cache, private data, and page table mappings. This standardized implementation replaces the previous ad-hoc folio_expected_refs() function and enables JFS to accurately determine whether a folio has unexpected references before attempting migration. Implement folio_expected_ref_count() to calculate expected folio reference counts from: - Page/swap cache (1 per page) - Private data (1) - Page table mappings (1 per map) While originally needed for page migration operations, this improved implementation standardizes reference counting by consolidating all refcount contributors into a single, reusable function that can benefit any subsystem needing to detect unexpected references to folios. The folio_expected_ref_count() returns the sum of these external references without including any reference the caller itself might hold. Callers comparing against the actual folio_ref_count() must account for their own references separately.
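To illustrate the caller-side contract spelled out above, here is a minimal sketch of the comparison pattern (mirroring how the migration paths in this series use the helper; the wrapping function is hypothetical):

/* The caller holds one folio reference of its own, so it adds 1
 * to the expected count. Any surplus indicates a temporary
 * reference (e.g. GUP) and the operation should back off. */
static int folio_refs_are_expected(struct folio *folio)
{
	int expected_count = folio_expected_ref_count(folio) + 1;

	if (folio_ref_count(folio) != expected_count)
		return -EAGAIN;	/* unexpected extra references */
	return 0;
}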
Link: https://syzkaller.appspot.com/bug?extid=8bb6fd945af4e0ad9299 [1] Link: https://lkml.kernel.org/r/20250430100150.279751-1-shivankg@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-2-shivankg@amd.com Signed-off-by: David Hildenbrand Signed-off-by: Shivank Garg Suggested-by: Matthew Wilcox Co-developed-by: David Hildenbrand Cc: Alistair Popple Cc: Dave Kleikamp Cc: Donet Tom Cc: Jane Chu Cc: Kefeng Wang Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++ mm/migrate.c | 22 ++++--------------- 2 files changed, 59 insertions(+), 18 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 21dd110b6655..1d1953e37baa 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2114,6 +2114,61 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio) return test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids); } +/** + * folio_expected_ref_count - calculate the expected folio refcount + * @folio: the folio + * + * Calculate the expected folio refcount, taking references from the pagecache, + * swapcache, PG_private and page table mappings into account. Useful in + * combination with folio_ref_count() to detect unexpected references (e.g., + * GUP or other temporary references). + * + * Does currently not consider references from the LRU cache. If the folio + * was isolated from the LRU (which is the case during migration or split), + * the LRU cache does not apply. + * + * Calling this function on an unmapped folio -- !folio_mapped() -- that is + * locked will return a stable result. + * + * Calling this function on a mapped folio will not result in a stable result, + * because nothing stops additional page table mappings from coming (e.g., + * fork()) or going (e.g., munmap()). + * + * Calling this function without the folio lock will also not result in a + * stable result: for example, the folio might get dropped from the swapcache + * concurrently. + * + * However, even when called without the folio lock or on a mapped folio, + * this function can be used to detect unexpected references early (for example, + * if it makes sense to even lock the folio and unmap it). + * + * The caller must add any reference (e.g., from folio_try_get()) it might be + * holding itself to the result. + * + * Returns the expected folio refcount. + */ +static inline int folio_expected_ref_count(const struct folio *folio) +{ + const int order = folio_order(folio); + int ref_count = 0; + + if (WARN_ON_ONCE(folio_test_slab(folio))) + return 0; + + if (folio_test_anon(folio)) { + /* One reference per page from the swapcache. */ + ref_count += folio_test_swapcache(folio) << order; + } else if (!((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS)) { + /* One reference per page from the pagecache. */ + ref_count += !!folio->mapping << order; + /* One reference from PG_private. */ + ref_count += folio_test_private(folio); + } + + /* One reference per page table mapping. 
*/ + return ref_count + folio_mapcount(folio); +} + #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE static inline int arch_make_folio_accessible(struct folio *folio) { diff --git a/mm/migrate.c b/mm/migrate.c index 676d9cfc7059..273d46771a6c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -445,20 +445,6 @@ unlock: } #endif -static int folio_expected_refs(struct address_space *mapping, - struct folio *folio) -{ - int refs = 1; - if (!mapping) - return refs; - - refs += folio_nr_pages(folio); - if (folio_test_private(folio)) - refs++; - - return refs; -} - /* * Replace the folio in the mapping. * @@ -601,7 +587,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count) { - int expected_count = folio_expected_refs(mapping, folio) + extra_count; + int expected_count = folio_expected_ref_count(folio) + extra_count + 1; if (folio_ref_count(folio) != expected_count) return -EAGAIN; @@ -618,7 +604,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, struct folio *dst, struct folio *src) { XA_STATE(xas, &mapping->i_pages, folio_index(src)); - int rc, expected_count = folio_expected_refs(mapping, src); + int rc, expected_count = folio_expected_ref_count(src) + 1; if (folio_ref_count(src) != expected_count) return -EAGAIN; @@ -749,7 +735,7 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, void *src_private, enum migrate_mode mode) { - int rc, expected_count = folio_expected_refs(mapping, src); + int rc, expected_count = folio_expected_ref_count(src) + 1; /* Check whether src does not have extra refs before we do more work */ if (folio_ref_count(src) != expected_count) @@ -837,7 +823,7 @@ static int __buffer_migrate_folio(struct address_space *mapping, return migrate_folio(mapping, dst, src, mode); /* Check whether page does not have extra refs before we do more work */ - expected_count = folio_expected_refs(mapping, src); + expected_count = folio_expected_ref_count(src) + 1; if (folio_ref_count(src) != expected_count) return -EAGAIN; From 906d7ce3b59d62cf017ca602a0dfcb29413b8872 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Wed, 30 Apr 2025 10:01:52 +0000 Subject: [PATCH 208/285] jfs: implement migrate_folio for jfs_metapage_aops Add the missing migrate_folio operation to jfs_metapage_aops to fix warnings during memory compaction. These warnings were introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystems don't implement migrate_folio. The system reports the following warnings: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Implement metapage_migrate_folio(), which handles both single and multiple metapages per page configurations.
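For comparison, a filesystem whose folios carry no private migration state could satisfy the same requirement by wiring up the generic helper directly; a minimal sketch (the aops name is illustrative, the helpers are the stock pagecache ones):

/* Generic wiring: filemap_migrate_folio() relocates the pagecache
 * entry and folio contents. JFS needs the custom handler below
 * because its folio->private metapages must be moved as well. */
static const struct address_space_operations example_aops = {
	.dirty_folio	= filemap_dirty_folio,
	.migrate_folio	= filemap_migrate_folio,
};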
[shivankg@amd.com: change comment style] Link: https://lkml.kernel.org/r/1967593d-8084-4a4a-b384-35d5adc54eb4@amd.com [akpm@linux-foundation.org: fix build] [shivankg@amd.com: remove redundant NULL check in __metapage_migrate_folio()] Link: https://lkml.kernel.org/r/a67db238-0ca6-4725-abb2-dc092de87e1b@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-3-shivankg@amd.com Fixes: 35474d52c605 ("jfs: Convert metapage_writepage to metapage_write_folio") Signed-off-by: Shivank Garg Reported-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67faff52.050a0220.379d84.001b.GAE@google.com Tested-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Cc: Alistair Popple Cc: Dave Kleikamp Cc: David Hildenbrand Cc: Donet Tom Cc: Jane Chu Cc: Kefeng Wang Cc: Matthew Wilcox Cc: Zi Yan Cc: Dan Carpenter Cc: kernel test robot Signed-off-by: Andrew Morton --- fs/jfs/jfs_metapage.c | 106 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index df575a873ec6..9029cd216912 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -15,6 +15,7 @@ #include #include #include +#include #include "jfs_incore.h" #include "jfs_superblock.h" #include "jfs_filsys.h" @@ -151,7 +152,59 @@ static inline void dec_io(struct folio *folio, blk_status_t status, handler(folio, anchor->status); } +#ifdef CONFIG_MIGRATION +static int __metapage_migrate_folio(struct address_space *mapping, + struct folio *dst, struct folio *src, + enum migrate_mode mode) +{ + struct meta_anchor *src_anchor = src->private; + struct metapage *mps[MPS_PER_PAGE] = {0}; + struct metapage *mp; + int i, rc; + + for (i = 0; i < MPS_PER_PAGE; i++) { + mp = src_anchor->mp[i]; + if (mp && metapage_locked(mp)) + return -EAGAIN; + } + + rc = filemap_migrate_folio(mapping, dst, src, mode); + if (rc != MIGRATEPAGE_SUCCESS) + return rc; + + for (i = 0; i < MPS_PER_PAGE; i++) { + mp = src_anchor->mp[i]; + if (!mp) + continue; + if (unlikely(insert_metapage(dst, mp))) { + /* If error, roll back previously inserted pages */ + for (int j = 0; j < i; j++) { + if (mps[j]) + remove_metapage(dst, mps[j]); + } + return -EAGAIN; + } + mps[i] = mp; + } + + /* Update the metapage and remove it from src */ + for (i = 0; i < MPS_PER_PAGE; i++) { + mp = mps[i]; + if (mp) { + int page_offset = mp->data - folio_address(src); + + mp->data = folio_address(dst) + page_offset; + mp->folio = dst; + remove_metapage(src, mp); + } + } + + return MIGRATEPAGE_SUCCESS; +} +#endif /* CONFIG_MIGRATION */ + #else + static inline struct metapage *folio_to_mp(struct folio *folio, int offset) { return folio->private; } @@ -175,6 +228,35 @@ static inline void remove_metapage(struct folio *folio, struct metapage *mp) #define inc_io(folio) do {} while(0) #define dec_io(folio, status, handler) handler(folio, status) +#ifdef CONFIG_MIGRATION +static int __metapage_migrate_folio(struct address_space *mapping, + struct folio *dst, struct folio *src, + enum migrate_mode mode) +{ + struct metapage *mp; + int page_offset; + int rc; + + mp = folio_to_mp(src, 0); + if (metapage_locked(mp)) + return -EAGAIN; + + rc = filemap_migrate_folio(mapping, dst, src, mode); + if (rc != MIGRATEPAGE_SUCCESS) + return rc; + + if (unlikely(insert_metapage(dst, mp))) + return -EAGAIN; + + page_offset = mp->data - folio_address(src); + mp->data = folio_address(dst) + page_offset; + mp->folio = dst; + remove_metapage(src, mp); + + return MIGRATEPAGE_SUCCESS; +} +#endif /* 
CONFIG_MIGRATION */ + #endif static inline struct metapage *alloc_metapage(gfp_t gfp_mask) @@ -554,6 +636,29 @@ static bool metapage_release_folio(struct folio *folio, gfp_t gfp_mask) return ret; } +#ifdef CONFIG_MIGRATION +/* + * metapage_migrate_folio - Migration function for JFS metapages + */ +static int metapage_migrate_folio(struct address_space *mapping, + struct folio *dst, struct folio *src, + enum migrate_mode mode) +{ + int expected_count; + + if (!src->private) + return filemap_migrate_folio(mapping, dst, src, mode); + + /* Check whether page does not have extra refs before we do more work */ + expected_count = folio_expected_ref_count(src) + 1; + if (folio_ref_count(src) != expected_count) + return -EAGAIN; + return __metapage_migrate_folio(mapping, dst, src, mode); +} +#else +#define metapage_migrate_folio NULL +#endif /* CONFIG_MIGRATION */ + static void metapage_invalidate_folio(struct folio *folio, size_t offset, size_t length) { @@ -570,6 +675,7 @@ const struct address_space_operations jfs_metapage_aops = { .release_folio = metapage_release_folio, .invalidate_folio = metapage_invalidate_folio, .dirty_folio = filemap_dirty_folio, + .migrate_folio = metapage_migrate_folio, }; struct metapage *__get_metapage(struct inode *inode, unsigned long lblock, From e313ee4ebb357af5e316863ed15d01eb2c2e2a2f Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Wed, 30 Apr 2025 16:59:45 -0400 Subject: [PATCH 209/285] mm: kmemleak: drop kmemleak_warning variable These are trivial mm/kmemleak.c cleanups I found while reading through the code. This patch (of 3): The kmemleak_warning variable has been unused since commit c5665868183f ("mm: kmemleak: use the memory pool for early allocations"); drop it. Link: https://lkml.kernel.org/r/cover.1746046744.git.luizcap@redhat.com Link: https://lkml.kernel.org/r/97e23faa7b67099027a1094c9438da5f72e037af.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reviewed-by: Catalin Marinas Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/kmemleak.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index c12cef3eeb32..e6df94c7b032 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -215,8 +215,6 @@ static int kmemleak_enabled = 1; static int kmemleak_free_enabled = 1; /* set in the late_initcall if there were no errors */ static int kmemleak_late_initialized; -/* set if a kmemleak warning was issued */ -static int kmemleak_warning; /* set if a fatal kmemleak error has occurred */ static int kmemleak_error; @@ -254,7 +252,6 @@ static void kmemleak_disable(void); #define kmemleak_warn(x...) do { \ pr_warn(x); \ dump_stack(); \ - kmemleak_warning = 1; \ } while (0) /* From befbb2540aaea90993e150a84e342f04e24f3476 Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Wed, 30 Apr 2025 16:59:46 -0400 Subject: [PATCH 210/285] mm: kmemleak: drop wrong comment Newly created objects have object->count == 0, so the comment is incorrect. Just drop it.
Link: https://lkml.kernel.org/r/3dfd09bc0e77bb626619184a808774ff07de275c.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reviewed-by: Catalin Marinas Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/kmemleak.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index e6df94c7b032..06baa3475252 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -322,8 +322,6 @@ static void hex_dump_object(struct seq_file *seq, * sufficient references to it (count >= min_count) * - black - ignore, it doesn't contain references (e.g. text section) * (min_count == -1). No function defined for this color. - * Newly created objects don't have any color assigned (object->count == -1) - * before the next memory scan when they become white. */ static bool color_white(const struct kmemleak_object *object) { From 0f4286765e4364bfad479324f78cd015e402398e Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Wed, 30 Apr 2025 16:59:47 -0400 Subject: [PATCH 211/285] mm: kmemleak: mark variables as __read_mostly The variables kmemleak_enabled and kmemleak_free_enabled are read in the kmemleak alloc and free path respectively, but are only written to if/when kmemleak is disabled. Link: https://lkml.kernel.org/r/4016090e857e8c4c2ade4b20df312f7f38325c15.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino Reviewed-by: Catalin Marinas Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/kmemleak.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 06baa3475252..da9cee34ee1b 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -210,9 +210,9 @@ static struct kmem_cache *object_cache; static struct kmem_cache *scan_area_cache; /* set if tracing memory operations is enabled */ -static int kmemleak_enabled = 1; +static int kmemleak_enabled __read_mostly = 1; /* same as above but only for the kmemleak_free() callback */ -static int kmemleak_free_enabled = 1; +static int kmemleak_free_enabled __read_mostly = 1; /* set in the late_initcall if there were no errors */ static int kmemleak_late_initialized; /* set if a fatal kmemleak error has occurred */ From 6c36ac1e124f1be97cf0485a220865fce5a2020d Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 28 Apr 2025 16:28:14 +0100 Subject: [PATCH 212/285] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Patch series "move all VMA allocation, freeing and duplication logic to mm", v3. Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details. Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality. We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c accesses this via the mm internal header so it is not exposed elsewhere in the kernel. As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free(). In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight. 
As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place. We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests. This patch (of 4): There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. [akpm@linux-foundation.org: fix comment typo] Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Reviewed-by: Suren Baghdasaryan Reviewed-by: Pedro Falcato Reviewed-by: David Hildenbrand Reviewed-by: Kees Cook Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Signed-off-by: Andrew Morton --- MAINTAINERS | 2 + fs/exec.c | 3 ++ include/linux/mm.h | 1 - mm/Makefile | 2 +- mm/mmap.c | 83 ---------------------------- mm/vma.h | 5 ++ mm/vma_exec.c | 92 ++++++++++++++++++++++++++++++++ tools/testing/vma/Makefile | 2 +- tools/testing/vma/vma.c | 1 + tools/testing/vma/vma_internal.h | 40 ++++++++++++++ 10 files changed, 145 insertions(+), 86 deletions(-) create mode 100644 mm/vma_exec.c diff --git a/MAINTAINERS b/MAINTAINERS index eb5a8c791f01..836105023495 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8840,6 +8840,7 @@ F: include/linux/elf.h F: include/uapi/linux/auxvec.h F: include/uapi/linux/binfmts.h F: include/uapi/linux/elf.h +F: mm/vma_exec.c F: tools/testing/selftests/exec/ N: asm/elf.h N: binfmt @@ -15681,6 +15682,7 @@ F: mm/mremap.c F: mm/mseal.c F: mm/vma.c F: mm/vma.h +F: mm/vma_exec.c F: mm/vma_internal.h F: tools/testing/selftests/mm/merge.c F: tools/testing/vma/ diff --git a/fs/exec.c b/fs/exec.c index 8e4ea5f1e64c..477bc3f2e966 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -78,6 +78,9 @@ #include +/* For vma exec functions. 
*/ +#include "../mm/internal.h" + static int bprm_creds_from_file(struct linux_binprm *bprm); int suid_dumpable = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d1953e37baa..43748c8f3454 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3278,7 +3278,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); extern void exit_mmap(struct mm_struct *); -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, bool write); diff --git a/mm/Makefile b/mm/Makefile index e7f6bbf8ae5f..7aadec97c37b 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -37,7 +37,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o vma.o + pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o ifdef CONFIG_CROSS_MEMORY_ATTACH diff --git a/mm/mmap.c b/mm/mmap.c index bd210aaf7ebd..1794bf6f4dc0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1717,89 +1717,6 @@ static int __meminit init_reserve_notifier(void) } subsys_initcall(init_reserve_notifier); -/* - * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between - * this VMA and its relocated range, which will now reside at [vma->vm_start - - * shift, vma->vm_end - shift). - * - * This function is almost certainly NOT what you want for anything other than - * early executable temporary stack relocation. - */ -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) -{ - /* - * The process proceeds as follows: - * - * 1) Use shift to calculate the new vma endpoints. - * 2) Extend vma to cover both the old and new ranges. This ensures the - * arguments passed to subsequent functions are consistent. - * 3) Move vma's page tables to the new range. - * 4) Free up any cleared pgd range. - * 5) Shrink the vma to cover only the new range. - */ - - struct mm_struct *mm = vma->vm_mm; - unsigned long old_start = vma->vm_start; - unsigned long old_end = vma->vm_end; - unsigned long length = old_end - old_start; - unsigned long new_start = old_start - shift; - unsigned long new_end = old_end - shift; - VMA_ITERATOR(vmi, mm, new_start); - VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); - struct vm_area_struct *next; - struct mmu_gather tlb; - PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); - - BUG_ON(new_start > new_end); - - /* - * ensure there are no vmas between where we want to go - * and where we are - */ - if (vma != vma_next(&vmi)) - return -EFAULT; - - vma_iter_prev_range(&vmi); - /* - * cover the whole range: [new_start, old_end) - */ - vmg.middle = vma; - if (vma_expand(&vmg)) - return -ENOMEM; - - /* - * move the page tables downwards, on failure we rely on - * process cleanup to remove whatever mess we made. - */ - pmc.for_stack = true; - if (length != move_page_tables(&pmc)) - return -ENOMEM; - - tlb_gather_mmu(&tlb, mm); - next = vma_next(&vmi); - if (new_end > old_start) { - /* - * when the old and new regions overlap clear from new_end. - */ - free_pgd_range(&tlb, new_end, old_end, new_end, - next ? 
next->vm_start : USER_PGTABLES_CEILING); - } else { - /* - * otherwise, clean from old_start; this is done to not touch - * the address space in [new_end, old_start) some architectures - * have constraints on va-space that make this illegal (IA64) - - * for the others its just a little faster. - */ - free_pgd_range(&tlb, old_start, old_end, new_end, - next ? next->vm_start : USER_PGTABLES_CEILING); - } - tlb_finish_mmu(&tlb); - - vma_prev(&vmi); - /* Shrink the vma to just the new range */ - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); -} - #ifdef CONFIG_MMU /* * Obtain a read lock on mm->mmap_lock, if the specified address is below the diff --git a/mm/vma.h b/mm/vma.h index 149926e8a6d1..4413445e074b 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -548,4 +548,9 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); int __vm_munmap(unsigned long start, size_t len, bool unlock); +/* vma_exec.c */ +#ifdef CONFIG_MMU +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); +#endif + #endif /* __MM_VMA_H */ diff --git a/mm/vma_exec.c b/mm/vma_exec.c new file mode 100644 index 000000000000..6736ae37f748 --- /dev/null +++ b/mm/vma_exec.c @@ -0,0 +1,92 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Functions explicitly implemented for exec functionality which however are + * explicitly VMA-only logic. + */ + +#include "vma_internal.h" +#include "vma.h" + +/* + * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between + * this VMA and its relocated range, which will now reside at [vma->vm_start - + * shift, vma->vm_end - shift). + * + * This function is almost certainly NOT what you want for anything other than + * early executable temporary stack relocation. + */ +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) +{ + /* + * The process proceeds as follows: + * + * 1) Use shift to calculate the new vma endpoints. + * 2) Extend vma to cover both the old and new ranges. This ensures the + * arguments passed to subsequent functions are consistent. + * 3) Move vma's page tables to the new range. + * 4) Free up any cleared pgd range. + * 5) Shrink the vma to cover only the new range. + */ + + struct mm_struct *mm = vma->vm_mm; + unsigned long old_start = vma->vm_start; + unsigned long old_end = vma->vm_end; + unsigned long length = old_end - old_start; + unsigned long new_start = old_start - shift; + unsigned long new_end = old_end - shift; + VMA_ITERATOR(vmi, mm, new_start); + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); + struct vm_area_struct *next; + struct mmu_gather tlb; + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); + + BUG_ON(new_start > new_end); + + /* + * ensure there are no vmas between where we want to go + * and where we are + */ + if (vma != vma_next(&vmi)) + return -EFAULT; + + vma_iter_prev_range(&vmi); + /* + * cover the whole range: [new_start, old_end) + */ + vmg.middle = vma; + if (vma_expand(&vmg)) + return -ENOMEM; + + /* + * move the page tables downwards, on failure we rely on + * process cleanup to remove whatever mess we made. + */ + pmc.for_stack = true; + if (length != move_page_tables(&pmc)) + return -ENOMEM; + + tlb_gather_mmu(&tlb, mm); + next = vma_next(&vmi); + if (new_end > old_start) { + /* + * when the old and new regions overlap clear from new_end. + */ + free_pgd_range(&tlb, new_end, old_end, new_end, + next ? 
next->vm_start : USER_PGTABLES_CEILING); + } else { + /* + * otherwise, clean from old_start; this is done to not touch + * the address space in [new_end, old_start) some architectures + * have constraints on va-space that make this illegal (IA64) - + * for the others its just a little faster. + */ + free_pgd_range(&tlb, old_start, old_end, new_end, + next ? next->vm_start : USER_PGTABLES_CEILING); + } + tlb_finish_mmu(&tlb); + + vma_prev(&vmi); + /* Shrink the vma to just the new range */ + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); +} diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile index 860fd2311dcc..624040fcf193 100644 --- a/tools/testing/vma/Makefile +++ b/tools/testing/vma/Makefile @@ -9,7 +9,7 @@ include ../shared/shared.mk OFILES = $(SHARED_OFILES) vma.o maple-shim.o TARGETS = vma -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h vma: $(OFILES) $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 7cfd6e31db10..5832ae5d797d 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<mas); @@ -1240,4 +1262,22 @@ static inline int mapping_map_writable(struct address_space *mapping) return 0; } +static inline unsigned long move_page_tables(struct pagetable_move_control *pmc) +{ + (void)pmc; + + return 0; +} + +static inline void free_pgd_range(struct mmu_gather *tlb, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + (void)tlb; + (void)addr; + (void)end; + (void)floor; + (void)ceiling; +} + #endif /* __MM_VMA_INTERNAL_H */ From dd7a6246f4fd6e8a6dcb08f1f51c899f3e0d3b83 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 28 Apr 2025 16:28:15 +0100 Subject: [PATCH 213/285] mm: abstract initial stack setup to mm subsystem There are peculiarities within the kernel where what is very clearly mm code is performed elsewhere arbitrarily. This violates separation of concerns and makes it harder to refactor code to make changes to how fundamental initialisation and operation of mm logic is performed. One such case is the creation of the VMA containing the initial stack upon execve()'ing a new process. This is currently performed in __bprm_mm_init() in fs/exec.c. Abstract this operation to create_init_stack_vma(). This allows us to limit use of vma allocation and free code to fork and mm only. We previously did the same for the step at which we relocate the initial stack VMA downwards via relocate_vma_down(), now we move the initial VMA establishment too. Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no longer needed anywhere outside of mm. Link: https://lkml.kernel.org/r/118c950ef7a8dd19ab20a23a68c3603751acd30e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: David Hildenbrand Reviewed-by: Suren Baghdasaryan Reviewed-by: Liam R. 
Howlett Reviewed-by: Pedro Falcato Reviewed-by: Kees Cook Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Signed-off-by: Andrew Morton --- fs/exec.c | 66 +++--------------------------- mm/mmap.c | 42 ------------------- mm/vma.c | 43 ++++++++++++++++++++ mm/vma.h | 4 ++ mm/vma_exec.c | 69 ++++++++++++++++++++++++++++++++ tools/testing/vma/vma_internal.h | 32 +++++++++++++++ 6 files changed, 153 insertions(+), 103 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 477bc3f2e966..f9bbcf0016a4 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -245,60 +245,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, flush_cache_page(bprm->vma, pos, page_to_pfn(page)); } -static int __bprm_mm_init(struct linux_binprm *bprm) -{ - int err; - struct vm_area_struct *vma = NULL; - struct mm_struct *mm = bprm->mm; - - bprm->vma = vma = vm_area_alloc(mm); - if (!vma) - return -ENOMEM; - vma_set_anonymous(vma); - - if (mmap_write_lock_killable(mm)) { - err = -EINTR; - goto err_free; - } - - /* - * Need to be called with mmap write lock - * held, to avoid race with ksmd. - */ - err = ksm_execve(mm); - if (err) - goto err_ksm; - - /* - * Place the stack at the largest stack address the architecture - * supports. Later, we'll move this to an appropriate place. We don't - * use STACK_TOP because that can depend on attributes which aren't - * configured yet. - */ - BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); - vma->vm_end = STACK_TOP_MAX; - vma->vm_start = vma->vm_end - PAGE_SIZE; - vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); - vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); - - err = insert_vm_struct(mm, vma); - if (err) - goto err; - - mm->stack_vm = mm->total_vm = 1; - mmap_write_unlock(mm); - bprm->p = vma->vm_end - sizeof(void *); - return 0; -err: - ksm_exit(mm); -err_ksm: - mmap_write_unlock(mm); -err_free: - bprm->vma = NULL; - vm_area_free(vma); - return err; -} - static bool valid_arg_len(struct linux_binprm *bprm, long len) { return len <= MAX_ARG_STRLEN; @@ -351,12 +297,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, { } -static int __bprm_mm_init(struct linux_binprm *bprm) -{ - bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); - return 0; -} - static bool valid_arg_len(struct linux_binprm *bprm, long len) { return len <= bprm->p; @@ -385,9 +325,13 @@ static int bprm_mm_init(struct linux_binprm *bprm) bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK]; task_unlock(current->group_leader); - err = __bprm_mm_init(bprm); +#ifndef CONFIG_MMU + bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); +#else + err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p); if (err) goto err; +#endif return 0; diff --git a/mm/mmap.c b/mm/mmap.c index 1794bf6f4dc0..9e09eac0021c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1321,48 +1321,6 @@ destroy: vm_unacct_memory(nr_accounted); } -/* Insert vm structure into process list sorted by address - * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_rwsem is taken here. 
- */ -int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) -{ - unsigned long charged = vma_pages(vma); - - - if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) - return -ENOMEM; - - if ((vma->vm_flags & VM_ACCOUNT) && - security_vm_enough_memory_mm(mm, charged)) - return -ENOMEM; - - /* - * The vm_pgoff of a purely anonymous vma should be irrelevant - * until its first write fault, when page's anon_vma and index - * are set. But now set the vm_pgoff it will almost certainly - * end up with (unless mremap moves it elsewhere before that - * first wfault), so /proc/pid/maps tells a consistent story. - * - * By setting it to reflect the virtual start address of the - * vma, merges and splits can happen in a seamless way, just - * using the existing file pgoff checks and manipulations. - * Similarly in do_mmap and in do_brk_flags. - */ - if (vma_is_anonymous(vma)) { - BUG_ON(vma->anon_vma); - vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; - } - - if (vma_link(mm, vma)) { - if (vma->vm_flags & VM_ACCOUNT) - vm_unacct_memory(charged); - return -ENOMEM; - } - - return 0; -} - /* * Return true if the calling process may expand its vm space by the passed * number of pages diff --git a/mm/vma.c b/mm/vma.c index 8a6c5e835759..1f2634b29568 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -3052,3 +3052,46 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock) userfaultfd_unmap_complete(mm, &uf); return ret; } + + +/* Insert vm structure into process list sorted by address + * and into the inode's i_mmap tree. If vm_file is non-NULL + * then i_mmap_rwsem is taken here. + */ +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) +{ + unsigned long charged = vma_pages(vma); + + + if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) + return -ENOMEM; + + if ((vma->vm_flags & VM_ACCOUNT) && + security_vm_enough_memory_mm(mm, charged)) + return -ENOMEM; + + /* + * The vm_pgoff of a purely anonymous vma should be irrelevant + * until its first write fault, when page's anon_vma and index + * are set. But now set the vm_pgoff it will almost certainly + * end up with (unless mremap moves it elsewhere before that + * first wfault), so /proc/pid/maps tells a consistent story. + * + * By setting it to reflect the virtual start address of the + * vma, merges and splits can happen in a seamless way, just + * using the existing file pgoff checks and manipulations. + * Similarly in do_mmap and in do_brk_flags. 
+ */ + if (vma_is_anonymous(vma)) { + BUG_ON(vma->anon_vma); + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; + } + + if (vma_link(mm, vma)) { + if (vma->vm_flags & VM_ACCOUNT) + vm_unacct_memory(charged); + return -ENOMEM; + } + + return 0; +} diff --git a/mm/vma.h b/mm/vma.h index 4413445e074b..151e292263fd 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -548,8 +548,12 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); int __vm_munmap(unsigned long start, size_t len, bool unlock); +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); + /* vma_exec.c */ #ifdef CONFIG_MMU +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, + unsigned long *top_mem_p); int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); #endif diff --git a/mm/vma_exec.c b/mm/vma_exec.c index 6736ae37f748..2dffb02ed6a2 100644 --- a/mm/vma_exec.c +++ b/mm/vma_exec.c @@ -90,3 +90,72 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) /* Shrink the vma to just the new range */ return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); } + +/* + * Establish the stack VMA in an execve'd process, located temporarily at the + * maximum stack address provided by the architecture. + * + * We later relocate this downwards in relocate_vma_down(). + * + * This function is almost certainly NOT what you want for anything other than + * early executable initialisation. + * + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the + * maximum addressable location in the stack (that is capable of storing a + * system word of data). + */ +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, + unsigned long *top_mem_p) +{ + int err; + struct vm_area_struct *vma = vm_area_alloc(mm); + + if (!vma) + return -ENOMEM; + + vma_set_anonymous(vma); + + if (mmap_write_lock_killable(mm)) { + err = -EINTR; + goto err_free; + } + + /* + * Need to be called with mmap write lock + * held, to avoid race with ksmd. + */ + err = ksm_execve(mm); + if (err) + goto err_ksm; + + /* + * Place the stack at the largest stack address the architecture + * supports. Later, we'll move this to an appropriate place. We don't + * use STACK_TOP because that can depend on attributes which aren't + * configured yet. 
+ */ + BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); + vma->vm_end = STACK_TOP_MAX; + vma->vm_start = vma->vm_end - PAGE_SIZE; + vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); + vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); + + err = insert_vm_struct(mm, vma); + if (err) + goto err; + + mm->stack_vm = mm->total_vm = 1; + mmap_write_unlock(mm); + *vmap = vma; + *top_mem_p = vma->vm_end - sizeof(void *); + return 0; + +err: + ksm_exit(mm); +err_ksm: + mmap_write_unlock(mm); +err_free: + *vmap = NULL; + vm_area_free(vma); + return err; +} diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 0df19ca0000a..32e990313158 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,8 @@ extern unsigned long dac_mmap_min_addr; #define VM_PFNMAP 0x00000400 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 +#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */ +#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */ #define VM_DONTEXPAND 0x00040000 #define VM_LOCKONFAULT 0x00080000 #define VM_ACCOUNT 0x00100000 @@ -70,6 +72,20 @@ extern unsigned long dac_mmap_min_addr; #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC) #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) +#ifdef CONFIG_STACK_GROWSUP +#define VM_STACK VM_GROWSUP +#define VM_STACK_EARLY VM_GROWSDOWN +#else +#define VM_STACK VM_GROWSDOWN +#define VM_STACK_EARLY 0 +#endif + +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) +#define TASK_SIZE_LOW DEFAULT_MAP_WINDOW +#define TASK_SIZE_MAX DEFAULT_MAP_WINDOW +#define STACK_TOP TASK_SIZE_LOW +#define STACK_TOP_MAX TASK_SIZE_MAX + /* This mask represents all the VMA flag bits used by mlock */ #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT) @@ -82,6 +98,10 @@ extern unsigned long dac_mmap_min_addr; #define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK) +#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS +#define VM_STACK_FLAGS (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT) +#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) + #define RLIMIT_STACK 3 /* max stack size */ #define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */ @@ -1280,4 +1300,16 @@ static inline void free_pgd_range(struct mmu_gather *tlb, (void)ceiling; } +static inline int ksm_execve(struct mm_struct *mm) +{ + (void)mm; + + return 0; +} + +static inline void ksm_exit(struct mm_struct *mm) +{ + (void)mm; +} + #endif /* __MM_VMA_INTERNAL_H */ From 26a8f57760c12ccd860b36ac5b73daede4396aa3 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 28 Apr 2025 16:28:16 +0100 Subject: [PATCH 214/285] mm: move dup_mmap() to mm This is a key step in our being able to abstract and isolate VMA allocation and destruction logic. This function is the last one where vm_area_free() and vm_area_dup() are directly referenced outside of mmap, so having this in mm allows us to isolate these. We do the same for the nommu version, which is substantially simpler. We place the declaration for dup_mmap() in mm/internal.h and have kernel/fork.c import this in order to prevent improper use of this functionality elsewhere in the kernel. While we're here, we remove the useless #ifdef CONFIG_MMU check around mmap_read_lock_maybe_expand() in mmap.c; mmap.c is compiled only if CONFIG_MMU is set. 
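The encapsulation here is enforced purely by header placement rather than by the compiler; a minimal sketch of the pattern (declaration abridged from the series, comments illustrative):

/* mm/internal.h: visible to mm/ itself and only to the few files
 * that deliberately reach into it, such as kernel/fork.c. */
int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);

/* kernel/fork.c: the explicit relative include marks this as a
 * sanctioned exception rather than a kernel-wide public API. */
#include "../mm/internal.h"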
Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Suggested-by: Pedro Falcato Reviewed-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Suren Baghdasaryan Reviewed-by: David Hildenbrand Reviewed-by: Kees Cook Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Signed-off-by: Andrew Morton --- kernel/fork.c | 189 ++------------------------------------------------ mm/internal.h | 2 + mm/mmap.c | 181 +++++++++++++++++++++++++++++++++++++++++++++-- mm/nommu.c | 8 +++ 4 files changed, 189 insertions(+), 191 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 168681fc4b25..ac9f9267a473 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -112,6 +112,9 @@ #include #include +/* For dup_mmap(). */ +#include "../mm/internal.h" + #include #define CREATE_TRACE_POINTS @@ -589,7 +592,7 @@ void free_task(struct task_struct *tsk) } EXPORT_SYMBOL(free_task); -static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) { struct file *exe_file; @@ -604,183 +607,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) } #ifdef CONFIG_MMU -static __latent_entropy int dup_mmap(struct mm_struct *mm, - struct mm_struct *oldmm) -{ - struct vm_area_struct *mpnt, *tmp; - int retval; - unsigned long charge = 0; - LIST_HEAD(uf); - VMA_ITERATOR(vmi, mm, 0); - - if (mmap_write_lock_killable(oldmm)) - return -EINTR; - flush_cache_dup_mm(oldmm); - uprobe_dup_mmap(oldmm, mm); - /* - * Not linked in yet - no deadlock potential: - */ - mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); - - /* No ordering required: file already has been exposed. */ - dup_mm_exe_file(mm, oldmm); - - mm->total_vm = oldmm->total_vm; - mm->data_vm = oldmm->data_vm; - mm->exec_vm = oldmm->exec_vm; - mm->stack_vm = oldmm->stack_vm; - - /* Use __mt_dup() to efficiently build an identical maple tree. */ - retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); - if (unlikely(retval)) - goto out; - - mt_clear_in_rcu(vmi.mas.tree); - for_each_vma(vmi, mpnt) { - struct file *file; - - vma_start_write(mpnt); - if (mpnt->vm_flags & VM_DONTCOPY) { - retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, - mpnt->vm_end, GFP_KERNEL); - if (retval) - goto loop_out; - - vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); - continue; - } - charge = 0; - /* - * Don't duplicate many vmas if we've been oom-killed (for - * example) - */ - if (fatal_signal_pending(current)) { - retval = -EINTR; - goto loop_out; - } - if (mpnt->vm_flags & VM_ACCOUNT) { - unsigned long len = vma_pages(mpnt); - - if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ - goto fail_nomem; - charge = len; - } - tmp = vm_area_dup(mpnt); - if (!tmp) - goto fail_nomem; - - /* track_pfn_copy() will later take care of copying internal state. */ - if (unlikely(tmp->vm_flags & VM_PFNMAP)) - untrack_pfn_clear(tmp); - - retval = vma_dup_policy(mpnt, tmp); - if (retval) - goto fail_nomem_policy; - tmp->vm_mm = mm; - retval = dup_userfaultfd(tmp, &uf); - if (retval) - goto fail_nomem_anon_vma_fork; - if (tmp->vm_flags & VM_WIPEONFORK) { - /* - * VM_WIPEONFORK gets a clean slate in the child. - * Don't prepare anon_vma until fault since we don't - * copy page for current vma. 
- */ - tmp->anon_vma = NULL; - } else if (anon_vma_fork(tmp, mpnt)) - goto fail_nomem_anon_vma_fork; - vm_flags_clear(tmp, VM_LOCKED_MASK); - /* - * Copy/update hugetlb private vma information. - */ - if (is_vm_hugetlb_page(tmp)) - hugetlb_dup_vma_private(tmp); - - /* - * Link the vma into the MT. After using __mt_dup(), memory - * allocation is not necessary here, so it cannot fail. - */ - vma_iter_bulk_store(&vmi, tmp); - - mm->map_count++; - - if (tmp->vm_ops && tmp->vm_ops->open) - tmp->vm_ops->open(tmp); - - file = tmp->vm_file; - if (file) { - struct address_space *mapping = file->f_mapping; - - get_file(file); - i_mmap_lock_write(mapping); - if (vma_is_shared_maywrite(tmp)) - mapping_allow_writable(mapping); - flush_dcache_mmap_lock(mapping); - /* insert tmp into the share list, just after mpnt */ - vma_interval_tree_insert_after(tmp, mpnt, - &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); - i_mmap_unlock_write(mapping); - } - - if (!(tmp->vm_flags & VM_WIPEONFORK)) - retval = copy_page_range(tmp, mpnt); - - if (retval) { - mpnt = vma_next(&vmi); - goto loop_out; - } - } - /* a new mm has just been created */ - retval = arch_dup_mmap(oldmm, mm); -loop_out: - vma_iter_free(&vmi); - if (!retval) { - mt_set_in_rcu(vmi.mas.tree); - ksm_fork(mm, oldmm); - khugepaged_fork(mm, oldmm); - } else { - - /* - * The entire maple tree has already been duplicated. If the - * mmap duplication fails, mark the failure point with - * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, - * stop releasing VMAs that have not been duplicated after this - * point. - */ - if (mpnt) { - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); - mas_store(&vmi.mas, XA_ZERO_ENTRY); - /* Avoid OOM iterating a broken tree */ - set_bit(MMF_OOM_SKIP, &mm->flags); - } - /* - * The mm_struct is going to exit, but the locks will be dropped - * first. Set the mm_struct as unstable is advisable as it is - * not fully initialised. 
- */ - set_bit(MMF_UNSTABLE, &mm->flags); - } -out: - mmap_write_unlock(mm); - flush_tlb_mm(oldmm); - mmap_write_unlock(oldmm); - if (!retval) - dup_userfaultfd_complete(&uf); - else - dup_userfaultfd_fail(&uf); - return retval; - -fail_nomem_anon_vma_fork: - mpol_put(vma_policy(tmp)); -fail_nomem_policy: - vm_area_free(tmp); -fail_nomem: - retval = -ENOMEM; - vm_unacct_memory(charge); - goto loop_out; -} - static inline int mm_alloc_pgd(struct mm_struct *mm) { mm->pgd = pgd_alloc(mm); @@ -794,13 +620,6 @@ static inline void mm_free_pgd(struct mm_struct *mm) pgd_free(mm, mm->pgd); } #else -static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) -{ - mmap_write_lock(oldmm); - dup_mm_exe_file(mm, oldmm); - mmap_write_unlock(oldmm); - return 0; -} #define mm_alloc_pgd(mm) (0) #define mm_free_pgd(mm) #endif /* CONFIG_MMU */ diff --git a/mm/internal.h b/mm/internal.h index cf7c0e9ef7ec..6b8ed2017743 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1624,5 +1624,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, } #endif /* CONFIG_PT_RECLAIM */ +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm); #endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index 9e09eac0021c..5259df031e15 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1675,7 +1675,6 @@ static int __meminit init_reserve_notifier(void) } subsys_initcall(init_reserve_notifier); -#ifdef CONFIG_MMU /* * Obtain a read lock on mm->mmap_lock, if the specified address is below the * start of the VMA, the intent is to perform a write, and it is a @@ -1719,10 +1718,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm, mmap_write_downgrade(mm); return true; } -#else -bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, bool write) + +__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) { - return false; + struct vm_area_struct *mpnt, *tmp; + int retval; + unsigned long charge = 0; + LIST_HEAD(uf); + VMA_ITERATOR(vmi, mm, 0); + + if (mmap_write_lock_killable(oldmm)) + return -EINTR; + flush_cache_dup_mm(oldmm); + uprobe_dup_mmap(oldmm, mm); + /* + * Not linked in yet - no deadlock potential: + */ + mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); + + /* No ordering required: file already has been exposed. */ + dup_mm_exe_file(mm, oldmm); + + mm->total_vm = oldmm->total_vm; + mm->data_vm = oldmm->data_vm; + mm->exec_vm = oldmm->exec_vm; + mm->stack_vm = oldmm->stack_vm; + + /* Use __mt_dup() to efficiently build an identical maple tree. */ + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); + if (unlikely(retval)) + goto out; + + mt_clear_in_rcu(vmi.mas.tree); + for_each_vma(vmi, mpnt) { + struct file *file; + + vma_start_write(mpnt); + if (mpnt->vm_flags & VM_DONTCOPY) { + retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, + mpnt->vm_end, GFP_KERNEL); + if (retval) + goto loop_out; + + vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); + continue; + } + charge = 0; + /* + * Don't duplicate many vmas if we've been oom-killed (for + * example) + */ + if (fatal_signal_pending(current)) { + retval = -EINTR; + goto loop_out; + } + if (mpnt->vm_flags & VM_ACCOUNT) { + unsigned long len = vma_pages(mpnt); + + if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ + goto fail_nomem; + charge = len; + } + + tmp = vm_area_dup(mpnt); + if (!tmp) + goto fail_nomem; + + /* track_pfn_copy() will later take care of copying internal state. 
*/ + if (unlikely(tmp->vm_flags & VM_PFNMAP)) + untrack_pfn_clear(tmp); + + retval = vma_dup_policy(mpnt, tmp); + if (retval) + goto fail_nomem_policy; + tmp->vm_mm = mm; + retval = dup_userfaultfd(tmp, &uf); + if (retval) + goto fail_nomem_anon_vma_fork; + if (tmp->vm_flags & VM_WIPEONFORK) { + /* + * VM_WIPEONFORK gets a clean slate in the child. + * Don't prepare anon_vma until fault since we don't + * copy page for current vma. + */ + tmp->anon_vma = NULL; + } else if (anon_vma_fork(tmp, mpnt)) + goto fail_nomem_anon_vma_fork; + vm_flags_clear(tmp, VM_LOCKED_MASK); + /* + * Copy/update hugetlb private vma information. + */ + if (is_vm_hugetlb_page(tmp)) + hugetlb_dup_vma_private(tmp); + + /* + * Link the vma into the MT. After using __mt_dup(), memory + * allocation is not necessary here, so it cannot fail. + */ + vma_iter_bulk_store(&vmi, tmp); + + mm->map_count++; + + if (tmp->vm_ops && tmp->vm_ops->open) + tmp->vm_ops->open(tmp); + + file = tmp->vm_file; + if (file) { + struct address_space *mapping = file->f_mapping; + + get_file(file); + i_mmap_lock_write(mapping); + if (vma_is_shared_maywrite(tmp)) + mapping_allow_writable(mapping); + flush_dcache_mmap_lock(mapping); + /* insert tmp into the share list, just after mpnt */ + vma_interval_tree_insert_after(tmp, mpnt, + &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); + i_mmap_unlock_write(mapping); + } + + if (!(tmp->vm_flags & VM_WIPEONFORK)) + retval = copy_page_range(tmp, mpnt); + + if (retval) { + mpnt = vma_next(&vmi); + goto loop_out; + } + } + /* a new mm has just been created */ + retval = arch_dup_mmap(oldmm, mm); +loop_out: + vma_iter_free(&vmi); + if (!retval) { + mt_set_in_rcu(vmi.mas.tree); + ksm_fork(mm, oldmm); + khugepaged_fork(mm, oldmm); + } else { + + /* + * The entire maple tree has already been duplicated. If the + * mmap duplication fails, mark the failure point with + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, + * stop releasing VMAs that have not been duplicated after this + * point. + */ + if (mpnt) { + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); + mas_store(&vmi.mas, XA_ZERO_ENTRY); + /* Avoid OOM iterating a broken tree */ + set_bit(MMF_OOM_SKIP, &mm->flags); + } + /* + * The mm_struct is going to exit, but the locks will be dropped + * first. Set the mm_struct as unstable is advisable as it is + * not fully initialised. 
+ */ + set_bit(MMF_UNSTABLE, &mm->flags); + } +out: + mmap_write_unlock(mm); + flush_tlb_mm(oldmm); + mmap_write_unlock(oldmm); + if (!retval) + dup_userfaultfd_complete(&uf); + else + dup_userfaultfd_fail(&uf); + return retval; + +fail_nomem_anon_vma_fork: + mpol_put(vma_policy(tmp)); +fail_nomem_policy: + vm_area_free(tmp); +fail_nomem: + retval = -ENOMEM; + vm_unacct_memory(charge); + goto loop_out; } -#endif diff --git a/mm/nommu.c b/mm/nommu.c index 2b4d304c6445..a142fc258d39 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1874,3 +1874,11 @@ static int __meminit init_admin_reserve(void) return 0; } subsys_initcall(init_admin_reserve); + +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) +{ + mmap_write_lock(oldmm); + dup_mm_exe_file(mm, oldmm); + mmap_write_unlock(oldmm); + return 0; +} From 3e43e260f1e44d21861815faa905a1829027600f Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 28 Apr 2025 16:28:17 +0100 Subject: [PATCH 215/285] mm: perform VMA allocation, freeing, duplication in mm Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward. There is a fly in the ointment - nommu - mmap.c is not compiled if CONFIG_MMU not set, and neither is vma.c. To square the circle, let's add a new file - vma_init.c. This will be compiled for both CONFIG_MMU and nommu builds, and will also form part of the VMA userland testing. This allows us to de-duplicate code, while maintaining separation of concerns and the ability for us to userland test this logic. Update the VMA userland tests accordingly, additionally adding a detach_free_vma() helper function to correctly detach VMAs before freeing them in test code, as this change was triggering the assert for this. [akpm@linux-foundation.org: remove stray newline, per Liam] Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. 
Howlett Reviewed-by: Pedro Falcato Reviewed-by: David Hildenbrand Reviewed-by: Kees Cook Reviewed-by: Suren Baghdasaryan Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + kernel/fork.c | 88 ------------------- mm/Makefile | 2 +- mm/mmap.c | 3 +- mm/nommu.c | 4 +- mm/vma.h | 6 ++ mm/vma_init.c | 101 ++++++++++++++++++++++ tools/testing/vma/Makefile | 2 +- tools/testing/vma/vma.c | 26 ++++-- tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ 10 files changed, 250 insertions(+), 126 deletions(-) create mode 100644 mm/vma_init.c diff --git a/MAINTAINERS b/MAINTAINERS index 836105023495..ab6b08dbc779 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15683,6 +15683,7 @@ F: mm/mseal.c F: mm/vma.c F: mm/vma.h F: mm/vma_exec.c +F: mm/vma_init.c F: mm/vma_internal.h F: tools/testing/selftests/mm/merge.c F: tools/testing/vma/ diff --git a/kernel/fork.c b/kernel/fork.c index ac9f9267a473..9e4616dacd82 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; /* SLAB cache for fs_struct structures (tsk->fs) */ struct kmem_cache *fs_cachep; -/* SLAB cache for vm_area_struct structures */ -static struct kmem_cache *vm_area_cachep; - /* SLAB cache for mm_struct structures (tsk->mm) */ static struct kmem_cache *mm_cachep; -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) -{ - struct vm_area_struct *vma; - - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); - if (!vma) - return NULL; - - vma_init(vma, mm); - - return vma; -} - -static void vm_area_init_from(const struct vm_area_struct *src, - struct vm_area_struct *dest) -{ - dest->vm_mm = src->vm_mm; - dest->vm_ops = src->vm_ops; - dest->vm_start = src->vm_start; - dest->vm_end = src->vm_end; - dest->anon_vma = src->anon_vma; - dest->vm_pgoff = src->vm_pgoff; - dest->vm_file = src->vm_file; - dest->vm_private_data = src->vm_private_data; - vm_flags_init(dest, src->vm_flags); - memcpy(&dest->vm_page_prot, &src->vm_page_prot, - sizeof(dest->vm_page_prot)); - /* - * src->shared.rb may be modified concurrently when called from - * dup_mmap(), but the clone will reinitialize it. - */ - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, - sizeof(dest->vm_userfaultfd_ctx)); -#ifdef CONFIG_ANON_VMA_NAME - dest->anon_name = src->anon_name; -#endif -#ifdef CONFIG_SWAP - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, - sizeof(dest->swap_readahead_info)); -#endif -#ifndef CONFIG_MMU - dest->vm_region = src->vm_region; -#endif -#ifdef CONFIG_NUMA - dest->vm_policy = src->vm_policy; -#endif -} - -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) -{ - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); - - if (!new) - return NULL; - - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); - vm_area_init_from(orig, new); - vma_lock_init(new, true); - INIT_LIST_HEAD(&new->anon_vma_chain); - vma_numab_state_init(new); - dup_anon_vma_name(orig, new); - - return new; -} - -void vm_area_free(struct vm_area_struct *vma) -{ - /* The vma should be detached while being destroyed. 
*/ - vma_assert_detached(vma); - vma_numab_state_free(vma); - free_anon_vma_name(vma); - kmem_cache_free(vm_area_cachep, vma); -} - static void account_kernel_stack(struct task_struct *tsk, int account) { if (IS_ENABLED(CONFIG_VMAP_STACK)) { @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) void __init proc_caches_init(void) { - struct kmem_cache_args args = { - .use_freeptr_offset = true, - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), - }; - sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) sizeof(struct fs_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); - vm_area_cachep = kmem_cache_create("vm_area_struct", - sizeof(struct vm_area_struct), &args, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| - SLAB_ACCOUNT); mmap_init(); nsproxy_cache_init(); } diff --git a/mm/Makefile b/mm/Makefile index 7aadec97c37b..1a7a11d4933d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ mm_init.o percpu.o slab_common.o \ compaction.o show_mem.o \ interval_tree.o list_lru.o workingset.o \ - debug.o gup.o mmap_lock.o $(mmu-y) + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) # Give 'page_alloc' its own module-parameter namespace page-alloc-y := page_alloc.o diff --git a/mm/mmap.c b/mm/mmap.c index 5259df031e15..81dd962a1cfc 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { #endif /* CONFIG_SYSCTL */ /* - * initialise the percpu counter for VM + * initialise the percpu counter for VM, initialise VMA state. */ void __init mmap_init(void) { @@ -1565,6 +1565,7 @@ void __init mmap_init(void) #ifdef CONFIG_SYSCTL register_sysctl_init("vm", mmap_table); #endif + vma_state_init(); } /* diff --git a/mm/nommu.c b/mm/nommu.c index a142fc258d39..0bf4849b8204 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { }; /* - * initialise the percpu counter for VM and region record slabs + * initialise the percpu counter for VM and region record slabs, initialise VMA + * state. */ void __init mmap_init(void) { @@ -409,6 +410,7 @@ void __init mmap_init(void) VM_BUG_ON(ret); vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); register_sysctl_init("vm", nommu_table); + vma_state_init(); } /* diff --git a/mm/vma.h b/mm/vma.h index 151e292263fd..9a8af9be29a8 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -550,6 +550,12 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock); int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); +/* vma_init.h, shared between CONFIG_MMU and nommu. */ +void __init vma_state_init(void); +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); +void vm_area_free(struct vm_area_struct *vma); + /* vma_exec.c */ #ifdef CONFIG_MMU int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, diff --git a/mm/vma_init.c b/mm/vma_init.c new file mode 100644 index 000000000000..967ca8517986 --- /dev/null +++ b/mm/vma_init.c @@ -0,0 +1,101 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. 
+ */ + +#include "vma_internal.h" +#include "vma.h" + +/* SLAB cache for vm_area_struct structures */ +static struct kmem_cache *vm_area_cachep; + +void __init vma_state_init(void) +{ + struct kmem_cache_args args = { + .use_freeptr_offset = true, + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), + }; + + vm_area_cachep = kmem_cache_create("vm_area_struct", + sizeof(struct vm_area_struct), &args, + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| + SLAB_ACCOUNT); +} + +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + if (!vma) + return NULL; + + vma_init(vma, mm); + + return vma; +} + +static void vm_area_init_from(const struct vm_area_struct *src, + struct vm_area_struct *dest) +{ + dest->vm_mm = src->vm_mm; + dest->vm_ops = src->vm_ops; + dest->vm_start = src->vm_start; + dest->vm_end = src->vm_end; + dest->anon_vma = src->anon_vma; + dest->vm_pgoff = src->vm_pgoff; + dest->vm_file = src->vm_file; + dest->vm_private_data = src->vm_private_data; + vm_flags_init(dest, src->vm_flags); + memcpy(&dest->vm_page_prot, &src->vm_page_prot, + sizeof(dest->vm_page_prot)); + /* + * src->shared.rb may be modified concurrently when called from + * dup_mmap(), but the clone will reinitialize it. + */ + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, + sizeof(dest->vm_userfaultfd_ctx)); +#ifdef CONFIG_ANON_VMA_NAME + dest->anon_name = src->anon_name; +#endif +#ifdef CONFIG_SWAP + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, + sizeof(dest->swap_readahead_info)); +#endif +#ifndef CONFIG_MMU + dest->vm_region = src->vm_region; +#endif +#ifdef CONFIG_NUMA + dest->vm_policy = src->vm_policy; +#endif +} + +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) +{ + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + + if (!new) + return NULL; + + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); + vm_area_init_from(orig, new); + vma_lock_init(new, true); + INIT_LIST_HEAD(&new->anon_vma_chain); + vma_numab_state_init(new); + dup_anon_vma_name(orig, new); + + return new; +} + +void vm_area_free(struct vm_area_struct *vma) +{ + /* The vma should be detached while being destroyed. 
*/ + vma_assert_detached(vma); + vma_numab_state_free(vma); + free_anon_vma_name(vma); + kmem_cache_free(vm_area_cachep, vma); +} diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile index 624040fcf193..66f3831a668f 100644 --- a/tools/testing/vma/Makefile +++ b/tools/testing/vma/Makefile @@ -9,7 +9,7 @@ include ../shared/shared.mk OFILES = $(SHARED_OFILES) vma.o maple-shim.o TARGETS = vma -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h vma: $(OFILES) $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 5832ae5d797d..2be7597a2ac2 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<vm_pgoff, 0); ASSERT_EQ(vma->vm_flags, flags); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -361,7 +368,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x1000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); vma_iter_clear(&vmi); vma = vma_next(&vmi); @@ -370,7 +377,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x2000); ASSERT_EQ(vma->vm_pgoff, 1); - vm_area_free(vma); + detach_free_vma(vma); vma_iter_clear(&vmi); vma = vma_next(&vmi); @@ -379,7 +386,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x3000); ASSERT_EQ(vma->vm_pgoff, 2); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -407,7 +414,7 @@ static bool test_simple_expand(void) ASSERT_EQ(vma->vm_end, 0x3000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) ASSERT_EQ(vma->vm_end, 0x1000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -619,7 +626,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); - vm_area_free(vma); + detach_free_vma(vma); count++; } @@ -1668,6 +1675,7 @@ int main(void) int num_tests = 0, num_fail = 0; maple_tree_init(); + vma_state_init(); #define TEST(name) \ do { \ diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 32e990313158..198abe66de5a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; */ #define pr_warn_once pr_err +#define data_race(expr) expr + +#define ASSERT_EXCLUSIVE_WRITER(x) + struct kref { refcount_t refcount; }; @@ -255,6 +259,8 @@ struct file { #define VMA_LOCK_OFFSET 0x40000000 +typedef struct { unsigned long v; } freeptr_t; + struct vm_area_struct { /* The first cache line has the info for VMA tree walking. */ @@ -264,9 +270,7 @@ struct vm_area_struct { unsigned long vm_start; unsigned long vm_end; }; -#ifdef CONFIG_PER_VMA_LOCK - struct rcu_head vm_rcu; /* Used for deferred freeing. */ -#endif + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ }; struct mm_struct *vm_mm; /* The address space we belong to. */ @@ -463,6 +467,65 @@ struct pagetable_move_control { .len_in = len_, \ } +struct kmem_cache_args { + /** + * @align: The required alignment for the objects. + * + * %0 means no specific alignment is requested. 
+ */ + unsigned int align; + /** + * @useroffset: Usercopy region offset. + * + * %0 is a valid offset, when @usersize is non-%0 + */ + unsigned int useroffset; + /** + * @usersize: Usercopy region size. + * + * %0 means no usercopy region is specified. + */ + unsigned int usersize; + /** + * @freeptr_offset: Custom offset for the free pointer + * in &SLAB_TYPESAFE_BY_RCU caches + * + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer + * outside of the object. This might cause the object to grow in size. + * Cache creators that have a reason to avoid this can specify a custom + * free pointer offset in their struct where the free pointer will be + * placed. + * + * Note that placing the free pointer inside the object requires the + * caller to ensure that no fields are invalidated that are required to + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for + * details). + * + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset + * is specified, %use_freeptr_offset must be set %true. + * + * Note that @ctor currently isn't supported with custom free pointers + * as a @ctor requires an external free pointer. + */ + unsigned int freeptr_offset; + /** + * @use_freeptr_offset: Whether a @freeptr_offset is used. + */ + bool use_freeptr_offset; + /** + * @ctor: A constructor for the objects. + * + * The constructor is invoked for each object in a newly allocated slab + * page. It is the cache user's responsibility to free object in the + * same state as after calling the constructor, or deal appropriately + * with any differences between a freshly constructed and a reallocated + * object. + * + * %NULL means no constructor. + */ + void (*ctor)(void *); +}; + static inline void vma_iter_invalidate(struct vma_iterator *vmi) { mas_pause(&vmi->mas); @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_lock_seq = UINT_MAX; } -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) +struct kmem_cache { + const char *name; + size_t object_size; + struct kmem_cache_args *args; +}; + +static inline struct kmem_cache *__kmem_cache_create(const char *name, + size_t object_size, + struct kmem_cache_args *args) { - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); - if (!vma) - return NULL; + ret->name = name; + ret->object_size = object_size; + ret->args = args; - vma_init(vma, mm); - - return vma; + return ret; } -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) +#define kmem_cache_create(__name, __object_size, __args, ...) 
\
+	__kmem_cache_create((__name), (__object_size), (__args))
+
+static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct));
+	(void)gfpflags;

-	if (!new)
-		return NULL;
+	return calloc(s->object_size, 1);
+}

-	memcpy(new, orig, sizeof(*new));
-	refcount_set(&new->vm_refcnt, 0);
-	new->vm_lock_seq = UINT_MAX;
-	INIT_LIST_HEAD(&new->anon_vma_chain);
-
-	return new;
+static inline void kmem_cache_free(struct kmem_cache *s, void *x)
+{
+	free(x);
 }

 /*
@@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *)
 {
 }

-static inline void vm_area_free(struct vm_area_struct *vma)
-{
-	free(vma);
-}
-
 static inline void lru_add_drain(void)
 {
 }
@@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm)
 	(void)mm;
 }

+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
+{
+	(void)vma;
+	(void)reset_refcnt;
+}
+
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
+static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
+				     struct vm_area_struct *new_vma)
+{
+	(void)orig_vma;
+	(void)new_vma;
+}
+
+static inline void free_anon_vma_name(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */

From 69eadd6a05409ca3725cabf8d60ccf6c8f87e193 Mon Sep 17 00:00:00 2001
From: Guilherme Giacomo Simoes
Date: Mon, 28 Apr 2025 17:14:09 -0300
Subject: [PATCH 216/285] mm: page-flags-layout.h: change the KASAN_TAG_WIDTH for HW_TAGS

KASAN_TAG_WIDTH is 8 bits for both modes (HW_TAGS and SW_TAGS), but for
HW_TAGS the KASAN_TAG_WIDTH can be 4 bits because, due to the design of
the MTE, the memory words for storing metadata only need 4 bits.

Change the KASAN_TAG_WIDTH preprocessor define to check whether SW_TAGS
is defined, in which case KASAN_TAG_WIDTH should be 8 bits, or HW_TAGS is
defined, in which case KASAN_TAG_WIDTH should be 4 bits, to save a few
flag bits.

Link: https://lkml.kernel.org/r/20250428201409.5482-1-trintaeoitogc@gmail.com
Signed-off-by: Guilherme Giacomo Simoes
Suggested-by: Andrey Konovalov
Reviewed-by: Andrey Konovalov
Cc: Pasha Tatashin
Cc: Suren Baghdasaryan
Signed-off-by: Andrew Morton
---
 include/linux/page-flags-layout.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 4f5c9e979bb9..760006b1c480 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -72,8 +72,10 @@
 #define NODE_NOT_IN_PAGE_FLAGS 1
 #endif

-#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
+#if defined(CONFIG_KASAN_SW_TAGS)
 #define KASAN_TAG_WIDTH 8
+#elif defined(CONFIG_KASAN_HW_TAGS)
+#define KASAN_TAG_WIDTH 4
 #else
 #define KASAN_TAG_WIDTH 0
 #endif

From 3592a86a2b6be115000b82af78fe7f96fbc658a4 Mon Sep 17 00:00:00 2001
From: Gregory Price
Date: Thu, 10 Apr 2025 10:28:31 -0400
Subject: [PATCH 217/285] DAX: warn when kmem regions are truncated for memory block alignment

Device capacity intended for use as system ram should be aligned to the
architecture-defined memory block size, or that capacity will be silently
truncated and stranded.  As hotplug dax memory becomes more prevalent,
the memory block size alignment becomes more important for platform and
device vendors to pay attention to - so this truncation should not be
silent.
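As an illustrative sketch only (not part of this patch), the truncation
being warned about amounts to rounding the range length down to the
memory block size; memory_block_size_bytes() and ALIGN_DOWN() are the
existing kernel helpers, and range_len here merely stands in for a
region's byte length:

	/* Hedged sketch: how much capacity misalignment strands. */
	unsigned long block = memory_block_size_bytes(); /* e.g. 128M on many x86-64 systems */
	unsigned long usable = ALIGN_DOWN(range_len, block);
	unsigned long lost = range_len - usable; /* silently stranded before this patch */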
This issue is particularly relevant for CXL Dynamic Capacity devices,
whose capacity may arrive in spec-aligned but block-misaligned chunks.

Link: https://lkml.kernel.org/r/20250410142831.217887-1-gourry@gourry.net
Suggested-by: David Hildenbrand
Suggested-by: Dan Williams
Reviewed-by: Dan Williams
Tested-by: Alison Schofield
Reviewed-by: Jonathan Cameron
Acked-by: David Hildenbrand
Signed-off-by: Gregory Price
Cc: Dave Jiang
Cc: Vishal Verma
Signed-off-by: Andrew Morton
---
 drivers/dax/kmem.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index e97d47f42ee2..584c70a34b52 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include "dax-private.h"
 #include "bus.h"
@@ -68,7 +69,7 @@ static void kmem_put_memory_types(void)
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
-	unsigned long total_len = 0;
+	unsigned long total_len = 0, orig_len = 0;
 	struct dax_kmem_data *data;
 	struct memory_dev_type *mtype;
 	int i, rc, mapped = 0;
@@ -97,6 +98,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;

+		orig_len += range_len(&dev_dax->ranges[i].range);
 		rc = dax_kmem_range(dev_dax, i, &range);
 		if (rc) {
 			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
@@ -109,6 +111,12 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (!total_len) {
 		dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
 		return -EINVAL;
+	} else if (total_len != orig_len) {
+		char buf[16];
+
+		string_get_size(orig_len - total_len, 1, STRING_UNITS_2,
+				buf, sizeof(buf));
+		dev_warn(dev, "DAX region truncated by %s due to alignment\n", buf);
 	}

 	init_node_memory_type(numa_node, mtype);

From 5ec56c1cb651eb70ab2cfc92264c82c3a6d584dc Mon Sep 17 00:00:00 2001
From: "Thushara.M.S"
Date: Mon, 5 May 2025 14:29:12 -0700
Subject: [PATCH 218/285] docs/mm/damon/design: fix spelling mistake

The word accuracy was misspelled as "accruracy".

Signed-off-by: Thushara.M.S
Reviewed-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/design.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 728bf5754343..ddc50db3afa4 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -54,7 +54,7 @@ monitoring are address-space dependent.
 DAMON consolidates these implementations in a layer called DAMON Operations
 Set, and defines the interface between it and the upper layer.  The upper
 layer is dedicated for DAMON's core logics including the mechanism for control of the
-monitoring accruracy and the overhead.
+monitoring accuracy and the overhead.

 Hence, DAMON can easily be extended for any address space and/or available
 hardware features by configuring the core logic to use the appropriate

From f1c2bca2677b78c66edd1db010fd79d4a08e7146 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Wed, 7 May 2025 07:16:20 +0200
Subject: [PATCH 219/285] xarray: fix kerneldoc for __xa_cmpxchg

Fix the documentation for __xa_cmpxchg to actually describe the
cmpxchg-like semantics correctly, based on the version for xa_cmpxchg.
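For illustration (a hedged usage sketch, not taken from this patch), the
cmpxchg-like semantics being documented look like this for a caller that
already holds the xa_lock; `array`, `index`, `old`, `new` and `err` are
illustrative names:

	/* Replace the entry at @index only if it currently equals @old. */
	void *curr;

	xa_lock(&array);
	curr = __xa_cmpxchg(&array, index, old, new, GFP_KERNEL);
	if (xa_is_err(curr))
		err = xa_err(curr);	/* e.g. allocation failure */
	else if (curr == old)
		;			/* exchange was successful */
	xa_unlock(&array);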
Link: https://lkml.kernel.org/r/20250507051656.3900864-1-hch@lst.de
Signed-off-by: Christoph Hellwig
Cc: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
---
 lib/xarray.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/xarray.c b/lib/xarray.c
index 9644b18af18d..76dde3a1cacf 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1742,20 +1742,23 @@ static inline void *__xa_cmpxchg_raw(struct xarray *xa, unsigned long index,
 		void *old, void *entry, gfp_t gfp);

 /**
- * __xa_cmpxchg() - Store this entry in the XArray.
+ * __xa_cmpxchg() - Conditionally replace an entry in the XArray.
  * @xa: XArray.
  * @index: Index into array.
  * @old: Old value to test against.
- * @entry: New entry.
+ * @entry: New value to place in array.
  * @gfp: Memory allocation flags.
  *
  * You must already be holding the xa_lock when calling this function.
  * It will drop the lock if needed to allocate memory, and then reacquire
  * it afterwards.
  *
+ * If the entry at @index is the same as @old, replace it with @entry.
+ * If the return value is equal to @old, then the exchange was successful.
+ *
  * Context: Any context.  Expects xa_lock to be held on entry.  May
  * release and reacquire xa_lock if @gfp flags permit.
- * Return: The old entry at this index or xa_err() if an error happened.
+ * Return: The old value at this index or xa_err() if an error happened.
  */
 void *__xa_cmpxchg(struct xarray *xa, unsigned long index,
 			void *old, void *entry, gfp_t gfp)

From 74e6ee62a894be418ca80f3298069c2eae704cc5 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Thu, 1 May 2025 02:10:47 +0800
Subject: [PATCH 220/285] fuse: drop usage of folio_index

Patch series "mm, swap: clean up swap cache mapping helper", v3.

This series removes the usage of folio_index in fs/, and removes the swap
cache check from folio_contains.

Currently, the swap cache is already no longer directly exposed to fs,
and the swap cache will become more different from the page cache.  Clean
up the helpers first to simplify the code, eliminate the helpers used for
resolving the circular header dependency issue between filemap and swap
headers, and prepare for further changes.

This patch (of 6):

folio_index is only needed for mixed usage of page cache and swap cache;
for pure page cache usage, the caller can just use folio->index instead.

It can't be a swap cache folio here.  Swap mapping may only call into fs
through `swap_rw` but fuse does not use that method for SWAP.
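For reference, the helper being bypassed reduces to a plain folio->index
read whenever the folio is not in the swap cache - the only branch fuse
folios can take (this is the existing helper, shown again later in this
series when it is moved):

	static inline pgoff_t folio_index(struct folio *folio)
	{
		if (unlikely(folio_test_swapcache(folio)))
			return __folio_swap_cache_index(folio);
		return folio->index;	/* always taken for page-cache-only users */
	}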
Link: https://lkml.kernel.org/r/20250430181052.55698-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250430181052.55698-2-ryncsn@gmail.com Signed-off-by: Kairui Song Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Cc: Miklos Szeredi Cc: Joanne Koong Cc: Josef Bacik Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Nhat Pham Cc: Yosry Ahmed Cc: Christian Brauner Cc: Chao Yu Cc: Chris Mason Cc: David Sterba Cc: Jaegeuk Kim Cc: Qu Wenruo Signed-off-by: Andrew Morton --- fs/fuse/file.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 754378dd9f71..6f19a4daa559 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -487,7 +487,7 @@ static inline bool fuse_folio_is_writeback(struct inode *inode, struct folio *folio) { pgoff_t last = folio_next_index(folio) - 1; - return fuse_range_is_writeback(inode, folio_index(folio), last); + return fuse_range_is_writeback(inode, folio->index, last); } static void fuse_wait_on_folio_writeback(struct inode *inode, @@ -2349,7 +2349,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, return true; /* Discontinuity */ - if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio)) + if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio->index) return true; /* Need to grow the pages array? If so, did the expansion fail? */ From fe15ec0464319966c289b50118c3a3995435be52 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Thu, 1 May 2025 02:10:49 +0800 Subject: [PATCH 221/285] f2fs: drop usage of folio_index folio_index is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw` but f2fs does not use that method for swap. 
Link: https://lkml.kernel.org/r/20250430181052.55698-4-ryncsn@gmail.com
Signed-off-by: Kairui Song
Reviewed-by: Matthew Wilcox (Oracle)
Reviewed-by: David Hildenbrand
Reviewed-by: Chao Yu
Cc: Jaegeuk Kim
Cc: Chris Li
Cc: Chris Mason
Cc: Christian Brauner
Cc: David Sterba
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Joanne Koong
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Miklos Szeredi
Cc: Nhat Pham
Cc: Qu Wenruo
Cc: Yosry Ahmed
Signed-off-by: Andrew Morton
---
 fs/f2fs/data.c   | 4 ++--
 fs/f2fs/inline.c | 4 ++--
 fs/f2fs/super.c  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 54f89f0ee69b..5745b97ca1f0 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2077,7 +2077,7 @@ static int f2fs_read_single_page(struct inode *inode, struct folio *folio,
 	sector_t last_block;
 	sector_t last_block_in_file;
 	sector_t block_nr;
-	pgoff_t index = folio_index(folio);
+	pgoff_t index = folio->index;
 	int ret = 0;

 	block_in_file = (sector_t)index;
@@ -2392,7 +2392,7 @@ static int f2fs_mpage_readpages(struct inode *inode,
 		}

 #ifdef CONFIG_F2FS_FS_COMPRESSION
-		index = folio_index(folio);
+		index = folio->index;

 		if (!f2fs_compressed_file(inode))
 			goto read_single_page;
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index ad92e9008781..aaaec3206538 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -86,7 +86,7 @@ void f2fs_do_read_inline_data(struct folio *folio, struct page *ipage)
 	if (folio_test_uptodate(folio))
 		return;

-	f2fs_bug_on(F2FS_I_SB(inode), folio_index(folio));
+	f2fs_bug_on(F2FS_I_SB(inode), folio->index);

 	folio_zero_segment(folio, MAX_INLINE_DATA(inode), folio_size(folio));
@@ -130,7 +130,7 @@ int f2fs_read_inline_data(struct inode *inode, struct folio *folio)
 		return -EAGAIN;
 	}

-	if (folio_index(folio))
+	if (folio->index)
 		folio_zero_segment(folio, 0, folio_size(folio));
 	else
 		f2fs_do_read_inline_data(folio, ipage);
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index f087b2b71c89..eac1dcb44637 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3432,7 +3432,7 @@ static int __f2fs_commit_super(struct f2fs_sb_info *sbi, struct folio *folio,
 	bio = bio_alloc(sbi->sb->s_bdev, 1, opf, GFP_NOFS);

 	/* it doesn't need to set crypto context for superblock update */
-	bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(folio_index(folio));
+	bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(folio->index);

 	if (!bio_add_folio(bio, folio, folio_size(folio), 0))
 		f2fs_bug_on(sbi, 1);

From 2b80f633c360ce3c56a7071782eae70852c8f344 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Thu, 1 May 2025 02:10:50 +0800
Subject: [PATCH 222/285] filemap: do not use folio_contains for swap cache folios

Currently, none of the folio_contains callers should encounter swap cache
folios.

For fs/ callers, swap cache folios are never part of their workflow.  For
filemap and truncate, folio_contains is only used for sanity checks to
verify the folio index matches the expected lookup / invalidation target.
The swap cache does not utilize filemap or truncate helpers in ways that
would trigger these checks, as it mostly implements its own cache
management.

Shmem won't trigger these sanity checks either unless something went
wrong, as it would directly trigger a BUG because swap cache indices are
unrelated to, and almost never match, shmem indices.  Shmem has to handle
mixed values of folios, shadows, and swap entries, so it has its own way
of handling the mapping.

While some filemap helpers work for the swap cache space, the swap cache
is different from the page cache in many ways.
So this particular helper is unlikely to work in a helpful way for swap
cache folios.  So make it explicit here that folio_contains should not be
used for swap cache folios.  This helps to avoid misuse, makes the swap
cache less exposed, and removes the folio_index usage here.

[akpm@linux-foundation.org: s/VM_WARN_ON_FOLIO/VM_WARN_ON_ONCE_FOLIO/, per Kairui]
Link: https://lkml.kernel.org/r/20250430181052.55698-5-ryncsn@gmail.com
Signed-off-by: Kairui Song
Acked-by: David Hildenbrand
Cc: Chao Yu
Cc: Chris Li
Cc: Chris Mason
Cc: Christian Brauner
Cc: David Sterba
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Jaegeuk Kim
Cc: Joanne Koong
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Matthew Wilcox (Oracle)
Cc: Miklos Szeredi
Cc: Nhat Pham
Cc: Qu Wenruo
Cc: Yosry Ahmed
Signed-off-by: Andrew Morton
---
 include/linux/pagemap.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index af25fb640463..b2c2ff8de046 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -935,14 +935,14 @@ static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
  * @folio: The folio.
  * @index: The page index within the file.
  *
- * Context: The caller should have the page locked in order to prevent
- * (eg) shmem from moving the page between the page cache and swap cache
- * and changing its index in the middle of the operation.
+ * Context: The caller should have the folio locked and ensure
+ * e.g., shmem did not move this folio to the swap cache.
  * Return: true or false.
  */
 static inline bool folio_contains(struct folio *folio, pgoff_t index)
 {
-	return index - folio_index(folio) < folio_nr_pages(folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+	return index - folio->index < folio_nr_pages(folio);
 }

 unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,

From 7d0f0f06153116bb737eef76d47406639e161613 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Thu, 1 May 2025 02:10:51 +0800
Subject: [PATCH 223/285] mm: move folio_index to mm/swap.h and remove no longer needed helper

There are no remaining users of folio_index() outside the mm subsystem.
Move it to mm/swap.h to co-locate it with swap_cache_index(), eliminating
a forward declaration and a function call overhead.  Also remove the
helper that was used to fix the circular header dependency issue.

Link: https://lkml.kernel.org/r/20250430181052.55698-6-ryncsn@gmail.com
Signed-off-by: Kairui Song
Acked-by: David Hildenbrand
Cc: Chao Yu
Cc: Chris Li
Cc: Chris Mason
Cc: Christian Brauner
Cc: David Sterba
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Jaegeuk Kim
Cc: Joanne Koong
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Matthew Wilcox (Oracle)
Cc: Miklos Szeredi
Cc: Nhat Pham
Cc: Qu Wenruo
Cc: Yosry Ahmed
Signed-off-by: Andrew Morton
---
 include/linux/pagemap.h | 20 --------------------
 mm/gup.c                |  1 +
 mm/memfd.c              |  1 +
 mm/migrate.c            |  1 +
 mm/page-writeback.c     |  1 +
 mm/swap.h               | 18 ++++++++++++++++++
 mm/swapfile.c           |  6 ------
 7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b2c2ff8de046..5d66786867eb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -884,26 +884,6 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
 			mapping_gfp_mask(mapping));
 }

-extern pgoff_t __folio_swap_cache_index(struct folio *folio);
-
-/**
- * folio_index - File index of a folio.
- * @folio: The folio.
- * - * For a folio which is either in the page cache or the swap cache, - * return its index within the address_space it belongs to. If you know - * the page is definitely in the page cache, you can look at the folio's - * index directly. - * - * Return: The index (offset in units of pages) of a folio in its file. - */ -static inline pgoff_t folio_index(struct folio *folio) -{ - if (unlikely(folio_test_swapcache(folio))) - return __folio_swap_cache_index(folio); - return folio->index; -} - /** * folio_next_index - Get the index of the next folio. * @folio: The current folio. diff --git a/mm/gup.c b/mm/gup.c index f32168339390..91bbf57579f0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -26,6 +26,7 @@ #include #include "internal.h" +#include "swap.h" struct follow_page_context { struct dev_pagemap *pgmap; diff --git a/mm/memfd.c b/mm/memfd.c index c64df1343059..ab367e61553d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -20,6 +20,7 @@ #include #include #include +#include "swap.h" /* * We need a tag: a new tag would expand every xa_node by 8 bytes, diff --git a/mm/migrate.c b/mm/migrate.c index 273d46771a6c..784ac2256d08 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -50,6 +50,7 @@ #include #include "internal.h" +#include "swap.h" bool isolate_movable_page(struct page *page, isolate_mode_t mode) { diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 20e1d76f1eba..9ff44b64d3d6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -41,6 +41,7 @@ #include #include "internal.h" +#include "swap.h" /* * Sleep at most 200ms at a time in balance_dirty_pages(). diff --git a/mm/swap.h b/mm/swap.h index 6f4a3f927edb..521bf510ec75 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -201,4 +201,22 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, #endif /* CONFIG_SWAP */ +/** + * folio_index - File index of a folio. + * @folio: The folio. + * + * For a folio which is either in the page cache or the swap cache, + * return its index within the address_space it belongs to. If you know + * the folio is definitely in the page cache, you can look at the folio's + * index directly. + * + * Return: The index (offset in units of pages) of a folio in its file. + */ +static inline pgoff_t folio_index(struct folio *folio) +{ + if (unlikely(folio_test_swapcache(folio))) + return swap_cache_index(folio->swap); + return folio->index; +} + #endif /* _MM_SWAP_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index b86637cfb17a..9fe58284079d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3671,12 +3671,6 @@ struct address_space *swapcache_mapping(struct folio *folio) } EXPORT_SYMBOL_GPL(swapcache_mapping); -pgoff_t __folio_swap_cache_index(struct folio *folio) -{ - return swap_cache_index(folio->swap); -} -EXPORT_SYMBOL_GPL(__folio_swap_cache_index); - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's From dd309bfc68efc4522b4d33a95e92afff249e103b Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Thu, 1 May 2025 02:10:52 +0800 Subject: [PATCH 224/285] mm, swap: remove no longer used swap mapping helper This helper existed to fix the circular header dependency issue but it is no longer used since commit 0d40cfe63a2f ("fs: remove folio_file_mapping()"), remove it. 
Link: https://lkml.kernel.org/r/20250430181052.55698-7-ryncsn@gmail.com Signed-off-by: Kairui Song Reviewed-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Chao Yu Cc: Chris Li Cc: Chris Mason Cc: Christian Brauner Cc: David Sterba Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Jaegeuk Kim Cc: Joanne Koong Cc: Johannes Weiner Cc: Josef Bacik Cc: Miklos Szeredi Cc: Nhat Pham Cc: Qu Wenruo Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 1 - mm/swapfile.c | 9 --------- 2 files changed, 10 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 5d66786867eb..d2ced9920992 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -533,7 +533,6 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping) } struct address_space *folio_mapping(struct folio *); -struct address_space *swapcache_mapping(struct folio *); /** * folio_flush_mapping - Find the file mapping this folio belongs to. diff --git a/mm/swapfile.c b/mm/swapfile.c index 9fe58284079d..026090bf3efe 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3662,15 +3662,6 @@ struct swap_info_struct *swp_swap_info(swp_entry_t entry) return swap_type_to_swap_info(swp_type(entry)); } -/* - * out-of-line methods to avoid include hell. - */ -struct address_space *swapcache_mapping(struct folio *folio) -{ - return swp_swap_info(folio->swap)->swap_file->f_mapping; -} -EXPORT_SYMBOL_GPL(swapcache_mapping); - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's From 30f62b92e632d13420381f54210ad9d84d79b209 Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Tue, 29 Apr 2025 18:00:58 -0700 Subject: [PATCH 225/285] mm/gup: remove unnecessary check in memfd_pin_folios() Patch series "mm/gup: Cleanup memfd_pin_folios()". A couple straightforward cleanups to memfd_pin_folios() found through code inspection. Saves 124 bytes of kernel text overall and makes the code more readable. This patch (of 2): Commit 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") checks if filemap_get_folios_contig() returned duplicate folios to prevent multiple attempts at pinning the same folio. Commit 8ab1b1602396 ("mm: fix filemap_get_folios_contig returning batches of identical folios") ensures that filemap_get_folios_contig() returns a batch of distinct folios. We can remove the duplicate folio check to simplify the code and save 58 bytes of text. 
Link: https://lkml.kernel.org/r/20250430010059.892632-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20250430010059.892632-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle)
Acked-by: David Hildenbrand
Cc: Vivek Kasireddy
Signed-off-by: Andrew Morton
---
 mm/gup.c | 15 +--------------
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 91bbf57579f0..e6e2a9386850 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3590,7 +3590,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 {
 	unsigned int flags, nr_folios, nr_found;
 	unsigned int i, pgshift = PAGE_SHIFT;
-	pgoff_t start_idx, end_idx, next_idx;
+	pgoff_t start_idx, end_idx;
 	struct folio *folio = NULL;
 	struct folio_batch fbatch;
 	struct hstate *h;
@@ -3640,19 +3640,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 			folio = NULL;
 		}

-		next_idx = 0;
 		for (i = 0; i < nr_found; i++) {
-			/*
-			 * As there can be multiple entries for a
-			 * given folio in the batch returned by
-			 * filemap_get_folios_contig(), the below
-			 * check is to ensure that we pin and return a
-			 * unique set of folios between start and end.
-			 */
-			if (next_idx &&
-			    next_idx != folio_index(fbatch.folios[i]))
-				continue;
-
 			folio = page_folio(&fbatch.folios[i]->page);

 			if (try_grab_folio(folio, 1, FOLL_PIN)) {
@@ -3665,7 +3653,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 				*offset = offset_in_folio(folio, start);

 			folios[nr_folios] = folio;
-			next_idx = folio_next_index(folio);

 			if (++nr_folios == max_folios)
 				break;
 		}

From fe488d34edc4042f70d62dfd5d640c9fb11fc84e Mon Sep 17 00:00:00 2001
From: "Vishal Moola (Oracle)"
Date: Tue, 29 Apr 2025 18:00:59 -0700
Subject: [PATCH 226/285] mm/gup: remove page_folio() in memfd_pin_folios()

We can get the folio directly from the folio batch, so remove the
unnecessary page_folio() call.

Link: https://lkml.kernel.org/r/20250430010059.892632-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle)
Acked-by: David Hildenbrand
Acked-by: Vivek Kasireddy
Signed-off-by: Andrew Morton
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index e6e2a9386850..d3aac58862c0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3641,7 +3641,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 		}

 		for (i = 0; i < nr_found; i++) {
-			folio = page_folio(&fbatch.folios[i]->page);
+			folio = fbatch.folios[i];

 			if (try_grab_folio(folio, 1, FOLL_PIN)) {
 				folio_batch_release(&fbatch);

From fa6b8b5d9f97778bc44c8a9fe33a0e4b8fae5f92 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Thu, 1 May 2025 21:04:42 -0400
Subject: [PATCH 227/285] selftests: memcg: allow low event with no memory.low and memory_recursiveprot on
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch series "memcg: Fix test_memcg_min/low test failures", v8.

The test_memcontrol selftest consistently fails its test_memcg_low
sub-test (with memory_recursiveprot enabled) and sporadically fails its
test_memcg_min sub-test.  This patchset fixes the test_memcg_min and
test_memcg_low failures by adjusting the test_memcontrol selftest
accordingly.

This patch (of 8):

The test_memcontrol selftest consistently fails its test_memcg_low
sub-test because its 3rd test child cgroup, which has a memory.low of 0,
has a non-zero low event count.  This happens when the
memory_recursiveprot mount option is enabled, which is the default
setting used by systemd to mount the cgroup2 filesystem.
This issue was originally fixed by commit cdc69458a5f3 ("cgroup: account
for memory_recursiveprot in test_memcg_low()").  It was later reverted by
commit 1d09069f5313 ("selftests: memcg: expect no low events in
unprotected sibling") expecting that the memory reclaim code would be
fixed.  However, it turns out the unprotected cgroup may still have some
residual effective memory.low protection depending on the memory.low
settings in its parent and its siblings.  As a result, low events may
still be triggered.

One way to fix the test failure is to revert the revert commit.  However,
Michal suggested that it might be better to ignore the low event count
with memory_recursiveprot enabled, as the low event may or may not happen
depending on the actual test configuration.

Modify test_memcontrol.c to ignore the low event count in the 3rd child
cgroup with memory_recursiveprot on.

The 4th child cgroup has no memory usage and so has an effective low of
0.  It has no low event count because the mem_cgroup_below_low() check in
shrink_node_memcgs() is skipped as mem_cgroup_below_min() returns true.
If we ever change mem_cgroup_below_min() in such a way that it no longer
skips the no usage case, we will have to add code to explicitly skip it.

With this patch applied, the test_memcg_low sub-test finishes
successfully without failure in most cases, though both test_memcg_low
and test_memcg_min sub-tests may still fail occasionally if the
memory.current values fall outside of the expected ranges.

Link: https://lkml.kernel.org/r/20250502010443.106022-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20250502010443.106022-2-longman@redhat.com
Signed-off-by: Waiman Long
Suggested-by: Michal Koutný
Acked-by: Michal Koutný
Acked-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Muchun Song
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Shuah Khan
Cc: Waiman Long
Signed-off-by: Andrew Morton
---
 .../testing/selftests/cgroup/test_memcontrol.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 16f5d74ae762..58602c1831f1 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -380,10 +380,11 @@ static bool reclaim_until(const char *memcg, long goal);
 *
 * Then it checks actual memory usages and expects that:
 * A/B    memory.current ~= 50M
- * A/B/C  memory.current ~= 29M
- * A/B/D  memory.current ~= 21M
- * A/B/E  memory.current ~= 0
- * A/B/F  memory.current  = 0
+ * A/B/C  memory.current ~= 29M [memory.events:low > 0]
+ * A/B/D  memory.current ~= 21M [memory.events:low > 0]
+ * A/B/E  memory.current ~= 0   [memory.events:low == 0 if !memory_recursiveprot,
+ *                               undefined otherwise]
+ * A/B/F  memory.current  = 0   [memory.events:low == 0]
 * (for origin of the numbers, see model in memcg_protection.m.)
 *
 * After that it tries to allocate more than there is
@@ -525,7 +526,14 @@ static int test_memcg_protection(const char *root, bool min)
 			goto cleanup;
 	}

+	/*
+	 * Child 2 has memory.low=0, but some low protection may still be
+	 * distributed down from its parent with memory.low=50M if cgroup2
+	 * memory_recursiveprot mount option is enabled. Ignore the low
+	 * event count in this case.
+	 */
 	for (i = 0; i < ARRAY_SIZE(children); i++) {
+		int ignore_low_events_index = has_recursiveprot ?
2 : -1;
		int no_low_events_index = 1;
		long low, oom;
@@ -534,6 +542,8 @@
 		if (oom)
 			goto cleanup;
+		if (i == ignore_low_events_index)
+			continue;
 		if (i <= no_low_events_index && low <= 0)
 			goto cleanup;
 		if (i > no_low_events_index && low)

From d2def68ae06ab6a1f38bc1a2449b06ee4f108412 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Thu, 1 May 2025 21:04:43 -0400
Subject: [PATCH 228/285] selftests: memcg: increase error tolerance of child memory.current check in test_memcg_protection()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The test_memcg_protection() function is used for the test_memcg_min and
test_memcg_low sub-tests.  This function generates a set of parent/child
cgroups like:

  parent:  memory.min/low = 50M
  child 0: memory.min/low = 75M, memory.current = 50M
  child 1: memory.min/low = 25M, memory.current = 50M
  child 2: memory.min/low = 0,   memory.current = 50M

After applying memory pressure, the function expects the following actual
memory usages:

  parent:  memory.current ~= 50M
  child 0: memory.current ~= 29M
  child 1: memory.current ~= 21M
  child 2: memory.current ~= 0

In reality, the actual memory usages can differ quite a bit from the
expected values.  It uses an error tolerance of 10% with the
values_close() helper.

Both the test_memcg_min and test_memcg_low sub-tests can fail
sporadically because the actual memory usage exceeds the 10% error
tolerance.  Below is a sample of the usage data from the test runs that
fail:

  Child  Actual usage  Expected usage  %err
  -----  ------------  --------------  ----
    1      16990208       22020096    -12.9%
    1      17252352       22020096    -12.1%
    0      37699584       30408704    +10.7%
    1      14368768       22020096    -21.0%
    1      16871424       22020096    -13.2%

The current 10% error tolerance might have been right at the time
test_memcontrol.c was first introduced in the v4.18 kernel, but memory
reclaim has certainly evolved quite a bit since then, which may result in
a bit more run-to-run variation than previously expected.

Increase the error tolerance to 15% for child 0 and 20% for child 1 to
minimize the chance of this type of failure.  The tolerance is bigger for
child 1 because an upswing in child 0 corresponds to a smaller %err than
a similar downswing in child 1 due to the way %err is used in
values_close().

Before this patch, 100 test runs of test_memcontrol produced the
following results:

  17 not ok 1 test_memcg_min
  22 not ok 2 test_memcg_low

After applying this patch, there were no test failures for test_memcg_min
and test_memcg_low in 100 test runs.  However, these tests may still fail
once in a while if the memory usage goes beyond the newly extended range.
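For context, the comparison these tolerances feed into has roughly the
following shape (a sketch of the selftests' values_close() helper as
described above; assumed, not part of this diff):

	static inline bool values_close(long a, long b, int err)
	{
		/* |a - b| must be within err percent of the combined magnitude */
		return labs(a - b) <= (a + b) / 100 * err;
	}

Because the bound scales with (a + b), an actual usage above the expected
value enlarges the denominator and yields a smaller %err than an equally
sized shortfall below it, which is why child 1 needs the larger tolerance.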
Link: https://lkml.kernel.org/r/20250502010443.106022-3-longman@redhat.com
Signed-off-by: Waiman Long
Acked-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Michal Koutný
Cc: Muchun Song
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 58602c1831f1..d6534d7301a2 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -496,10 +496,10 @@ static int test_memcg_protection(const char *root, bool min)
 	for (i = 0; i < ARRAY_SIZE(children); i++)
 		c[i] = cg_read_long(children[i], "memory.current");

-	if (!values_close(c[0], MB(29), 10))
+	if (!values_close(c[0], MB(29), 15))
 		goto cleanup;

-	if (!values_close(c[1], MB(21), 10))
+	if (!values_close(c[1], MB(21), 20))
 		goto cleanup;

 	if (c[3] != 0)

From c84bf6dd2b836b49bb2662668ff1692350d28236 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes
Date: Fri, 9 May 2025 13:13:34 +0100
Subject: [PATCH 229/285] mm: introduce new .mmap_prepare() file callback

Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2.

During the mmap() of a file-backed mapping, we invoke the underlying
driver file's mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.

This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.

However, one fly in the ointment has remained - drivers have a great deal
of freedom in the .mmap() hook to manipulate VMA state (as well as page
table state).

This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).

In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and they may do so
and then raise an error, requiring very careful unwinding of state about
which we can make no assumptions.

Rather than providing such an open-ended interface, this series provides
an alternative, far more restrictive one - we expose a whitelist of
fields which can be adjusted by the driver, along with immutable state
upon which the driver can make such decisions:

	struct vm_area_desc {
		/* Immutable state. */
		struct mm_struct *mm;
		unsigned long start;
		unsigned long end;

		/* Mutable fields. Populated with initial state. */
		pgoff_t pgoff;
		struct file *file;
		vm_flags_t vm_flags;
		pgprot_t page_prot;

		/* Write-only fields. */
		const struct vm_operations_struct *vm_ops;
		void *private_data;
	};

The mmap logic then updates the state used to either merge with a VMA or
establish a new VMA based upon this logic.

This is achieved via the new file hook .mmap_prepare(), which is,
importantly, invoked very early on in the mmap() process.

If an error arises, we can very simply abort the operation with very
little unwinding of state required.

The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.
Right now the logic will retry a merge like this only if the driver changes VMA flags, and changes them in such a way that a merge might succeed (that is, the flags are not 'special', i.e. they do not contain any of the flags specified in VM_SPECIAL). This has also been the source of a great deal of pain - it's hard to reason about an .mmap() callback that might do - anything - but it's also hard to reason about setting up a VMA and writing to the maple tree, only to do it again utilising a great deal of shared state. Since .mmap_prepare() sets fields before the first merge is even attempted, the use of this callback obviates the need for this retry merge logic. A driver may only specify .mmap_prepare() or the deprecated .mmap() callback. In future we may add further callbacks beyond .mmap_prepare() to facilitate all use cases as we convert drivers. In researching this change, I examined every .mmap() callback, and discovered only a very few that set VMA state in such a way that (a) the VMA flags changed and (b) this would be mergeable. In the majority of cases, it turns out that drivers are mapping kernel memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable VM_SPECIAL flags. Of those that remain, I identified a number of cases which are only applicable in DAX, setting the VM_HUGEPAGE flag: * dax_mmap() * erofs_file_mmap() * ext4_file_mmap() * xfs_file_mmap() For this remerge to not occur and to impact users, each of these cases would require a user to mmap() files using DAX, in parts, immediately adjacent to one another. This is a very unlikely use case and so it does not appear to be worthwhile to adjust this functionality accordingly. We can, however, very quickly do so if needed by simply adding an .mmap_prepare() callback to these as required. There are two further non-DAX cases I identified: * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing it with VM_SEQ_READ. * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP. Both of these cases again seem very unlikely to be mmap()'d immediately adjacent to one another in a fashion that would result in a merge. Finally, we are left with a viable case: * secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP. This is viable enough that the mm selftests trigger the logic as a matter of course. Therefore, this series replaces the secretmem_mmap() hook with secretmem_mmap_prepare(). This patch (of 3): Provide a means by which drivers can specify which of the fields permitted to be changed should be altered prior to mmap()'ing a range (which may either result from a merge or from mapping an entirely new VMA). Doing so is substantially safer than the existing .mmap() callback, which provides unrestricted access to the part-constructed VMA and permits drivers and file systems to do 'creative' things which make it hard to reason about the state of the VMA after the function returns. The existing .mmap() callback's freedom has caused a great deal of issues, especially in error handling, as unwinding the mmap() state has proven to be non-trivial and has caused significant issues in the past, for instance those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). It also necessitates a second attempt at merge once the .mmap() callback has completed, which has caused issues in the past, is awkward, adds overhead and is difficult to reason about. The .mmap_prepare() callback eliminates this requirement, as we can update fields prior to even attempting the first merge.
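To make the driver-side shape concrete, a conversion might look like the minimal sketch below (hypothetical mydrv names and limit; the shape follows the secretmem conversion later in this series):

	/* Hypothetical driver hook: only whitelisted vm_area_desc fields change. */
	static int mydrv_mmap_prepare(struct vm_area_desc *desc)
	{
		if (desc->end - desc->start > MYDRV_MAX_LEN)	/* hypothetical limit */
			return -EINVAL;		/* trivial unwind: no VMA exists yet */

		desc->vm_flags |= VM_DONTEXPAND;
		desc->vm_ops = &mydrv_vm_ops;		/* write-only field */
		desc->private_data = &mydrv_ctx;	/* becomes vma->vm_private_data */
		return 0;
	}

	static const struct file_operations mydrv_fops = {
		.mmap_prepare	= mydrv_mmap_prepare,	/* mutually exclusive with .mmap */
	};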
It is safer, as we heavily restrict what can actually be modified, and being invoked very early in the mmap() process, error handling can be performed safely with very little unwinding of state required. The .mmap_prepare() and deprecated .mmap() callbacks are mutually exclusive, so we permit only one to be invoked at a time. Update vma userland test stubs to account for changes. Link: https://lkml.kernel.org/r/cover.1746792520.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: David Hildenbrand Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/fs.h | 25 ++++++++++++ include/linux/mm_types.h | 24 +++++++++++ mm/memory.c | 3 +- mm/mmap.c | 2 +- mm/vma.c | 68 +++++++++++++++++++++++++++++++- tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++--- 6 files changed, 180 insertions(+), 8 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 016b0fe1536e..e2721a1ff13d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2169,6 +2169,7 @@ struct file_operations { int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, unsigned int poll_flags); + int (*mmap_prepare)(struct vm_area_desc *); } __randomize_layout; /* Supports async buffered reads */ @@ -2238,11 +2239,35 @@ struct inode_operations { struct offset_ctx *(*get_offset_ctx)(struct inode *inode); } ____cacheline_aligned; +/* Did the driver provide valid mmap hook configuration? */ +static inline bool file_has_valid_mmap_hooks(struct file *file) +{ + bool has_mmap = file->f_op->mmap; + bool has_mmap_prepare = file->f_op->mmap_prepare; + + /* Hooks are mutually exclusive. */ + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare)) + return false; + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare)) + return false; + + return true; +} + static inline int call_mmap(struct file *file, struct vm_area_struct *vma) { + if (WARN_ON_ONCE(file->f_op->mmap_prepare)) + return -EINVAL; + return file->f_op->mmap(file, vma); } +static inline int __call_mmap_prepare(struct file *file, + struct vm_area_desc *desc) +{ + return file->f_op->mmap_prepare(desc); +} + extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e76bade9ebb1..15808cad2bc1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -763,6 +763,30 @@ struct vma_numab_state { int prev_scan_seq; }; +/* + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to + * manipulate mutable fields which will cause those fields to be updated in the + * resultant VMA. + * + * Helper functions are not required for manipulating any field. + */ +struct vm_area_desc { + /* Immutable state. */ + struct mm_struct *mm; + unsigned long start; + unsigned long end; + + /* Mutable fields. Populated with initial state. */ + pgoff_t pgoff; + struct file *file; + vm_flags_t vm_flags; + pgprot_t page_prot; + + /* Write-only fields. 
*/ + const struct vm_operations_struct *vm_ops; + void *private_data; +}; + /* * This struct describes a virtual memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory diff --git a/mm/memory.c b/mm/memory.c index 68c1d962d0ad..99af83434e7c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, dump_page(page, "bad pte"); pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n", (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n", + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n", vma->vm_file, vma->vm_ops ? vma->vm_ops->fault : NULL, vma->vm_file ? vma->vm_file->f_op->mmap : NULL, + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL, mapping ? mapping->a_ops->read_folio : NULL); dump_stack(); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); diff --git a/mm/mmap.c b/mm/mmap.c index 81dd962a1cfc..50f902c08341 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags &= ~VM_MAYEXEC; } - if (!file->f_op->mmap) + if (!file_has_valid_mmap_hooks(file)) return -ENODEV; if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) return -EINVAL; diff --git a/mm/vma.c b/mm/vma.c index 1f2634b29568..3f32e04bb6cc 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -17,6 +17,11 @@ struct mmap_state { unsigned long pglen; unsigned long flags; struct file *file; + pgprot_t page_prot; + + /* User-defined fields, perhaps updated by .mmap_prepare(). */ + const struct vm_operations_struct *vm_ops; + void *vm_private_data; unsigned long charged; bool retry_merge; @@ -40,6 +45,7 @@ struct mmap_state { .pglen = PHYS_PFN(len_), \ .flags = flags_, \ .file = file_, \ + .page_prot = vm_get_page_prot(flags_), \ } #define VMG_MMAP_STATE(name, map_, vma_) \ @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map, int error; vma->vm_file = get_file(map->file); + + if (!map->file->f_op->mmap) + return 0; + error = mmap_file(vma->vm_file, vma); if (error) { fput(vma->vm_file); @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) vma_iter_config(vmi, map->addr, map->end); vma_set_range(vma, map->addr, map->end, map->pgoff); vm_flags_init(vma, map->flags); - vma->vm_page_prot = vm_get_page_prot(map->flags); + vma->vm_page_prot = map->page_prot; if (vma_iter_prealloc(vmi, vma)) { error = -ENOMEM; @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma) vma_set_page_prot(vma); } +/* + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that + * specifies it. + * + * This is called prior to any merge attempt, and updates whitelisted fields + * that are permitted to be updated by the caller. + * + * All but user-defined fields will be pre-populated with original values. + * + * Returns 0 on success, or an error code otherwise. + */ +static int call_mmap_prepare(struct mmap_state *map) +{ + int err; + struct vm_area_desc desc = { + .mm = map->mm, + .start = map->addr, + .end = map->end, + + .pgoff = map->pgoff, + .file = map->file, + .vm_flags = map->flags, + .page_prot = map->page_prot, + }; + + /* Invoke the hook. */ + err = __call_mmap_prepare(map->file, &desc); + if (err) + return err; + + /* Update fields permitted to be changed. 
*/ + map->pgoff = desc.pgoff; + map->file = desc.file; + map->flags = desc.vm_flags; + map->page_prot = desc.page_prot; + /* User-defined fields. */ + map->vm_ops = desc.vm_ops; + map->vm_private_data = desc.private_data; + + return 0; +} + +static void set_vma_user_defined_fields(struct vm_area_struct *vma, + struct mmap_state *map) +{ + if (map->vm_ops) + vma->vm_ops = map->vm_ops; + vma->vm_private_data = map->vm_private_data; +} + static unsigned long __mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf) @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; int error; + bool have_mmap_prepare = file && file->f_op->mmap_prepare; VMA_ITERATOR(vmi, mm, addr); MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file); error = __mmap_prepare(&map, uf); + if (!error && have_mmap_prepare) + error = call_mmap_prepare(&map); if (error) goto abort_munmap; @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, goto unacct_error; } + if (have_mmap_prepare) + set_vma_user_defined_fields(vma, &map); + /* If flags changed, we might be able to merge, so try again. */ if (map.retry_merge) { struct vm_area_struct *merged; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 198abe66de5a..f6e45e62da3a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -253,8 +253,40 @@ struct mm_struct { unsigned long flags; /* Must use atomic bitops to access */ }; +struct vm_area_struct; + +/* + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to + * manipulate mutable fields which will cause those fields to be updated in the + * resultant VMA. + * + * Helper functions are not required for manipulating any field. + */ +struct vm_area_desc { + /* Immutable state. */ + struct mm_struct *mm; + unsigned long start; + unsigned long end; + + /* Mutable fields. Populated with initial state. */ + pgoff_t pgoff; + struct file *file; + vm_flags_t vm_flags; + pgprot_t page_prot; + + /* Write-only fields. */ + const struct vm_operations_struct *vm_ops; + void *private_data; +}; + +struct file_operations { + int (*mmap)(struct file *, struct vm_area_struct *); + int (*mmap_prepare)(struct vm_area_desc *); +}; + struct file { struct address_space *f_mapping; + const struct file_operations *f_op; }; #define VMA_LOCK_OFFSET 0x40000000 @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma, vma->__vm_flags &= ~flags; } -static inline int call_mmap(struct file *, struct vm_area_struct *) -{ - return 0; -} - static inline int shmem_zero_setup(struct vm_area_struct *) { return 0; @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma) (void)vma; } +/* Did the driver provide valid mmap hook configuration? */ +static inline bool file_has_valid_mmap_hooks(struct file *file) +{ + bool has_mmap = file->f_op->mmap; + bool has_mmap_prepare = file->f_op->mmap_prepare; + + /* Hooks are mutually exclusive. 
*/ + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare)) + return false; + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare)) + return false; + + return true; +} + +static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +{ + if (WARN_ON_ONCE(file->f_op->mmap_prepare)) + return -EINVAL; + + return file->f_op->mmap(file, vma); +} + +static inline int __call_mmap_prepare(struct file *file, + struct vm_area_desc *desc) +{ + return file->f_op->mmap_prepare(desc); +} + #endif /* __MM_VMA_INTERNAL_H */ From 439b3fb0b01005e08e91578278475269215e9a7f Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 9 May 2025 13:13:35 +0100 Subject: [PATCH 230/285] mm: secretmem: convert to .mmap_prepare() hook Secretmem has a simple .mmap() hook which is easily converted to the new .mmap_prepare() callback. Importantly, it's a rare instance of an driver that manipulates a VMA which is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA flags which may adjust mergeability, meaning the retry merge logic might impact whether or not the VMA is merged. By using .mmap_prepare() there's no longer any need to retry the merge later as we can simply set the correct flags from the start. This change therefore allows us to remove the retry merge logic in a subsequent commit. Link: https://lkml.kernel.org/r/0f758474fa6a30197bdf25ba62f898a69d84eef3.1746792520.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Mike Rapoport (Microsoft) Reviewed-by: David Hildenbrand Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/secretmem.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/secretmem.c b/mm/secretmem.c index 1b0a214ee558..589b26c2d553 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file) return 0; } -static int secretmem_mmap(struct file *file, struct vm_area_struct *vma) +static int secretmem_mmap_prepare(struct vm_area_desc *desc) { - unsigned long len = vma->vm_end - vma->vm_start; + const unsigned long len = desc->end - desc->start; - if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) + if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) return -EINVAL; - if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len)) + if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len)) return -EAGAIN; - vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP); - vma->vm_ops = &secretmem_vm_ops; + desc->vm_flags |= VM_LOCKED | VM_DONTDUMP; + desc->vm_ops = &secretmem_vm_ops; return 0; } @@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma) static const struct file_operations secretmem_fops = { .release = secretmem_release, - .mmap = secretmem_mmap, + .mmap_prepare = secretmem_mmap_prepare, }; static int secretmem_migrate_folio(struct address_space *mapping, From 3c06ee7c24c27466c73209e00ee4c99e0115da1e Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 9 May 2025 13:13:36 +0100 Subject: [PATCH 231/285] mm/vma: remove mmap() retry merge We have now introduced a mechanism that obviates the need for a reattempted merge via the mmap_prepare() file hook, so eliminate this functionality altogether. 
The retry merge logic has been the cause of a great deal of complexity in the past and required a great deal of careful manoeuvring of code to ensure its continued and correct functionality. It has also recently been involved in an issue surrounding maple tree state, which again points to its problematic nature. We make it much easier to reason about mmap() logic by eliminating this and simply writing a VMA once. This also opens the doors to future optimisation and improvement in the mmap() logic. For any device or file system which encounters unwanted VMA fragmentation as a result of this change (that is, having not implemented .mmap_prepare hooks), the issue is easily resolvable by doing so. Link: https://lkml.kernel.org/r/d5d8fc74f02b89d6bec5ae8bc0e36d7853b65cda.1746792520.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: David Hildenbrand Reviewed-by: Vlastimil Babka Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport (Microsoft) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vma.c | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 3f32e04bb6cc..3ff6cfbe3338 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -24,7 +24,6 @@ struct mmap_state { void *vm_private_data; unsigned long charged; - bool retry_merge; struct vm_area_struct *prev; struct vm_area_struct *next; @@ -2417,8 +2416,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, !(map->flags & VM_MAYWRITE) && (vma->vm_flags & VM_MAYWRITE)); - /* If the flags change (and are mergeable), let's retry later. */ - map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL); map->flags = vma->vm_flags; return 0; @@ -2622,17 +2619,6 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, if (have_mmap_prepare) set_vma_user_defined_fields(vma, &map); - /* If flags changed, we might be able to merge, so try again. */ - if (map.retry_merge) { - struct vm_area_struct *merged; - VMG_MMAP_STATE(vmg, &map, vma); - - vma_iter_config(map.vmi, map.addr, map.end); - merged = vma_merge_existing_range(&vmg); - if (merged) - vma = merged; - } - __mmap_complete(&map, vma); return addr; From 0cad6736f4b9bbe8129f367ed5818fa4aef6c2b5 Mon Sep 17 00:00:00 2001 From: Feng Lee <379943137@qq.com> Date: Fri, 9 May 2025 14:32:30 +0800 Subject: [PATCH 232/285] mm: remove obsolete pgd_offset_gate() Remove pgd_offset_gate() completely and simply make the single caller use pgd_offset(). It appears that the gate area resides in the kernel-mapped segment exclusively on IA64. Therefore, removing pgd_offset_k is safe since IA64 is now obsolete. 
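For context, pgd_offset_k() is merely shorthand for walking the kernel's own page tables, per its generic definition in include/linux/pgtable.h:

	#define pgd_offset_k(address)	pgd_offset(&init_mm, (address))

so with the IA64-style kernel-mapped gate area gone, the single remaining caller can always walk the faulting mm's tables via pgd_offset(), as the diff below shows.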
Link: https://lkml.kernel.org/r/tencent_503130C3CD56569191396268CF4D12F09A06@qq.com Signed-off-by: Feng Lee <379943137@qq.com> Reviewed-by: Barry Song Acked-by: David Hildenbrand Cc: Anshuman Khandual Cc: bibo mao Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: John Hubbard Cc: Lance Yang Cc: Peter Xu Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 4 ---- mm/gup.c | 5 +---- 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index b50447ef1c92..f1e890b60460 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1164,10 +1164,6 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) } #endif -#ifndef __HAVE_ARCH_PGD_OFFSET_GATE -#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) -#endif - #ifndef __HAVE_ARCH_MOVE_PTE #define move_pte(pte, old_addr, new_addr) (pte) #endif diff --git a/mm/gup.c b/mm/gup.c index d3aac58862c0..329c5f7acc7a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1102,10 +1102,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, /* user gate pages are read-only */ if (gup_flags & FOLL_WRITE) return -EFAULT; - if (address > TASK_SIZE) - pgd = pgd_offset_k(address); - else - pgd = pgd_offset_gate(mm, address); + pgd = pgd_offset(mm, address); if (pgd_none(*pgd)) return -EFAULT; p4d = p4d_offset(pgd, address); From 2fba5961c64c56d70fa8380da3b14f32b20a4580 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 6 May 2025 15:55:30 -0700 Subject: [PATCH 233/285] memcg: simplify consume_stock Patch series "memcg: decouple memcg and objcg stocks", v3. The per-cpu memcg charge cache and objcg charge cache are coupled in a single struct memcg_stock_pcp, and a single local lock is used to protect both of the caches. This makes it challenging to make memcg charging and objcg charging nmi safe. Decoupling the memcg and objcg stocks would allow us to make them nmi safe and even to work without disabling irqs, independently. This series completely decouples the memcg and objcg stocks. To evaluate the impact of this series with and without PREEMPT_RT config, we ran varying numbers of netperf clients in different cgroups on a 72 CPU machine. $ netserver -6 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K PREEMPT_RT config: ------------------ number of clients | Without series | With series 6 | 38559.1 Mbps | 38652.6 Mbps 12 | 37388.8 Mbps | 37560.1 Mbps 18 | 30707.5 Mbps | 31378.3 Mbps 24 | 25908.4 Mbps | 26423.9 Mbps 30 | 22347.7 Mbps | 22326.5 Mbps 36 | 20235.1 Mbps | 20165.0 Mbps !PREEMPT_RT config: ------------------- number of clients | Without series | With series 6 | 50235.7 Mbps | 51415.4 Mbps 12 | 49336.5 Mbps | 49901.4 Mbps 18 | 46306.8 Mbps | 46482.7 Mbps 24 | 38145.7 Mbps | 38729.4 Mbps 30 | 30347.6 Mbps | 31698.2 Mbps 36 | 26976.6 Mbps | 27364.4 Mbps No performance regression was observed. This patch (of 4): consume_stock() does not need to check gfp_mask for spinning and can simply trylock the local lock to decide to proceed or fail. There is no need to spin at all for a local lock. One of the concerns raised was that on PREEMPT_RT kernels, this trylock can fail more often because a task holding the local_lock can be preempted. This can potentially cause the task which preempted the local_lock holder to take the slow path of memcg charging. However, this behavior will only impact performance if the memcg charging slowpath is worse than two context switches and, possibly, the scheduling delay behavior of the current code.
From the network intensive workload experiment it does not seem like the case. We ran varying number of netperf clients in different cgroups on a 72 CPU machine for PREEMPT_RT config. $ netserver -6 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K number of clients | Without series | With series 6 | 38559.1 Mbps | 38652.6 Mbps 12 | 37388.8 Mbps | 37560.1 Mbps 18 | 30707.5 Mbps | 31378.3 Mbps 24 | 25908.4 Mbps | 26423.9 Mbps 30 | 22347.7 Mbps | 22326.5 Mbps 36 | 20235.1 Mbps | 20165.0 Mbps We don't see any significant performance difference for the network intensive workload with this series. Link: https://lkml.kernel.org/r/20250506225533.2580386-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250506225533.2580386-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Eric Dumaze Cc: Jakub Kacinski Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4108ff00124b..d5b1df2326f7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1806,16 +1806,14 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, * consume_stock: Try to consume stocked charge on this cpu. * @memcg: memcg to consume from. * @nr_pages: how many pages to charge. - * @gfp_mask: allocation mask. * - * The charges will only happen if @memcg matches the current cpu's memcg - * stock, and at least @nr_pages are available in that stock. Failure to - * service an allocation will refill the stock. + * Consume the cached charge if enough nr_pages are present otherwise return + * failure. Also return failure for charge request larger than + * MEMCG_CHARGE_BATCH or if the local lock is already taken. * * returns true if successful, false otherwise. */ -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, - gfp_t gfp_mask) +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; uint8_t stock_pages; @@ -1823,12 +1821,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, bool ret = false; int i; - if (nr_pages > MEMCG_CHARGE_BATCH) - return ret; - - if (gfpflags_allow_spinning(gfp_mask)) - local_lock_irqsave(&memcg_stock.stock_lock, flags); - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + if (nr_pages > MEMCG_CHARGE_BATCH || + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -2331,7 +2325,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long pflags; retry: - if (consume_stock(memcg, nr_pages, gfp_mask)) + if (consume_stock(memcg, nr_pages)) return 0; if (!gfpflags_allow_spinning(gfp_mask)) From 3523dd7af41363f97044ca80cf835ba6e853b9f7 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 6 May 2025 15:55:31 -0700 Subject: [PATCH 234/285] memcg: separate local_trylock for memcg and obj The per-cpu stock_lock protects cached memcg and cached objcg and their respective fields. However there is no dependency between these fields and it is better to have fine grained separate locks for cached memcg and cached objcg. This decoupling of locks allows us to make the memcg charge cache and objcg charge cache to be nmi safe independently. 
At the moment, memcg charge cache is already nmi safe and this decoupling will allow to make memcg charge cache work without disabling irqs. Link: https://lkml.kernel.org/r/20250506225533.2580386-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Eric Dumaze Cc: Jakub Kacinski Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 49 ++++++++++++++++++++++++++----------------------- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5b1df2326f7..352aaaeaa536 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1779,13 +1779,14 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) */ #define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { - local_trylock_t stock_lock; + local_trylock_t memcg_lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; + local_trylock_t obj_lock; + unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; - unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; @@ -1794,7 +1795,8 @@ struct memcg_stock_pcp { #define FLUSHING_CACHED_CHARGE 0 }; static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { - .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock), + .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), + .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), }; static DEFINE_MUTEX(percpu_charge_mutex); @@ -1822,7 +1824,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1839,7 +1841,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); return ret; } @@ -1885,19 +1887,19 @@ static void drain_local_stock(struct work_struct *dummy) struct memcg_stock_pcp *stock; unsigned long flags; - /* - * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. - * drain_stock races is that we always operate on local CPU stock - * here with IRQ disabled - */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); drain_obj_stock(stock); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + + local_lock_irqsave(&memcg_stock.memcg_lock, flags); + stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1920,10 +1922,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu stock_lock, uncharge memcg directly. + * lock the percpu memcg_lock, uncharge memcg directly. 
*/ memcg_uncharge(memcg, nr_pages); return; @@ -1955,7 +1957,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); } static bool is_drain_needed(struct memcg_stock_pcp *stock, @@ -2030,11 +2032,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) stock = &per_cpu(memcg_stock, cpu); - /* drain_obj_stock requires stock_lock */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + /* drain_obj_stock requires obj_lock */ + local_lock_irqsave(&memcg_stock.obj_lock, flags); drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* no need for the local lock */ drain_stock_fully(stock); return 0; @@ -2887,7 +2890,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { @@ -2898,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); return ret; } @@ -2987,7 +2990,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ @@ -3009,7 +3012,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); From c80509ef65e4e48b1a5cf9dbb98a3ee8be1accb3 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 6 May 2025 15:55:32 -0700 Subject: [PATCH 235/285] memcg: completely decouple memcg and obj stocks Let's completely decouple the memcg and obj per-cpu stocks. This will enable us to make memcg per-cpu stocks to used without disabling irqs. Also it will enable us to make obj stocks nmi safe independently which is required to make kmalloc/slab safe for allocations from nmi context. Link: https://lkml.kernel.org/r/20250506225533.2580386-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Eric Dumaze Cc: Jakub Kacinski Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++------------------ 1 file changed, 92 insertions(+), 57 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 352aaaeaa536..1f8611e9a8ff 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1778,12 +1778,22 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) * nr_pages in a single cacheline. This may change in future. 
*/ #define NR_MEMCG_STOCK 7 +#define FLUSHING_CACHED_CHARGE 0 struct memcg_stock_pcp { - local_trylock_t memcg_lock; + local_trylock_t lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; - local_trylock_t obj_lock; + struct work_struct work; + unsigned long flags; +}; + +static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), +}; + +struct obj_stock_pcp { + local_trylock_t lock; unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; @@ -1792,16 +1802,16 @@ struct memcg_stock_pcp { struct work_struct work; unsigned long flags; -#define FLUSHING_CACHED_CHARGE 0 }; -static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { - .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), - .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), + +static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), }; + static DEFINE_MUTEX(percpu_charge_mutex); -static void drain_obj_stock(struct memcg_stock_pcp *stock); -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static void drain_obj_stock(struct obj_stock_pcp *stock); +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg); /** @@ -1824,7 +1834,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) + !local_trylock_irqsave(&memcg_stock.lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1841,7 +1851,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); return ret; } @@ -1882,7 +1892,7 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) drain_stock(stock, i); } -static void drain_local_stock(struct work_struct *dummy) +static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -1890,16 +1900,30 @@ static void drain_local_stock(struct work_struct *dummy) if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&memcg_stock.obj_lock, flags); - stock = this_cpu_ptr(&memcg_stock); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&memcg_stock.lock, flags); - local_lock_irqsave(&memcg_stock.memcg_lock, flags); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + + local_unlock_irqrestore(&memcg_stock.lock, flags); +} + +static void drain_local_obj_stock(struct work_struct *dummy) +{ + struct obj_stock_pcp *stock; + unsigned long flags; + + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + + local_lock_irqsave(&obj_stock.lock, flags); + + stock = this_cpu_ptr(&obj_stock); + drain_obj_stock(stock); + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); + + local_unlock_irqrestore(&obj_stock.lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1922,10 +1946,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.lock, flags)) { /* * In case of 
larger than batch refill or unlikely failure to - * lock the percpu memcg_lock, uncharge memcg directly. + * lock the percpu memcg_stock.lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; @@ -1957,23 +1981,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); } -static bool is_drain_needed(struct memcg_stock_pcp *stock, - struct mem_cgroup *root_memcg) +static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, + struct mem_cgroup *root_memcg) { struct mem_cgroup *memcg; bool flush = false; int i; rcu_read_lock(); - - if (obj_stock_flush_required(stock, root_memcg)) { - flush = true; - goto out; - } - for (i = 0; i < NR_MEMCG_STOCK; ++i) { memcg = READ_ONCE(stock->cached[i]); if (!memcg) @@ -1985,7 +2003,6 @@ static bool is_drain_needed(struct memcg_stock_pcp *stock, break; } } -out: rcu_read_unlock(); return flush; } @@ -2010,15 +2027,27 @@ void drain_all_stock(struct mem_cgroup *root_memcg) migrate_disable(); curcpu = smp_processor_id(); for_each_online_cpu(cpu) { - struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); - bool flush = is_drain_needed(stock, root_memcg); + struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu); + struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu); - if (flush && - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { + if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) && + is_memcg_drain_needed(memcg_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &memcg_st->flags)) { if (cpu == curcpu) - drain_local_stock(&stock->work); + drain_local_memcg_stock(&memcg_st->work); else if (!cpu_is_isolated(cpu)) - schedule_work_on(cpu, &stock->work); + schedule_work_on(cpu, &memcg_st->work); + } + + if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) && + obj_stock_flush_required(obj_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &obj_st->flags)) { + if (cpu == curcpu) + drain_local_obj_stock(&obj_st->work); + else if (!cpu_is_isolated(cpu)) + schedule_work_on(cpu, &obj_st->work); } } migrate_enable(); @@ -2027,18 +2056,18 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *obj_st; unsigned long flags; - stock = &per_cpu(memcg_stock, cpu); + obj_st = &per_cpu(obj_stock, cpu); - /* drain_obj_stock requires obj_lock */ - local_lock_irqsave(&memcg_stock.obj_lock, flags); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* drain_obj_stock requires objstock.lock */ + local_lock_irqsave(&obj_stock.lock, flags); + drain_obj_stock(obj_st); + local_unlock_irqrestore(&obj_stock.lock, flags); /* no need for the local lock */ - drain_stock_fully(stock); + drain_stock_fully(&per_cpu(memcg_stock, cpu)); return 0; } @@ -2835,7 +2864,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) } static void __account_obj_stock(struct obj_cgroup *objcg, - struct memcg_stock_pcp *stock, int nr, + struct obj_stock_pcp *stock, int nr, struct pglist_data *pgdat, enum node_stat_item idx) { int *bytes; @@ -2886,13 +2915,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg, static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp 
*stock; unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; @@ -2901,12 +2930,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); return ret; } -static void drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct obj_stock_pcp *stock) { struct obj_cgroup *old = READ_ONCE(stock->cached_objcg); @@ -2967,32 +2996,35 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) obj_cgroup_put(old); } -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg) { struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg); struct mem_cgroup *memcg; + bool flush = false; + rcu_read_lock(); if (objcg) { memcg = obj_cgroup_memcg(objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) - return true; + flush = true; } + rcu_read_unlock(); - return false; + return flush; } static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge, int nr_acct, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ drain_obj_stock(stock); obj_cgroup_get(objcg); @@ -3012,7 +3044,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -5077,9 +5109,12 @@ int __init mem_cgroup_init(void) cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL, memcg_hotplug_cpu_dead); - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, - drain_local_stock); + drain_local_memcg_stock); + INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work, + drain_local_obj_stock); + } memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids); memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, From 9e619cd4fefd19cdce16e169d5827bc64ae01aa1 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 6 May 2025 15:55:33 -0700 Subject: [PATCH 236/285] memcg: no irq disable for memcg stock lock There is no need to disable irqs to use the memcg per-cpu stock, so let's just not do that. One consequence of this change is that if the kernel, while in task context, holds the memcg stock lock and that cpu gets interrupted, then memcg charges made on that cpu in irq context will take the slow path of memcg charging. However, that should be super rare and should be fine in general.
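To make the interrupted-holder scenario concrete, here is a simplified sketch of the resulting fast path (condensed from the consume_stock() hunk in the diff that follows; not a verbatim copy):

	static bool consume_stock_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
	{
		/*
		 * Without irqsave, an irq can arrive while a task on this cpu
		 * holds memcg_stock.lock; the irq-context charge then simply
		 * fails the trylock and falls back to the slow path instead
		 * of deadlocking.
		 */
		if (nr_pages > MEMCG_CHARGE_BATCH ||
		    !local_trylock(&memcg_stock.lock))
			return false;	/* caller takes the slow charge path */

		/* ... consume cached pages from this cpu's stock ... */

		local_unlock(&memcg_stock.lock);
		return true;
	}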
Link: https://lkml.kernel.org/r/20250506225533.2580386-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Eric Dumaze Cc: Jakub Kacinski Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1f8611e9a8ff..2b6768f0e916 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1829,12 +1829,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; uint8_t stock_pages; - unsigned long flags; bool ret = false; int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.lock, flags)) + !local_trylock(&memcg_stock.lock)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1851,7 +1850,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); return ret; } @@ -1895,18 +1894,17 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; - unsigned long flags; if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&memcg_stock.lock, flags); + local_lock(&memcg_stock.lock); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static void drain_local_obj_stock(struct work_struct *dummy) @@ -1931,7 +1929,6 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) struct memcg_stock_pcp *stock; struct mem_cgroup *cached; uint8_t stock_pages; - unsigned long flags; bool success = false; int empty_slot = -1; int i; @@ -1946,7 +1943,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.lock, flags)) { + !local_trylock(&memcg_stock.lock)) { /* * In case of larger than batch refill or unlikely failure to * lock the percpu memcg_stock.lock, uncharge memcg directly. @@ -1981,7 +1978,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, From e341f9c3c8412e57fe0042a33a2640245ecdf619 Mon Sep 17 00:00:00 2001 From: Joshua Hahn Date: Mon, 5 May 2025 11:23:28 -0700 Subject: [PATCH 237/285] mm/mempolicy: Weighted Interleave Auto-tuning On machines with multiple memory nodes, interleaving page allocations across nodes allows for better utilization of each node's bandwidth. Previous work by Gregory Price [1] introduced weighted interleave, which allowed for pages to be allocated across nodes according to user-set ratios. Ideally, these weights should be proportional to their bandwidth, so that under bandwidth pressure, each node uses its maximal efficient bandwidth and prevents latency from increasing exponentially. 
Previously, weighted interleave's default weights were just 1s -- which would be equivalent to the (unweighted) interleave mempolicy, which goes through the nodes in a round-robin fashion, ignoring bandwidth information. This patch has two main goals: First, it makes weighted interleave easier to use for users who wish to relieve bandwidth pressure when using nodes with varying bandwidth (CXL). By providing a set of "real" default weights that just work out of the box, users who might not have the capability (or wish) to perform experimentation to find the optimal weights for their system can still take advantage of bandwidth-informed weighted interleave. Second, it allows weighted interleave to dynamically adjust to hotplugged memory with new bandwidth information. Instead of manually updating node weights every time new bandwidth information is reported or removed, weighted interleave adjusts and provides a new set of default weights to use whenever there is a change in bandwidth information. To meet these goals, this patch introduces an auto-configuration mode for the interleave weights that provides a reasonable set of default weights, calculated using bandwidth data reported by the system. In auto mode, weights are dynamically adjusted based on whatever the current bandwidth information reports (and responds to hotplug events). This patch still supports users manually writing weights into the nodeN sysfs interface by entering manual mode. When a user enters manual mode, the system stops dynamically updating any of the node weights, even during hotplug events that shift the optimal weight distribution. A new sysfs interface "auto" is introduced, which allows users to switch between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The system also automatically enters manual mode when a nodeN interface is manually written to. There is one functional change that this patch makes to the existing weighted_interleave ABI: previously, writing 0 directly to a nodeN interface was said to reset the weight to the system default. Before this patch, the default for all weights was 1, which meant that writing 0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.
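As a worked example of the auto-mode derivation (implemented by reduce_interleave_weights() in the mm/mempolicy.c hunk below), assume two memory nodes reporting effective bandwidths of 60 GB/s and 20 GB/s. With weightiness = 32, each node's raw weight is 32 * bw[nid] / sum_bw: 32 * 60 / 80 = 24 and 32 * 20 / 80 = 8. Dividing by gcd(24, 8) = 8 reduces the pair to 3:1, so three pages are allocated on the faster node for every one page on the slower node.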
Link: https://lkml.kernel.org/r/20250520141236.2987309-1-joshua.hahnjy@gmail.com [joshua.hahnjy@gmail.com: wordsmithing changes, simplification, fixes] Link: https://lkml.kernel.org/r/20250511025840.2410154-1-joshua.hahnjy@gmail.com [joshua.hahnjy@gmail.com: remove auto_kobj_attr field from struct sysfs_wi_group] Link: https://lkml.kernel.org/r/20250512142511.3959833-1-joshua.hahnjy@gmail.com https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/ [1] Link: https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com Co-developed-by: Gregory Price Signed-off-by: Gregory Price Signed-off-by: Joshua Hahn Suggested-by: Yunjeong Mun Suggested-by: Oscar Salvador Suggested-by: Ying Huang Suggested-by: Harry Yoo Reviewed-by: Harry Yoo Reviewed-by: Huang Ying Reviewed-by: Honggyu Kim Cc: Dan Williams Cc: Dave Jiang Cc: Greg Kroah-Hartman Cc: Jonathan Cameron Cc: Johannes Weiner Cc: Len Brown Signed-off-by: Andrew Morton --- ...fs-kernel-mm-mempolicy-weighted-interleave | 35 +- drivers/base/node.c | 9 + include/linux/mempolicy.h | 4 + mm/mempolicy.c | 326 ++++++++++++++---- 4 files changed, 311 insertions(+), 63 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave index 0b7972de04e9..649c0e9b895c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -20,6 +20,35 @@ Description: Weight configuration interface for nodeN Minimum weight: 1 Maximum weight: 255 - Writing an empty string or `0` will reset the weight to the - system default. The system default may be set by the kernel - or drivers at boot or during hotplug events. + Writing invalid values (i.e. any values not in [1,255], + empty string, ...) will return -EINVAL. + + Changing the weight to a valid value will automatically + switch the system to manual mode as well. + +What: /sys/kernel/mm/mempolicy/weighted_interleave/auto +Date: May 2025 +Contact: Linux memory management mailing list +Description: Auto-weighting configuration interface + + Configuration mode for weighted interleave. 'true' indicates + that the system is in auto mode, and 'false' indicates that + the system is in manual mode. + + In auto mode, all node weights are re-calculated and overwritten + (visible via the nodeN interfaces) whenever new bandwidth data + is made available during either boot or hotplug events. + + In manual mode, node weights can only be updated by the user. + Note that nodes that are onlined with previously set weights + will reuse those weights. If they were not previously set or + are onlined with missing bandwidth data, the weights will use + a default weight of 1. + + Writing any true value string (e.g. Y or 1) will enable auto + mode, while writing any false value string (e.g. N or 0) will + enable manual mode. All other strings are rejected and will + return -EINVAL. + + Writing a new weight to a node directly via the nodeN interface + will also automatically switch the system to manual mode.
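Assuming the interface described above, a typical interaction might look like this (the weight and node number are illustrative):

	# cat /sys/kernel/mm/mempolicy/weighted_interleave/auto
	true
	# echo 4 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
	# cat /sys/kernel/mm/mempolicy/weighted_interleave/auto
	false
	# echo Y > /sys/kernel/mm/mempolicy/weighted_interleave/auto

Writing a node weight drops the system into manual mode; re-enabling auto mode re-derives all node weights from the currently reported bandwidth data (and fails with -ENODEV if no bandwidth data is available).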
diff --git a/drivers/base/node.c b/drivers/base/node.c index cd13ef287011..25ab9ec14eb8 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord, break; } } + + /* When setting CPU access coordinates, update mempolicy */ + if (access == ACCESS_COORDINATE_CPU) { + if (mempolicy_set_node_perf(nid, coord)) { + pr_info("failed to set mempolicy attrs for node %d\n", + nid); + } + } } EXPORT_SYMBOL_GPL(node_set_perf_attrs); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index ce9885e0178a..0fe96f3ab3ef 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); +extern int mempolicy_set_node_perf(unsigned int node, + struct access_coordinate *coords); + #else struct mempolicy {}; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 9a2b4b36f558..72fd72e156b1 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -109,6 +109,7 @@ #include #include #include +#include #include #include @@ -140,31 +141,138 @@ static struct mempolicy default_policy = { static struct mempolicy preferred_node_policy[MAX_NUMNODES]; /* - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes - * system-default value should be used. A NULL iw_table also denotes that - * system-default values should be used. Until the system-default table - * is implemented, the system-default is always 1. - * - * iw_table is RCU protected + * weightiness balances the tradeoff between small weights (cycles through nodes + * faster, more fair/even distribution) and large weights (smaller errors + * between actual bandwidth ratios and weight ratios). 32 is a number that has + * been found to perform at a reasonable compromise between the two goals. */ -static u8 __rcu *iw_table; -static DEFINE_MUTEX(iw_table_lock); +static const int weightiness = 32; + +/* + * A null weighted_interleave_state is interpreted as having .mode="auto", + * and .iw_table is interpreted as an array of 1s with length nr_node_ids. + */ +struct weighted_interleave_state { + bool mode_auto; + u8 iw_table[]; +}; +static struct weighted_interleave_state __rcu *wi_state; +static unsigned int *node_bw_table; + +/* + * wi_state_lock protects both wi_state and node_bw_table. + * node_bw_table is only used by writers to update wi_state. + */ +static DEFINE_MUTEX(wi_state_lock); static u8 get_il_weight(int node) { - u8 *table; - u8 weight; + struct weighted_interleave_state *state; + u8 weight = 1; rcu_read_lock(); - table = rcu_dereference(iw_table); - /* if no iw_table, use system default */ - weight = table ? table[node] : 1; - /* if value in iw_table is 0, use system default */ - weight = weight ? weight : 1; + state = rcu_dereference(wi_state); + if (state) + weight = state->iw_table[node]; rcu_read_unlock(); return weight; } +/* + * Convert bandwidth values into weighted interleave weights. + * Call with wi_state_lock. 
+ */ +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) +{ + u64 sum_bw = 0; + unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0; + int nid; + + for_each_node_state(nid, N_MEMORY) + sum_bw += bw[nid]; + + /* Scale bandwidths to whole numbers in the range [1, weightiness] */ + for_each_node_state(nid, N_MEMORY) { + /* + * Try not to perform 64-bit division. + * If sum_bw < scaling_factor, then sum_bw < U32_MAX. + * If sum_bw > scaling_factor, then round the weight up to 1. + */ + scaling_factor = weightiness * bw[nid]; + if (bw[nid] && sum_bw < scaling_factor) { + cast_sum_bw = (unsigned int)sum_bw; + new_iw[nid] = scaling_factor / cast_sum_bw; + } else { + new_iw[nid] = 1; + } + if (!iw_gcd) + iw_gcd = new_iw[nid]; + iw_gcd = gcd(iw_gcd, new_iw[nid]); + } + + /* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */ + for_each_node_state(nid, N_MEMORY) + new_iw[nid] /= iw_gcd; +} + +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) +{ + struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; + unsigned int *old_bw, *new_bw; + unsigned int bw_val; + int i; + + bw_val = min(coords->read_bandwidth, coords->write_bandwidth); + new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL); + if (!new_bw) + return -ENOMEM; + + new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids), + GFP_KERNEL); + if (!new_wi_state) { + kfree(new_bw); + return -ENOMEM; + } + new_wi_state->mode_auto = true; + for (i = 0; i < nr_node_ids; i++) + new_wi_state->iw_table[i] = 1; + + /* + * Update bandwidth info, even in manual mode. That way, when switching + * to auto mode in the future, iw_table can be overwritten using + * accurate bw data. + */ + mutex_lock(&wi_state_lock); + + old_bw = node_bw_table; + if (old_bw) + memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw)); + new_bw[node] = bw_val; + node_bw_table = new_bw; + + old_wi_state = rcu_dereference_protected(wi_state, + lockdep_is_held(&wi_state_lock)); + if (old_wi_state && !old_wi_state->mode_auto) { + /* Manual mode; skip reducing weights and updating wi_state */ + mutex_unlock(&wi_state_lock); + kfree(new_wi_state); + goto out; + } + + /* NULL wi_state assumes auto=true; reduce weights and update wi_state*/ + reduce_interleave_weights(new_bw, new_wi_state->iw_table); + rcu_assign_pointer(wi_state, new_wi_state); + + mutex_unlock(&wi_state_lock); + if (old_wi_state) { + synchronize_rcu(); + kfree(old_wi_state); + } +out: + kfree(old_bw); + return 0; +} + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -2023,26 +2131,28 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) { + struct weighted_interleave_state *state; nodemask_t nodemask; unsigned int target, nr_nodes; - u8 *table; + u8 *table = NULL; unsigned int weight_total = 0; u8 weight; - int nid; + int nid = 0; nr_nodes = read_once_policy_nodemask(pol, &nodemask); if (!nr_nodes) return numa_node_id(); rcu_read_lock(); - table = rcu_dereference(iw_table); + + state = rcu_dereference(wi_state); + /* Uninitialized wi_state means we should assume all weights are 1 */ + if (state) + table = state->iw_table; + /* calculate the total weight */ - for_each_node_mask(nid, nodemask) { - /* detect system default usage */ - weight = table ? table[nid] : 1; - weight = weight ? 
weight : 1; - weight_total += weight; - } + for_each_node_mask(nid, nodemask) + weight_total += table ? table[nid] : 1; /* Calculate the node offset based on totals */ target = ilx % weight_total; @@ -2050,7 +2160,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) while (target) { /* detect system default usage */ weight = table ? table[nid] : 1; - weight = weight ? weight : 1; if (target < weight) break; target -= weight; @@ -2451,13 +2560,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) { + struct weighted_interleave_state *state; struct task_struct *me = current; unsigned int cpuset_mems_cookie; unsigned long total_allocated = 0; unsigned long nr_allocated = 0; unsigned long rounds; unsigned long node_pages, delta; - u8 *table, *weights, weight; + u8 *weights, weight; unsigned int weight_total = 0; unsigned long rem_pages = nr_pages; nodemask_t nodes; @@ -2507,17 +2617,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, return total_allocated; rcu_read_lock(); - table = rcu_dereference(iw_table); - if (table) - memcpy(weights, table, nr_node_ids); - rcu_read_unlock(); + state = rcu_dereference(wi_state); + if (state) { + memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8)); + rcu_read_unlock(); + } else { + rcu_read_unlock(); + for (i = 0; i < nr_node_ids; i++) + weights[i] = 1; + } /* calculate total, detect system default usage */ - for_each_node_mask(node, nodes) { - if (!weights[node]) - weights[node] = 1; + for_each_node_mask(node, nodes) weight_total += weights[node]; - } /* * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. @@ -3450,31 +3562,109 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { + struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; struct iw_node_attr *node_attr; - u8 *new; - u8 *old; u8 weight = 0; + int i; node_attr = container_of(attr, struct iw_node_attr, kobj_attr); - if (count == 0 || sysfs_streq(buf, "")) - weight = 0; - else if (kstrtou8(buf, 0, &weight)) + if (count == 0 || sysfs_streq(buf, "") || + kstrtou8(buf, 0, &weight) || weight == 0) return -EINVAL; - new = kzalloc(nr_node_ids, GFP_KERNEL); - if (!new) + new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids), + GFP_KERNEL); + if (!new_wi_state) return -ENOMEM; - mutex_lock(&iw_table_lock); - old = rcu_dereference_protected(iw_table, - lockdep_is_held(&iw_table_lock)); - if (old) - memcpy(new, old, nr_node_ids); - new[node_attr->nid] = weight; - rcu_assign_pointer(iw_table, new); - mutex_unlock(&iw_table_lock); - synchronize_rcu(); - kfree(old); + mutex_lock(&wi_state_lock); + old_wi_state = rcu_dereference_protected(wi_state, + lockdep_is_held(&wi_state_lock)); + if (old_wi_state) { + memcpy(new_wi_state->iw_table, old_wi_state->iw_table, + nr_node_ids * sizeof(u8)); + } else { + for (i = 0; i < nr_node_ids; i++) + new_wi_state->iw_table[i] = 1; + } + new_wi_state->iw_table[node_attr->nid] = weight; + new_wi_state->mode_auto = false; + + rcu_assign_pointer(wi_state, new_wi_state); + mutex_unlock(&wi_state_lock); + if (old_wi_state) { + synchronize_rcu(); + kfree(old_wi_state); + } + return count; +} + +static ssize_t weighted_interleave_auto_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct weighted_interleave_state 
*state; + bool wi_auto = true; + + rcu_read_lock(); + state = rcu_dereference(wi_state); + if (state) + wi_auto = state->mode_auto; + rcu_read_unlock(); + + return sysfs_emit(buf, "%s\n", str_true_false(wi_auto)); +} + +static ssize_t weighted_interleave_auto_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; + unsigned int *bw; + bool input; + int i; + + if (kstrtobool(buf, &input)) + return -EINVAL; + + new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids), + GFP_KERNEL); + if (!new_wi_state) + return -ENOMEM; + for (i = 0; i < nr_node_ids; i++) + new_wi_state->iw_table[i] = 1; + + mutex_lock(&wi_state_lock); + if (!input) { + old_wi_state = rcu_dereference_protected(wi_state, + lockdep_is_held(&wi_state_lock)); + if (!old_wi_state) + goto update_wi_state; + if (input == old_wi_state->mode_auto) { + mutex_unlock(&wi_state_lock); + return count; + } + + memcpy(new_wi_state->iw_table, old_wi_state->iw_table, + nr_node_ids * sizeof(u8)); + goto update_wi_state; + } + + bw = node_bw_table; + if (!bw) { + mutex_unlock(&wi_state_lock); + kfree(new_wi_state); + return -ENODEV; + } + + new_wi_state->mode_auto = true; + reduce_interleave_weights(bw, new_wi_state->iw_table); + +update_wi_state: + rcu_assign_pointer(wi_state, new_wi_state); + mutex_unlock(&wi_state_lock); + if (old_wi_state) { + synchronize_rcu(); + kfree(old_wi_state); + } return count; } @@ -3508,23 +3698,35 @@ static void sysfs_wi_node_delete_all(void) sysfs_wi_node_delete(nid); } -static void iw_table_free(void) +static void wi_state_free(void) { - u8 *old; + struct weighted_interleave_state *old_wi_state; - mutex_lock(&iw_table_lock); - old = rcu_dereference_protected(iw_table, - lockdep_is_held(&iw_table_lock)); - rcu_assign_pointer(iw_table, NULL); - mutex_unlock(&iw_table_lock); + mutex_lock(&wi_state_lock); + old_wi_state = rcu_dereference_protected(wi_state, + lockdep_is_held(&wi_state_lock)); + if (!old_wi_state) { + mutex_unlock(&wi_state_lock); + goto out; + } + + rcu_assign_pointer(wi_state, NULL); + mutex_unlock(&wi_state_lock); synchronize_rcu(); - kfree(old); + kfree(old_wi_state); +out: + kfree(&wi_group->wi_kobj); } +static struct kobj_attribute wi_auto_attr = + __ATTR(auto, 0664, weighted_interleave_auto_show, + weighted_interleave_auto_store); + static void wi_cleanup(void) { + sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr); sysfs_wi_node_delete_all(); - iw_table_free(); + wi_state_free(); } static void wi_kobj_release(struct kobject *wi_kobj) @@ -3627,6 +3829,10 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj) if (err) goto err_put_kobj; + err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr); + if (err) + goto err_put_kobj; + for_each_online_node(nid) { if (!node_state(nid, N_MEMORY)) continue; From 1c1db467068d5e57e58756666f7dc54c9dda9b2c Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 7 May 2025 18:00:08 +0200 Subject: [PATCH 238/285] kmsan: apply clang-format to files mm/kmsan/ KMSAN source files are expected to be formatted with clang-format, fix some nits that slipped in. No functional change. 
Link: https://lkml.kernel.org/r/20250507160012.3311104-1-glider@google.com Signed-off-by: Alexander Potapenko Cc: Ilya Leoshkevich Cc: Bart van Assche Cc: Dmitriy Vyukov Cc: Kent Overstreet Cc: Macro Elver Signed-off-by: Andrew Morton --- mm/kmsan/core.c | 4 ++-- mm/kmsan/hooks.c | 4 +--- mm/kmsan/init.c | 3 +-- mm/kmsan/shadow.c | 3 +-- 4 files changed, 5 insertions(+), 9 deletions(-) diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index a495debf1436..a97dc90fa6a9 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -159,8 +159,8 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id) * Make sure we have enough spare bits in @id to hold the UAF bit and * the chain depth. */ - BUILD_BUG_ON( - (1 << STACK_DEPOT_EXTRA_BITS) <= (KMSAN_MAX_ORIGIN_DEPTH << 1)); + BUILD_BUG_ON((1 << STACK_DEPOT_EXTRA_BITS) <= + (KMSAN_MAX_ORIGIN_DEPTH << 1)); extra_bits = stack_depot_get_extra_bits(id); depth = kmsan_depth_from_eb(extra_bits); diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 3df45c25c1f6..05f2faa54054 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -114,9 +114,7 @@ void kmsan_kfree_large(const void *ptr) kmsan_enter_runtime(); page = virt_to_head_page((void *)ptr); KMSAN_WARN_ON(ptr != page_address(page)); - kmsan_internal_poison_memory((void *)ptr, - page_size(page), - GFP_KERNEL, + kmsan_internal_poison_memory((void *)ptr, page_size(page), GFP_KERNEL, KMSAN_POISON_CHECK | KMSAN_POISON_FREE); kmsan_leave_runtime(); } diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 10f52c085e6c..b14ce3417e65 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -35,8 +35,7 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES); KMSAN_WARN_ON((nstart >= nend) || /* Virtual address 0 is valid on s390. */ - (!IS_ENABLED(CONFIG_S390) && !nstart) || - !nend); + (!IS_ENABLED(CONFIG_S390) && !nstart) || !nend); nstart = ALIGN_DOWN(nstart, PAGE_SIZE); nend = ALIGN(nend, PAGE_SIZE); diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index 1bb505a08415..6d32bfc18d6a 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -207,8 +207,7 @@ void kmsan_free_page(struct page *page, unsigned int order) if (!kmsan_enabled || kmsan_in_runtime()) return; kmsan_enter_runtime(); - kmsan_internal_poison_memory(page_address(page), - page_size(page), + kmsan_internal_poison_memory(page_address(page), page_size(page), GFP_KERNEL, KMSAN_POISON_CHECK | KMSAN_POISON_FREE); kmsan_leave_runtime(); From 8312ab31d362fcb6d68f1f2da4d1e89bc5d3f48c Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 7 May 2025 18:00:09 +0200 Subject: [PATCH 239/285] kmsan: fix usage of kmsan_enter_runtime() in kmsan_vmap_pages_range_noflush() Only enter the runtime to call __vmap_pages_range_noflush(), so that error handling does not skip kmsan_leave_runtime(). 
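The shape of the fix is worth spelling out: the runtime is entered only for the duration of each call that may recurse into KMSAN, and left again before the result is examined, so no early return can exit with the runtime still entered. A minimal compilable sketch of that pattern, using stand-in functions rather than the real KMSAN interfaces:

/* Stand-ins for __vmap_pages_range_noflush() and the
 * kmsan_enter_runtime()/kmsan_leave_runtime() guards. */
static int map_one_range(void) { return 0; /* 0 on success */ }
static void enter_runtime(void) { }
static void leave_runtime(void) { }

static int map_shadow_and_origin(void)
{
        int err;

        enter_runtime();
        err = map_one_range();          /* shadow pages */
        leave_runtime();                /* leave before checking err */
        if (err)
                return err;

        enter_runtime();
        err = map_one_range();          /* origin pages */
        leave_runtime();                /* no path can skip this */
        return err;
}
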
This bug was spotted by CONFIG_WARN_CAPABILITY_ANALYSIS=y Link: https://lkml.kernel.org/r/20250507160012.3311104-2-glider@google.com Signed-off-by: Alexander Potapenko Acked-by: Marco Elver Cc: Bart Van Assche Cc: Kent Overstreet Cc: Dmitriy Vyukov Cc: Ilya Leoshkevich Signed-off-by: Andrew Morton --- mm/kmsan/shadow.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index 6d32bfc18d6a..54f3c3c962f0 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -247,17 +247,19 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end, kmsan_enter_runtime(); mapped = __vmap_pages_range_noflush(shadow_start, shadow_end, prot, s_pages, page_shift); + kmsan_leave_runtime(); if (mapped) { err = mapped; goto ret; } + kmsan_enter_runtime(); mapped = __vmap_pages_range_noflush(origin_start, origin_end, prot, o_pages, page_shift); + kmsan_leave_runtime(); if (mapped) { err = mapped; goto ret; } - kmsan_leave_runtime(); flush_tlb_kernel_range(shadow_start, shadow_end); flush_tlb_kernel_range(origin_start, origin_end); flush_cache_vmap(shadow_start, shadow_end); From ce6a1c978f9c2beb38dbb1793799614255094290 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 7 May 2025 18:00:10 +0200 Subject: [PATCH 240/285] kmsan: drop the declaration of kmsan_save_stack() This function is not defined anywhere. Link: https://lkml.kernel.org/r/20250507160012.3311104-3-glider@google.com Signed-off-by: Alexander Potapenko Acked-by: Marco Elver Cc: Bart van Assche Cc: Dmitriy Vyukov Cc: Ilya Leoshkevich Cc: Kent Overstreet Signed-off-by: Andrew Morton --- mm/kmsan/kmsan.h | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index 29555a8bc315..bc3d1810f352 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -121,7 +121,6 @@ static __always_inline void kmsan_leave_runtime(void) KMSAN_WARN_ON(--ctx->kmsan_in_runtime); } -depot_stack_handle_t kmsan_save_stack(void); depot_stack_handle_t kmsan_save_stack_with_flags(gfp_t flags, unsigned int extra_bits); From e17c1f15b0ccfa4802cedbd464d00ece50a10cf1 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 7 May 2025 18:00:11 +0200 Subject: [PATCH 241/285] kmsan: enter the runtime around kmsan_internal_memmove_metadata() call kmsan_internal_memmove_metadata() transitively calls stack_depot_save() (via kmsan_internal_chain_origin() and kmsan_save_stack_with_flags()), which may allocate memory. Guard it with kmsan_enter_runtime() and kmsan_leave_runtime() to avoid recursion. This bug was spotted by CONFIG_WARN_CAPABILITY_ANALYSIS=y Link: https://lkml.kernel.org/r/20250507160012.3311104-4-glider@google.com Signed-off-by: Alexander Potapenko Acked-by: Marco Elver Cc: Bart Van Assche Cc: Kent Overstreet Cc: Dmitriy Vyukov Cc: Ilya Leoshkevich Signed-off-by: Andrew Morton --- mm/kmsan/hooks.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 05f2faa54054..97de3d6194f0 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -275,8 +275,10 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy, * Don't check anything, just copy the shadow of the copied * bytes. 
*/ + kmsan_enter_runtime(); kmsan_internal_memmove_metadata((void *)to, (void *)from, to_copy - left); + kmsan_leave_runtime(); } user_access_restore(ua_flags); } From b65e4b56e9f406f291656b3e92723952c180fbab Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 7 May 2025 18:00:12 +0200 Subject: [PATCH 242/285] kmsan: rework kmsan_in_runtime() handling in kmsan_report() kmsan_report() calls used to require entering/leaving the runtime around them. To simplify the things, drop this requirement and move calls to kmsan_enter_runtime()/kmsan_leave_runtime() into kmsan_report(). Link: https://lkml.kernel.org/r/20250507160012.3311104-5-glider@google.com Signed-off-by: Alexander Potapenko Cc: Marco Elver Cc: Bart Van Assche Cc: Kent Overstreet Cc: Dmitriy Vyukov Cc: Ilya Leoshkevich Signed-off-by: Andrew Morton --- mm/kmsan/core.c | 8 -------- mm/kmsan/instrumentation.c | 4 ---- mm/kmsan/report.c | 6 +++--- 3 files changed, 3 insertions(+), 15 deletions(-) diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index a97dc90fa6a9..1ea711786c52 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -274,11 +274,9 @@ void kmsan_internal_check_memory(void *addr, size_t size, * bytes before, report them. */ if (cur_origin) { - kmsan_enter_runtime(); kmsan_report(cur_origin, addr, size, cur_off_start, pos - 1, user_addr, reason); - kmsan_leave_runtime(); } cur_origin = 0; cur_off_start = -1; @@ -292,11 +290,9 @@ void kmsan_internal_check_memory(void *addr, size_t size, * poisoned bytes before, report them. */ if (cur_origin) { - kmsan_enter_runtime(); kmsan_report(cur_origin, addr, size, cur_off_start, pos + i - 1, user_addr, reason); - kmsan_leave_runtime(); } cur_origin = 0; cur_off_start = -1; @@ -312,11 +308,9 @@ void kmsan_internal_check_memory(void *addr, size_t size, */ if (cur_origin != new_origin) { if (cur_origin) { - kmsan_enter_runtime(); kmsan_report(cur_origin, addr, size, cur_off_start, pos + i - 1, user_addr, reason); - kmsan_leave_runtime(); } cur_origin = new_origin; cur_off_start = pos + i; @@ -326,10 +320,8 @@ void kmsan_internal_check_memory(void *addr, size_t size, } KMSAN_WARN_ON(pos != size); if (cur_origin) { - kmsan_enter_runtime(); kmsan_report(cur_origin, addr, size, cur_off_start, pos - 1, user_addr, reason); - kmsan_leave_runtime(); } } diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 02a405e55d6c..69f0a57a401c 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -312,13 +312,9 @@ EXPORT_SYMBOL(__msan_unpoison_alloca); void __msan_warning(u32 origin); void __msan_warning(u32 origin) { - if (!kmsan_enabled || kmsan_in_runtime()) - return; - kmsan_enter_runtime(); kmsan_report(origin, /*address*/ NULL, /*size*/ 0, /*off_first*/ 0, /*off_last*/ 0, /*user_addr*/ NULL, REASON_ANY); - kmsan_leave_runtime(); } EXPORT_SYMBOL(__msan_warning); diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c index 94a3303fb65e..d6853ce08954 100644 --- a/mm/kmsan/report.c +++ b/mm/kmsan/report.c @@ -157,14 +157,14 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, unsigned long ua_flags; bool is_uaf; - if (!kmsan_enabled) + if (!kmsan_enabled || kmsan_in_runtime()) return; if (current->kmsan_ctx.depth) return; if (!origin) return; - kmsan_disable_current(); + kmsan_enter_runtime(); ua_flags = user_access_save(); raw_spin_lock(&kmsan_report_lock); pr_err("=====================================================\n"); @@ -217,5 +217,5 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, if (panic_on_kmsan) 
panic("kmsan.panic set ...\n"); user_access_restore(ua_flags); - kmsan_enable_current(); + kmsan_leave_runtime(); } From 5c5f0468d172ddec2e333d738d2a1f85402cf0bc Mon Sep 17 00:00:00 2001 From: Jeongjun Park Date: Fri, 9 May 2025 01:56:20 +0900 Subject: [PATCH 243/285] mm/vmalloc: fix data race in show_numa_info() The following data-race was found in show_numa_info(): ================================================================== BUG: KCSAN: data-race in vmalloc_info_show / vmalloc_info_show read to 0xffff88800971fe30 of 4 bytes by task 8289 on cpu 0: show_numa_info mm/vmalloc.c:4936 [inline] vmalloc_info_show+0x5a8/0x7e0 mm/vmalloc.c:5016 seq_read_iter+0x373/0xb40 fs/seq_file.c:230 proc_reg_read_iter+0x11e/0x170 fs/proc/inode.c:299 .... write to 0xffff88800971fe30 of 4 bytes by task 8287 on cpu 1: show_numa_info mm/vmalloc.c:4934 [inline] vmalloc_info_show+0x38f/0x7e0 mm/vmalloc.c:5016 seq_read_iter+0x373/0xb40 fs/seq_file.c:230 proc_reg_read_iter+0x11e/0x170 fs/proc/inode.c:299 .... value changed: 0x0000008f -> 0x00000000 ================================================================== According to this report,there is a read/write data-race because m->private is accessible to multiple CPUs. To fix this, instead of allocating the heap in proc_vmalloc_init() and passing the heap address to m->private, vmalloc_info_show() should allocate the heap. Link: https://lkml.kernel.org/r/20250508165620.15321-1-aha310510@gmail.com Fixes: 8e1d743f2c26 ("mm: vmalloc: support multiple nodes in vmallocinfo") Signed-off-by: Jeongjun Park Suggested-by: Eric Dumazet Suggested-by: Andrew Morton Reviewed-by: "Uladzislau Rezki (Sony)" Signed-off-by: Andrew Morton --- mm/vmalloc.c | 63 +++++++++++++++++++++++++++++----------------------- 1 file changed, 35 insertions(+), 28 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 701ea4ec8950..49df04e1fbe1 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3109,7 +3109,7 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm) /* * Before removing VM_UNINITIALIZED, * we should make sure that vm has proper values. - * Pair with smp_rmb() in show_numa_info(). + * Pair with smp_rmb() in vread_iter() and vmalloc_info_show(). */ smp_wmb(); vm->flags &= ~VM_UNINITIALIZED; @@ -4939,28 +4939,29 @@ bool vmalloc_dump_obj(void *object) #endif #ifdef CONFIG_PROC_FS -static void show_numa_info(struct seq_file *m, struct vm_struct *v) + +/* + * Print number of pages allocated on each memory node. + * + * This function can only be called if CONFIG_NUMA is enabled + * and VM_UNINITIALIZED bit in v->flags is disabled. 
+ */ +static void show_numa_info(struct seq_file *m, struct vm_struct *v, + unsigned int *counters) { - if (IS_ENABLED(CONFIG_NUMA)) { - unsigned int nr, *counters = m->private; - unsigned int step = 1U << vm_area_page_order(v); + unsigned int nr; + unsigned int step = 1U << vm_area_page_order(v); - if (!counters) - return; + if (!counters) + return; - if (v->flags & VM_UNINITIALIZED) - return; - /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */ - smp_rmb(); + memset(counters, 0, nr_node_ids * sizeof(unsigned int)); - memset(counters, 0, nr_node_ids * sizeof(unsigned int)); - - for (nr = 0; nr < v->nr_pages; nr += step) - counters[page_to_nid(v->pages[nr])] += step; - for_each_node_state(nr, N_HIGH_MEMORY) - if (counters[nr]) - seq_printf(m, " N%u=%u", nr, counters[nr]); - } + for (nr = 0; nr < v->nr_pages; nr += step) + counters[page_to_nid(v->pages[nr])] += step; + for_each_node_state(nr, N_HIGH_MEMORY) + if (counters[nr]) + seq_printf(m, " N%u=%u", nr, counters[nr]); } static void show_purge_info(struct seq_file *m) @@ -4984,6 +4985,10 @@ static int vmalloc_info_show(struct seq_file *m, void *p) struct vmap_node *vn; struct vmap_area *va; struct vm_struct *v; + unsigned int *counters; + + if (IS_ENABLED(CONFIG_NUMA)) + counters = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL); for_each_vmap_node(vn) { spin_lock(&vn->busy.lock); @@ -4998,6 +5003,11 @@ static int vmalloc_info_show(struct seq_file *m, void *p) } v = va->vm; + if (v->flags & VM_UNINITIALIZED) + continue; + + /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */ + smp_rmb(); seq_printf(m, "0x%pK-0x%pK %7ld", v->addr, v->addr + v->size, v->size); @@ -5032,7 +5042,9 @@ static int vmalloc_info_show(struct seq_file *m, void *p) if (is_vmalloc_addr(v->pages)) seq_puts(m, " vpages"); - show_numa_info(m, v); + if (IS_ENABLED(CONFIG_NUMA)) + show_numa_info(m, v, counters); + seq_putc(m, '\n'); } spin_unlock(&vn->busy.lock); @@ -5042,19 +5054,14 @@ static int vmalloc_info_show(struct seq_file *m, void *p) * As a final step, dump "unpurged" areas. */ show_purge_info(m); + if (IS_ENABLED(CONFIG_NUMA)) + kfree(counters); return 0; } static int __init proc_vmalloc_init(void) { - void *priv_data = NULL; - - if (IS_ENABLED(CONFIG_NUMA)) - priv_data = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL); - - proc_create_single_data("vmallocinfo", - 0400, NULL, vmalloc_info_show, priv_data); - + proc_create_single("vmallocinfo", 0400, NULL, vmalloc_info_show); return 0; } module_init(proc_vmalloc_init); From 3f12680913fda8de06c21e836dd5f246fe1684e5 Mon Sep 17 00:00:00 2001 From: Yuquan Wang Date: Thu, 8 May 2025 10:27:19 +0800 Subject: [PATCH 244/285] mm: numa_memblks: introduce numa_add_reserved_memblk acpi_parse_cfmws() currently adds empty CFMWS ranges to numa_meminfo with the expectation that numa_cleanup_meminfo moves them to numa_reserved_meminfo. There is no need for that indirection when it is known in advance that these unpopulated ranges are meant for numa_reserved_meminfo in support of future hotplug / CXL provisioning. Introduce and use numa_add_reserved_memblk() to add the empty CFMWS ranges directly. 
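For illustration only, a simplified userspace model of the two call paths (the real tables and insertion helper live in mm/numa_memblks.c; the types and bounds here are stand-ins):

#include <stdio.h>

struct memblk { int nid; unsigned long long start, end; };
struct meminfo { int nr; struct memblk blk[8]; };

static struct meminfo numa_meminfo;           /* populated memory */
static struct meminfo numa_reserved_meminfo;  /* future hotplug/CXL */

static int add_memblk_to(int nid, unsigned long long start,
                         unsigned long long end, struct meminfo *mi)
{
        if (mi->nr >= 8)
                return -1;
        mi->blk[mi->nr++] = (struct memblk){ nid, start, end };
        return 0;
}

/* Old path: empty CFMWS ranges entered numa_meminfo and relied on a
 * later numa_cleanup_meminfo() pass to move them to the reserved table. */
static int numa_add_memblk(int nid, unsigned long long s, unsigned long long e)
{
        return add_memblk_to(nid, s, e, &numa_meminfo);
}

/* New path: ranges known in advance to be unpopulated go straight in. */
static int numa_add_reserved_memblk(int nid, unsigned long long s,
                                    unsigned long long e)
{
        return add_memblk_to(nid, s, e, &numa_reserved_meminfo);
}

int main(void)
{
        numa_add_memblk(0, 0x0ULL, 0x80000000ULL);
        numa_add_reserved_memblk(1, 0x100000000ULL, 0x200000000ULL);
        printf("populated: %d, reserved: %d\n",
               numa_meminfo.nr, numa_reserved_meminfo.nr);
        return 0;
}
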
Link: https://lkml.kernel.org/r/20250508022719.3941335-1-wangyuquan1236@phytium.com.cn Signed-off-by: Yuquan Wang Reviewed-by: Alison Schofield Cc: Bruno Faccini Cc: Chen Baozi Cc: Dan Williams Cc: David Hildenbrand Cc: Haibo Xu Cc: Huacai Chen Cc: Joanthan Cameron Cc: Len Brown Cc: Mike Rapoport Cc: Robert Richter Signed-off-by: Andrew Morton --- drivers/acpi/numa/srat.c | 2 +- include/linux/numa_memblks.h | 1 + mm/numa_memblks.c | 22 ++++++++++++++++++++++ 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 5d0cbc5c88a0..53816dfab645 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -464,7 +464,7 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header, return -EINVAL; } - if (numa_add_memblk(node, start, end) < 0) { + if (numa_add_reserved_memblk(node, start, end) < 0) { /* CXL driver must handle the NUMA_NO_NODE case */ pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n", node, start, end); diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index dd85613cdd86..991076cba7c5 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -22,6 +22,7 @@ struct numa_meminfo { }; int __init numa_add_memblk(int nodeid, u64 start, u64 end); +int __init numa_add_reserved_memblk(int nid, u64 start, u64 end); void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi); int __init numa_cleanup_meminfo(struct numa_meminfo *mi); diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index ff4054f4334d..541a99c4071a 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -200,6 +200,28 @@ int __init numa_add_memblk(int nid, u64 start, u64 end) return numa_add_memblk_to(nid, start, end, &numa_meminfo); } +/** + * numa_add_reserved_memblk - Add one numa_memblk to numa_reserved_meminfo + * @nid: NUMA node ID of the new memblk + * @start: Start address of the new memblk + * @end: End address of the new memblk + * + * Add a new memblk to the numa_reserved_meminfo. + * + * Usage Case: numa_cleanup_meminfo() reconciles all numa_memblk instances + * against memblock_type information and moves any that intersect reserved + * ranges to numa_reserved_meminfo. However, when that information is known + * ahead of time, we use numa_add_reserved_memblk() to add the numa_memblk + * to numa_reserved_meminfo directly. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int __init numa_add_reserved_memblk(int nid, u64 start, u64 end) +{ + return numa_add_memblk_to(nid, start, end, &numa_reserved_meminfo); +} + /** * numa_cleanup_meminfo - Cleanup a numa_meminfo * @mi: numa_meminfo to clean up From 2616b370323a953c437ed2bf40a277e9deaa3709 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 9 May 2025 17:30:32 +0200 Subject: [PATCH 245/285] selftests/mm: add simple VM_PFNMAP tests based on mmap'ing /dev/mem Let's test some basic functionality using /dev/mem. These tests will implicitly cover some PAT (Page Attribute Handling) handling on x86. These tests will only run when /dev/mem access to the first two pages in physical address space is possible and allowed; otherwise, the tests are skipped. On current x86-64 with PAT inside a VM, all tests pass: TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN pfnmap.madvise_disallowed ... # OK pfnmap.madvise_disallowed ok 1 pfnmap.madvise_disallowed # RUN pfnmap.munmap_split ... # OK pfnmap.munmap_split ok 2 pfnmap.munmap_split # RUN pfnmap.mremap_fixed ... 
# OK pfnmap.mremap_fixed ok 3 pfnmap.mremap_fixed # RUN pfnmap.mremap_shrink ... # OK pfnmap.mremap_shrink ok 4 pfnmap.mremap_shrink # RUN pfnmap.mremap_expand ... # OK pfnmap.mremap_expand ok 5 pfnmap.mremap_expand # RUN pfnmap.fork ... # OK pfnmap.fork ok 6 pfnmap.fork # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 However, we are able to trigger: [ 27.888251] x86/PAT: pfnmap:1790 freeing invalid memtype [mem 0x00000000-0x00000fff] There are probably more things worth testing in the future, such as MAP_PRIVATE handling. But this set of tests is sufficient to cover most of the things we will rework regarding PAT handling. Link: https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Cc: Shuah Khan Cc: Ingo Molnar Cc: Peter Xu Cc: Dev Jain Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/pfnmap.c | 196 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 4 + 4 files changed, 202 insertions(+) create mode 100644 tools/testing/selftests/mm/pfnmap.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 91db34941a14..824266982aa3 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -20,6 +20,7 @@ mremap_test on-fault-limit transhuge-stress pagemap_ioctl +pfnmap *.tmp* protection_keys protection_keys_32 diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ad4d6043a60f..ae6f994d3add 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -84,6 +84,7 @@ TEST_GEN_FILES += mremap_test TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl +TEST_GEN_FILES += pfnmap TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/pfnmap.c b/tools/testing/selftests/mm/pfnmap.c new file mode 100644 index 000000000000..8a9d19b6020c --- /dev/null +++ b/tools/testing/selftests/mm/pfnmap.c @@ -0,0 +1,196 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Basic VM_PFNMAP tests relying on mmap() of '/dev/mem' + * + * Copyright 2025, Red Hat, Inc. + * + * Author(s): David Hildenbrand + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest_harness.h" +#include "vm_util.h" + +static sigjmp_buf sigjmp_buf_env; + +static void signal_handler(int sig) +{ + siglongjmp(sigjmp_buf_env, -EFAULT); +} + +static int test_read_access(char *addr, size_t size, size_t pagesize) +{ + size_t offs; + int ret; + + if (signal(SIGSEGV, signal_handler) == SIG_ERR) + return -EINVAL; + + ret = sigsetjmp(sigjmp_buf_env, 1); + if (!ret) { + for (offs = 0; offs < size; offs += pagesize) + /* Force a read that the compiler cannot optimize out. */ + *((volatile char *)(addr + offs)); + } + if (signal(SIGSEGV, signal_handler) == SIG_ERR) + return -EINVAL; + + return ret; +} + +FIXTURE(pfnmap) +{ + size_t pagesize; + int dev_mem_fd; + char *addr1; + size_t size1; + char *addr2; + size_t size2; +}; + +FIXTURE_SETUP(pfnmap) +{ + self->pagesize = getpagesize(); + + self->dev_mem_fd = open("/dev/mem", O_RDONLY); + if (self->dev_mem_fd < 0) + SKIP(return, "Cannot open '/dev/mem'\n"); + + /* We'll require the first two pages throughout our tests ... 
*/ + self->size1 = self->pagesize * 2; + self->addr1 = mmap(NULL, self->size1, PROT_READ, MAP_SHARED, + self->dev_mem_fd, 0); + if (self->addr1 == MAP_FAILED) + SKIP(return, "Cannot mmap '/dev/mem'\n"); + + /* ... and want to be able to read from them. */ + if (test_read_access(self->addr1, self->size1, self->pagesize)) + SKIP(return, "Cannot read-access mmap'ed '/dev/mem'\n"); + + self->size2 = 0; + self->addr2 = MAP_FAILED; +} + +FIXTURE_TEARDOWN(pfnmap) +{ + if (self->addr2 != MAP_FAILED) + munmap(self->addr2, self->size2); + if (self->addr1 != MAP_FAILED) + munmap(self->addr1, self->size1); + if (self->dev_mem_fd >= 0) + close(self->dev_mem_fd); +} + +TEST_F(pfnmap, madvise_disallowed) +{ + int advices[] = { + MADV_DONTNEED, + MADV_DONTNEED_LOCKED, + MADV_FREE, + MADV_WIPEONFORK, + MADV_COLD, + MADV_PAGEOUT, + MADV_POPULATE_READ, + MADV_POPULATE_WRITE, + }; + int i; + + /* All these advices must be rejected. */ + for (i = 0; i < ARRAY_SIZE(advices); i++) { + EXPECT_LT(madvise(self->addr1, self->pagesize, advices[i]), 0); + EXPECT_EQ(errno, EINVAL); + } +} + +TEST_F(pfnmap, munmap_split) +{ + /* + * Unmap the first page. This munmap() call is not really expected to + * fail, but we might be able to trigger other internal issues. + */ + ASSERT_EQ(munmap(self->addr1, self->pagesize), 0); + + /* + * Remap the first page while the second page is still mapped. This + * makes sure that any PAT tracking on x86 will allow for mmap()'ing + * a page again while some parts of the first mmap() are still + * around. + */ + self->size2 = self->pagesize; + self->addr2 = mmap(NULL, self->pagesize, PROT_READ, MAP_SHARED, + self->dev_mem_fd, 0); + ASSERT_NE(self->addr2, MAP_FAILED); +} + +TEST_F(pfnmap, mremap_fixed) +{ + char *ret; + + /* Reserve a destination area. */ + self->size2 = self->size1; + self->addr2 = mmap(NULL, self->size2, PROT_READ, MAP_ANON | MAP_PRIVATE, + -1, 0); + ASSERT_NE(self->addr2, MAP_FAILED); + + /* mremap() over our destination. */ + ret = mremap(self->addr1, self->size1, self->size2, + MREMAP_FIXED | MREMAP_MAYMOVE, self->addr2); + ASSERT_NE(ret, MAP_FAILED); +} + +TEST_F(pfnmap, mremap_shrink) +{ + char *ret; + + /* Shrinking is expected to work. */ + ret = mremap(self->addr1, self->size1, self->size1 - self->pagesize, 0); + ASSERT_NE(ret, MAP_FAILED); +} + +TEST_F(pfnmap, mremap_expand) +{ + /* + * Growing is not expected to work, and getting it right would + * be challenging. So this test primarily serves as an early warning + * that something that probably should never work suddenly works. + */ + self->size2 = self->size1 + self->pagesize; + self->addr2 = mremap(self->addr1, self->size1, self->size2, MREMAP_MAYMOVE); + ASSERT_EQ(self->addr2, MAP_FAILED); +} + +TEST_F(pfnmap, fork) +{ + pid_t pid; + int ret; + + /* fork() a child and test if the child can access the pages. 
*/ + pid = fork(); + ASSERT_GE(pid, 0); + + if (!pid) { + EXPECT_EQ(test_read_access(self->addr1, self->size1, + self->pagesize), 0); + exit(0); + } + + wait(&ret); + if (WIFEXITED(ret)) + ret = WEXITSTATUS(ret); + else + ret = -EINVAL; + ASSERT_EQ(ret, 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 188b125bf1f6..dddd1dd8af14 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -63,6 +63,8 @@ separated by spaces: test soft dirty page bit semantics - pagemap test pagemap_scan IOCTL +- pfnmap + tests for VM_PFNMAP handling - cow test copy-on-write semantics - thp @@ -472,6 +474,8 @@ fi CATEGORY="pagemap" run_test ./pagemap_ioctl +CATEGORY="pfnmap" run_test ./pfnmap + # COW tests CATEGORY="cow" run_test ./cow From 83b6d498d027002e79c2ce40b5729137500c3170 Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Fri, 9 May 2025 16:35:28 +0800 Subject: [PATCH 246/285] mm: cma: set early_pfn and bitmap as a union in cma_memrange Since early_pfn and bitmap are never used at the same time, they can be defined as a union to reduce the size of the data structure. This change can save 8 * u64 entries per CMA. Link: https://lkml.kernel.org/r/20250509083528.1360952-1-hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Signed-off-by: Andrew Morton --- mm/cma.c | 11 ++++++----- mm/cma.h | 6 ++++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 15632939f20a..ec4b9a401b7d 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -143,13 +143,14 @@ bool cma_validate_zones(struct cma *cma) static void __init cma_activate_area(struct cma *cma) { - unsigned long pfn, end_pfn; + unsigned long pfn, end_pfn, early_pfn[CMA_MAX_RANGES]; int allocrange, r; struct cma_memrange *cmr; unsigned long bitmap_count, count; for (allocrange = 0; allocrange < cma->nranges; allocrange++) { cmr = &cma->ranges[allocrange]; + early_pfn[allocrange] = cmr->early_pfn; cmr->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma, cmr), GFP_KERNEL); if (!cmr->bitmap) @@ -161,13 +162,13 @@ static void __init cma_activate_area(struct cma *cma) for (r = 0; r < cma->nranges; r++) { cmr = &cma->ranges[r]; - if (cmr->early_pfn != cmr->base_pfn) { - count = cmr->early_pfn - cmr->base_pfn; + if (early_pfn[r] != cmr->base_pfn) { + count = early_pfn[r] - cmr->base_pfn; bitmap_count = cma_bitmap_pages_to_bits(cma, count); bitmap_set(cmr->bitmap, 0, bitmap_count); } - for (pfn = cmr->early_pfn; pfn < cmr->base_pfn + cmr->count; + for (pfn = early_pfn[r]; pfn < cmr->base_pfn + cmr->count; pfn += pageblock_nr_pages) init_cma_reserved_pageblock(pfn_to_page(pfn)); } @@ -193,7 +194,7 @@ cleanup: for (r = 0; r < allocrange; r++) { cmr = &cma->ranges[r]; end_pfn = cmr->base_pfn + cmr->count; - for (pfn = cmr->early_pfn; pfn < end_pfn; pfn++) + for (pfn = early_pfn[r]; pfn < end_pfn; pfn++) free_reserved_page(pfn_to_page(pfn)); } } diff --git a/mm/cma.h b/mm/cma.h index 41a3ab0ec3de..c70180c36559 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -25,9 +25,11 @@ struct cma_kobject { */ struct cma_memrange { unsigned long base_pfn; - unsigned long early_pfn; unsigned long count; - unsigned long *bitmap; + union { + unsigned long early_pfn; + unsigned long *bitmap; + }; #ifdef CONFIG_CMA_DEBUGFS struct debugfs_u32_array dfs_bitmap; #endif From 4df65651f7075481e44c03bb39855a38783d87ac Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Fri, 9 May 2025 08:45:21 +0800 Subject: [PATCH 247/285] mm: mincore: use pte_batch_hint() to 
batch process large folios When I tested the mincore() syscall, I observed that it takes longer with 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range() still checks each PTE individually, even when the PTEs are contiguous, which is not efficient. Thus we can use pte_batch_hint() to get the batch number of the present contiguous PTEs, which can improve the performance. I tested the mincore() syscall with 1G anonymous memory populated with 64K mTHP, and observed an obvious performance improvement: w/o patch w/ patch changes 6022us 549us +91% Moreover, I also tested mincore() with disabling mTHP/THP, and did not see any obvious regression for base pages. Link: https://lkml.kernel.org/r/99cb00ee626ceb6e788102ca36821815cd832237.1746697240.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Barry Song Reviewed-by: Dev Jain Acked-by: David Hildenbrand Cc: Dev Jain Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/mincore.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 832f29f46767..42d6c9c8da86 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -21,6 +21,7 @@ #include #include "swap.h" +#include "internal.h" static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t *ptep; unsigned char *vec = walk->private; int nr = (end - addr) >> PAGE_SHIFT; + int step, i; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -118,16 +120,26 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, walk->action = ACTION_AGAIN; return 0; } - for (; addr != end; ptep++, addr += PAGE_SIZE) { + for (; addr != end; ptep += step, addr += step * PAGE_SIZE) { pte_t pte = ptep_get(ptep); + step = 1; /* We need to do cache lookup too for pte markers */ if (pte_none_mostly(pte)) __mincore_unmapped_range(addr, addr + PAGE_SIZE, vma, vec); - else if (pte_present(pte)) - *vec = 1; - else { /* pte is a swap entry */ + else if (pte_present(pte)) { + unsigned int batch = pte_batch_hint(ptep, pte); + + if (batch > 1) { + unsigned int max_nr = (end - addr) >> PAGE_SHIFT; + + step = min_t(unsigned int, batch, max_nr); + } + + for (i = 0; i < step; i++) + vec[i] = 1; + } else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); if (non_swap_entry(entry)) { @@ -146,7 +158,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, #endif } } - vec++; + vec += step; } pte_unmap_unlock(ptep - 1, ptl); out: From ed1a7814036c60ced6198548558342ef75af3ad8 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:14 +0200 Subject: [PATCH 248/285] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() VM_PAT annoyed me too much and wasted too much of my time, let's clean PAT handling up and remove VM_PAT. This should sort out various issues with VM_PAT we discovered recently, and will hopefully make the whole code more stable and easier to maintain. In essence: we stop letting PAT mode mess with VMAs and instead lift what to track/untrack to the MM core. We remember per VMA which pfn range we tracked in a new struct we attach to a VMA (we have space without exceeding 192 bytes), use a kref to share it among VMAs during split/mremap/fork, and automatically untrack once the kref drops to 0. 
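In code terms, the scheme is a small reference-counted object hung off each VMA. The following condensed sketch anticipates the structure and release helper that later patches in this series introduce; pfnmap_untrack() is likewise assumed from a later patch, and the get/put wrapper names are illustrative:

#include <linux/kref.h>
#include <linux/slab.h>

void pfnmap_untrack(unsigned long pfn, unsigned long size);

struct pfnmap_track_ctx {
        struct kref kref;
        unsigned long pfn;
        unsigned long size;     /* in bytes */
};

static void pfnmap_track_ctx_release(struct kref *ref)
{
        struct pfnmap_track_ctx *ctx =
                container_of(ref, struct pfnmap_track_ctx, kref);

        /* Last VMA referencing the range is gone: untrack it now. */
        pfnmap_untrack(ctx->pfn, ctx->size);
        kfree(ctx);
}

/* VMA split/mremap/fork: the new copy shares the existing context. */
static void pfnmap_track_ctx_get(struct pfnmap_track_ctx *ctx)
{
        kref_get(&ctx->kref);
}

/* VMA teardown: drop one reference; untracking happens at zero. */
static void pfnmap_track_ctx_put(struct pfnmap_track_ctx *ctx)
{
        kref_put(&ctx->kref, pfnmap_track_ctx_release);
}
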
This implies that we'll keep tracking a full pfn range even after partially unmapping it, until fully unmapping it; but as that case was mostly broken before, this at least makes it work in a way that is least intrusive to VMA handling. Shrinking with mremap() used to work in a hacky way, now we'll similarly keep the original pfn range tacked even after this form of partial unmap. Does anybody care about that? Unlikely. If we run into issues, we could likely handled that (adjust the tracking) when our kref drops to 1 while freeing a VMA. But it adds more complexity, so avoid that for now. Briefly tested with the new pfnmap selftests [1]. This patch (of 11): Let's factor it out to make the code easier to grasp. Drop one comment where it is now rather obvious what is happening. Use it also in pgprot_writecombine()/pgprot_writethrough() where clearing the old cachemode might not be required, but given that we are already doing a function call, no need to care about this micro-optimization. Link: https://lkml.kernel.org/r/20250512123424.637989-1-david@redhat.com Link: https://lkml.kernel.org/r/20250512123424.637989-2-david@redhat.com Link: https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com [1] Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: David Hildenbrand Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 72d8cbc61158..edec5859651d 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -800,6 +800,12 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size) } #endif /* CONFIG_STRICT_DEVMEM */ +static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm) +{ + *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) | + cachemode2protval(pcm)); +} + int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, unsigned long size, pgprot_t *vma_prot) { @@ -811,8 +817,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, if (file->f_flags & O_DSYNC) pcm = _PAGE_CACHE_MODE_UC_MINUS; - *vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) | - cachemode2protval(pcm)); + pgprot_set_cachemode(vma_prot, pcm); return 1; } @@ -880,9 +885,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot, (unsigned long long)paddr, (unsigned long long)(paddr + size - 1), cattr_name(pcm)); - *vma_prot = __pgprot((pgprot_val(*vma_prot) & - (~_PAGE_CACHE_MASK)) | - cachemode2protval(pcm)); + pgprot_set_cachemode(vma_prot, pcm); } return 0; } @@ -907,9 +910,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot, * We allow returning different type than the one requested in * non strict case. 
*/ - *vma_prot = __pgprot((pgprot_val(*vma_prot) & - (~_PAGE_CACHE_MASK)) | - cachemode2protval(pcm)); + pgprot_set_cachemode(vma_prot, pcm); } if (memtype_kernel_map_sync(paddr, size, pcm) < 0) { @@ -1060,9 +1061,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, return -EINVAL; } - *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) | - cachemode2protval(pcm)); - + pgprot_set_cachemode(prot, pcm); return 0; } @@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn) if (!pat_enabled()) return; - /* Set prot based on lookup */ pcm = lookup_memtype(pfn_t_to_phys(pfn)); - *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) | - cachemode2protval(pcm)); + pgprot_set_cachemode(prot, pcm); } /* @@ -1115,15 +1112,15 @@ void untrack_pfn_clear(struct vm_area_struct *vma) pgprot_t pgprot_writecombine(pgprot_t prot) { - return __pgprot(pgprot_val(prot) | - cachemode2protval(_PAGE_CACHE_MODE_WC)); + pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC); + return prot; } EXPORT_SYMBOL_GPL(pgprot_writecombine); pgprot_t pgprot_writethrough(pgprot_t prot) { - return __pgprot(pgprot_val(prot) | - cachemode2protval(_PAGE_CACHE_MODE_WT)); + pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WT); + return prot; } EXPORT_SYMBOL_GPL(pgprot_writethrough); From e1e1a3ae7f9f0cb06e80af0f24927be63149d081 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:15 +0200 Subject: [PATCH 249/285] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*() ... by factoring it out from track_pfn_remap() into pfnmap_setup_cachemode() and provide pfnmap_setup_cachemode_pfn() as a replacement for track_pfn_insert(). For PMDs/PUDs, we keep checking a single pfn only. Add some documentation, and also document why it is valid to not check the whole pfn range. We'll reuse pfnmap_setup_cachemode() from core MM next. Link: https://lkml.kernel.org/r/20250512123424.637989-3-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. 
Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 24 ++++++------------ include/linux/pgtable.h | 52 +++++++++++++++++++++++++++++++++------ mm/huge_memory.c | 5 ++-- mm/memory.c | 4 +-- 4 files changed, 57 insertions(+), 28 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index edec5859651d..fa78facc6f63 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, unsigned long pfn, unsigned long addr, unsigned long size) { resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; - enum page_cache_mode pcm; /* reserve the whole chunk starting from paddr */ if (!vma || (addr == vma->vm_start @@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, return ret; } + return pfnmap_setup_cachemode(pfn, size, prot); +} + +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot) +{ + resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; + enum page_cache_mode pcm; + if (!pat_enabled()) return 0; - /* - * For anything smaller than the vma size we set prot based on the - * lookup. - */ pcm = lookup_memtype(paddr); /* Check memtype for the remaining pages */ @@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, return 0; } -void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn) -{ - enum page_cache_mode pcm; - - if (!pat_enabled()) - return; - - pcm = lookup_memtype(pfn_t_to_phys(pfn)); - pgprot_set_cachemode(prot, pcm); -} - /* * untrack_pfn is called while unmapping a pfnmap for a region. * untrack can be called for a specific region indicated by pfn and size or diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index f1e890b60460..be1745839871 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1496,13 +1496,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, return 0; } -/* - * track_pfn_insert is called when a _new_ single pfn is established - * by vmf_insert_pfn(). - */ -static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, - pfn_t pfn) +static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, + pgprot_t *prot) { + return 0; } /* @@ -1552,8 +1549,32 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma) extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, unsigned long pfn, unsigned long addr, unsigned long size); -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, - pfn_t pfn); + +/** + * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range + * @pfn: the start of the pfn range + * @size: the size of the pfn range in bytes + * @prot: the pgprot to modify + * + * Lookup the cachemode for the pfn range starting at @pfn with the size + * @size and store it in @prot, leaving other data in @prot unchanged. + * + * This allows for a hardware implementation to have fine-grained control of + * memory cache behavior at page level granularity. Without a hardware + * implementation, this function does nothing. + * + * Currently there is only one implementation for this - x86 Page Attribute + * Table (PAT). 
See Documentation/arch/x86/pat.rst for more details. + * + * This function can fail if the pfn range spans pfns that require differing + * cachemodes. If the pfn range was previously verified to have a single + * cachemode, it is sufficient to query only a single pfn. The assumption is + * that this is the case for drivers using the vmf_insert_pfn*() interface. + * + * Returns 0 on success and -EINVAL on error. + */ +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, + pgprot_t *prot); extern int track_pfn_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long *pfn); extern void untrack_pfn_copy(struct vm_area_struct *dst_vma, @@ -1563,6 +1584,21 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, extern void untrack_pfn_clear(struct vm_area_struct *vma); #endif +/** + * pfnmap_setup_cachemode_pfn - setup the cachemode in the pgprot for a pfn + * @pfn: the pfn + * @prot: the pgprot to modify + * + * Lookup the cachemode for @pfn and store it in @prot, leaving other + * data in @prot unchanged. + * + * See pfnmap_setup_cachemode() for details. + */ +static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot) +{ + pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot); +} + #ifdef CONFIG_MMU #ifdef __HAVE_COLOR_ZERO_PAGE static inline int is_zero_pfn(unsigned long pfn) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2780a12b25f0..d3e66136e41a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1455,7 +1455,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) return VM_FAULT_OOM; } - track_pfn_insert(vma, &pgprot, pfn); + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); + ptl = pmd_lock(vma->vm_mm, vmf->pmd); error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable); @@ -1577,7 +1578,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; - track_pfn_insert(vma, &pgprot, pfn); + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); ptl = pud_lock(vma->vm_mm, vmf->pud); insert_pfn_pud(vma, addr, vmf->pud, pfn, write); diff --git a/mm/memory.c b/mm/memory.c index 99af83434e7c..064fc55d8eab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2564,7 +2564,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, if (!pfn_modify_allowed(pfn, pgprot)) return VM_FAULT_SIGBUS; - track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)); + pfnmap_setup_cachemode_pfn(pfn, &pgprot); return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot, false); @@ -2627,7 +2627,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; - track_pfn_insert(vma, &pgprot, pfn); + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot); if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot)) return VM_FAULT_SIGBUS; From db44863a4d9df3604c4ff76507bb2056b6392e58 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:16 +0200 Subject: [PATCH 250/285] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap Let's provide variants of track_pfn_remap() and untrack_pfn() that won't mess with VMAs, and replace the usage in mm/memremap.c. Add some documentation. Link: https://lkml.kernel.org/r/20250512123424.637989-4-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. 
Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 14 ++++++++++++++ include/linux/pgtable.h | 39 +++++++++++++++++++++++++++++++++++++++ mm/memremap.c | 8 ++++---- 3 files changed, 57 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index fa78facc6f63..1ec8af6cad6b 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -1068,6 +1068,20 @@ int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot return 0; } +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot) +{ + const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; + + return reserve_pfn_range(paddr, size, prot, 0); +} + +void pfnmap_untrack(unsigned long pfn, unsigned long size) +{ + const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; + + free_pfn_range(paddr, size); +} + /* * untrack_pfn is called while unmapping a pfnmap for a region. * untrack can be called for a specific region indicated by pfn and size or diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index be1745839871..90f72cd35839 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1502,6 +1502,16 @@ static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, return 0; } +static inline int pfnmap_track(unsigned long pfn, unsigned long size, + pgprot_t *prot) +{ + return 0; +} + +static inline void pfnmap_untrack(unsigned long pfn, unsigned long size) +{ +} + /* * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page * tables copied during copy_page_range(). Will store the pfn to be @@ -1575,6 +1585,35 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, */ int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot); + +/** + * pfnmap_track - track a pfn range + * @pfn: the start of the pfn range + * @size: the size of the pfn range in bytes + * @prot: the pgprot to track + * + * Requested the pfn range to be 'tracked' by a hardware implementation and + * setup the cachemode in @prot similar to pfnmap_setup_cachemode(). + * + * This allows for fine-grained control of memory cache behaviour at page + * level granularity. Tracking memory this way is persisted across VMA splits + * (VMA merging does not apply for VM_PFNMAP). + * + * Currently, there is only one implementation for this - x86 Page Attribute + * Table (PAT). See Documentation/arch/x86/pat.rst for more details. + * + * Returns 0 on success and -EINVAL on error. + */ +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot); + +/** + * pfnmap_untrack - untrack a pfn range + * @pfn: the start of the pfn range + * @size: the size of the pfn range in bytes + * + * Untrack a pfn range previously tracked through pfnmap_track(). 
+ */ +void pfnmap_untrack(unsigned long pfn, unsigned long size); extern int track_pfn_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long *pfn); extern void untrack_pfn_copy(struct vm_area_struct *dst_vma, diff --git a/mm/memremap.c b/mm/memremap.c index 2aebc1b192da..c417c843e9b1 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id) } mem_hotplug_done(); - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true); + pfnmap_untrack(PHYS_PFN(range->start), range_len(range)); pgmap_array_delete(range); } @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, if (nid < 0) nid = numa_mem_id(); - error = track_pfn_remap(NULL, ¶ms->pgprot, PHYS_PFN(range->start), 0, - range_len(range)); + error = pfnmap_track(PHYS_PFN(range->start), range_len(range), + ¶ms->pgprot); if (error) goto err_pfn_remap; @@ -277,7 +277,7 @@ err_add_memory: if (!is_private) kasan_remove_zero_shadow(__va(range->start), range_len(range)); err_kasan: - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true); + pfnmap_untrack(PHYS_PFN(range->start), range_len(range)); err_pfn_remap: pgmap_array_delete(range); return error; From f8e97613fed25758ddf52159b87e1c66e619a23a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:17 +0200 Subject: [PATCH 251/285] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack() Let's use our new interface. In remap_pfn_range(), we'll now decide whether we have to track (full VMA covered) or only lookup the cachemode (partial VMA covered). Remember what we have to untrack by linking it from the VMA. When duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar to anon VMA names, and use a kref to share the tracking. Once the last VMA un-refs our tracking data, we'll do the untracking, which simplifies things a lot and should sort our various issues we saw recently, for example, when partially unmapping/zapping a tracked VMA. This change implies that we'll keep tracking the original PFN range even after splitting + partially unmapping it: not too bad, because it was not working reliably before. The only thing that kind-of worked before was shrinking such a mapping using mremap(): we managed to adjust the reservation in a hacky way, now we won't adjust the reservation but leave it around until all involved VMAs are gone. If that ever turns out to be an issue, we could hook into VM splitting code and split the tracking; however, that adds complexity that might not be required, so we'll keep it simple for now. Link: https://lkml.kernel.org/r/20250512123424.637989-5-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. 
Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm_inline.h | 2 + include/linux/mm_types.h | 11 ++++++ mm/memory.c | 82 +++++++++++++++++++++++++++++++-------- mm/mmap.c | 5 --- mm/mremap.c | 4 -- mm/vma_init.c | 50 ++++++++++++++++++++++++ 6 files changed, 129 insertions(+), 25 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index f9157a0c42a5..89b518ff097e 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1, #endif /* CONFIG_ANON_VMA_NAME */ +void pfnmap_track_ctx_release(struct kref *ref); + static inline void init_tlb_flush_pending(struct mm_struct *mm) { atomic_set(&mm->tlb_flush_pending, 0); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 15808cad2bc1..3e934dc6057c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -763,6 +763,14 @@ struct vma_numab_state { int prev_scan_seq; }; +#ifdef __HAVE_PFNMAP_TRACKING +struct pfnmap_track_ctx { + struct kref kref; + unsigned long pfn; + unsigned long size; /* in bytes */ +}; +#endif + /* * Describes a VMA that is about to be mmap()'ed. Drivers may choose to * manipulate mutable fields which will cause those fields to be updated in the @@ -900,6 +908,9 @@ struct vm_area_struct { struct anon_vma_name *anon_name; #endif struct vm_userfaultfd_ctx vm_userfaultfd_ctx; +#ifdef __HAVE_PFNMAP_TRACKING + struct pfnmap_track_ctx *pfnmap_track_ctx; +#endif } __randomize_layout; #ifdef CONFIG_NUMA diff --git a/mm/memory.c b/mm/memory.c index 064fc55d8eab..4cf4adb0de26 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; struct mmu_notifier_range range; - unsigned long next, pfn = 0; + unsigned long next; bool is_cow; int ret; @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (is_vm_hugetlb_page(src_vma)) return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma); - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) { - ret = track_pfn_copy(dst_vma, src_vma, &pfn); - if (ret) - return ret; - } - /* * We need to invalidate the secondary MMU mappings only when * there could be a permission downgrade on the ptes of the @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) raw_write_seqcount_end(&src_mm->write_protect_seq); mmu_notifier_invalidate_range_end(&range); } - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP)) - untrack_pfn_copy(dst_vma, pfn); return ret; } @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb, if (vma->vm_file) uprobe_munmap(vma, start, end); - if (unlikely(vma->vm_flags & VM_PFNMAP)) - untrack_pfn(vma, 0, 0, mm_wr_locked); - if (start != end) { if (unlikely(is_vm_hugetlb_page(vma))) { /* @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, return error; } +#ifdef __HAVE_PFNMAP_TRACKING +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn, + unsigned long size, pgprot_t *prot) +{ + struct pfnmap_track_ctx *ctx; + + if (pfnmap_track(pfn, size, 
prot)) + return ERR_PTR(-EINVAL); + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (unlikely(!ctx)) { + pfnmap_untrack(pfn, size); + return ERR_PTR(-ENOMEM); + } + + ctx->pfn = pfn; + ctx->size = size; + kref_init(&ctx->kref); + return ctx; +} + +void pfnmap_track_ctx_release(struct kref *ref) +{ + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref); + + pfnmap_untrack(ctx->pfn, ctx->size); + kfree(ctx); +} +#endif /* __HAVE_PFNMAP_TRACKING */ + /** * remap_pfn_range - remap kernel memory to userspace * @vma: user vma to map to @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, * * Return: %0 on success, negative error code otherwise. */ +#ifdef __HAVE_PFNMAP_TRACKING int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { + struct pfnmap_track_ctx *ctx = NULL; int err; - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); - if (err) + size = PAGE_ALIGN(size); + + /* + * If we cover the full VMA, we'll perform actual tracking, and + * remember to untrack when the last reference to our tracking + * context from a VMA goes away. We'll keep tracking the whole pfn + * range even during VMA splits and partial unmapping. + * + * If we only cover parts of the VMA, we'll only setup the cachemode + * in the pgprot for the pfn range. + */ + if (addr == vma->vm_start && addr + size == vma->vm_end) { + if (vma->pfnmap_track_ctx) + return -EINVAL; + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) { return -EINVAL; + } err = remap_pfn_range_notrack(vma, addr, pfn, size, prot); - if (err) - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true); + if (ctx) { + if (err) + kref_put(&ctx->kref, pfnmap_track_ctx_release); + else + vma->pfnmap_track_ctx = ctx; + } return err; } + +#else +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot) +{ + return remap_pfn_range_notrack(vma, addr, pfn, size, prot); +} +#endif EXPORT_SYMBOL(remap_pfn_range); /** diff --git a/mm/mmap.c b/mm/mmap.c index 50f902c08341..09c563c95112 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) tmp = vm_area_dup(mpnt); if (!tmp) goto fail_nomem; - - /* track_pfn_copy() will later take care of copying internal state. 
*/ - if (unlikely(tmp->vm_flags & VM_PFNMAP)) - untrack_pfn_clear(tmp); - retval = vma_dup_policy(mpnt, tmp); if (retval) goto fail_nomem_policy; diff --git a/mm/mremap.c b/mm/mremap.c index 7db9da609c84..6e78e02f74bd 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, if (is_vm_hugetlb_page(vma)) clear_vma_resv_huge_pages(vma); - /* Tell pfnmap has moved from this vma */ - if (unlikely(vma->vm_flags & VM_PFNMAP)) - untrack_pfn_clear(vma); - *new_vma_ptr = new_vma; return err; } diff --git a/mm/vma_init.c b/mm/vma_init.c index 967ca8517986..8e53c7943561 100644 --- a/mm/vma_init.c +++ b/mm/vma_init.c @@ -71,8 +71,52 @@ static void vm_area_init_from(const struct vm_area_struct *src, #ifdef CONFIG_NUMA dest->vm_policy = src->vm_policy; #endif +#ifdef __HAVE_PFNMAP_TRACKING + dest->pfnmap_track_ctx = NULL; +#endif } +#ifdef __HAVE_PFNMAP_TRACKING +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig, + struct vm_area_struct *new) +{ + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx; + + if (likely(!ctx)) + return 0; + + /* + * We don't expect to ever hit this. If ever required, we would have + * to duplicate the tracking. + */ + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX)) + return -ENOMEM; + kref_get(&ctx->kref); + new->pfnmap_track_ctx = ctx; + return 0; +} + +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma) +{ + struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx; + + if (likely(!ctx)) + return; + + kref_put(&ctx->kref, pfnmap_track_ctx_release); + vma->pfnmap_track_ctx = NULL; +} +#else +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig, + struct vm_area_struct *new) +{ + return 0; +} +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma) +{ +} +#endif + struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) { struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); @@ -83,6 +127,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); ASSERT_EXCLUSIVE_WRITER(orig->vm_file); vm_area_init_from(orig, new); + + if (vma_pfnmap_track_ctx_dup(orig, new)) { + kmem_cache_free(vm_area_cachep, new); + return NULL; + } vma_lock_init(new, true); INIT_LIST_HEAD(&new->anon_vma_chain); vma_numab_state_init(new); @@ -97,5 +146,6 @@ void vm_area_free(struct vm_area_struct *vma) vma_assert_detached(vma); vma_numab_state_free(vma); free_anon_vma_name(vma); + vma_pfnmap_track_ctx_release(vma); kmem_cache_free(vm_area_cachep, vma); } From 7bd7d74ec01954fde9eb65b065eb55bcda4f86e2 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:18 +0200 Subject: [PATCH 252/285] x86/mm/pat: remove old pfnmap tracking interface We can now get rid of the old interface along with get_pat_info() and follow_phys(). Link: https://lkml.kernel.org/r/20250512123424.637989-6-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. 
Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 147 -------------------------------------- include/linux/pgtable.h | 66 ----------------- 2 files changed, 213 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 1ec8af6cad6b..c88d1cbdc1de 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -933,119 +933,6 @@ static void free_pfn_range(u64 paddr, unsigned long size) memtype_free(paddr, paddr + size); } -static int follow_phys(struct vm_area_struct *vma, unsigned long *prot, - resource_size_t *phys) -{ - struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start }; - - if (follow_pfnmap_start(&args)) - return -EINVAL; - - /* Never return PFNs of anon folios in COW mappings. */ - if (!args.special) { - follow_pfnmap_end(&args); - return -EINVAL; - } - - *prot = pgprot_val(args.pgprot); - *phys = (resource_size_t)args.pfn << PAGE_SHIFT; - follow_pfnmap_end(&args); - return 0; -} - -static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr, - pgprot_t *pgprot) -{ - unsigned long prot; - - VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT)); - - /* - * We need the starting PFN and cachemode used for track_pfn_remap() - * that covered the whole VMA. For most mappings, we can obtain that - * information from the page tables. For COW mappings, we might now - * suddenly have anon folios mapped and follow_phys() will fail. - * - * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to - * detect the PFN. If we need the cachemode as well, we're out of luck - * for now and have to fail fork(). - */ - if (!follow_phys(vma, &prot, paddr)) { - if (pgprot) - *pgprot = __pgprot(prot); - return 0; - } - if (is_cow_mapping(vma->vm_flags)) { - if (pgprot) - return -EINVAL; - *paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT; - return 0; - } - WARN_ON_ONCE(1); - return -EINVAL; -} - -int track_pfn_copy(struct vm_area_struct *dst_vma, - struct vm_area_struct *src_vma, unsigned long *pfn) -{ - const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start; - resource_size_t paddr; - pgprot_t pgprot; - int rc; - - if (!(src_vma->vm_flags & VM_PAT)) - return 0; - - /* - * Duplicate the PAT information for the dst VMA based on the src - * VMA. - */ - if (get_pat_info(src_vma, &paddr, &pgprot)) - return -EINVAL; - rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1); - if (rc) - return rc; - - /* Reservation for the destination VMA succeeded. */ - vm_flags_set(dst_vma, VM_PAT); - *pfn = PHYS_PFN(paddr); - return 0; -} - -void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn) -{ - untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true); - /* - * Reservation was freed, any copied page tables will get cleaned - * up later, but without getting PAT involved again. - */ -} - -/* - * prot is passed in as a parameter for the new mapping. If the vma has - * a linear pfn mapping for the entire range, or no vma is provided, - * reserve the entire pfn + size range with single reserve_pfn_range - * call. 
- */ -int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, - unsigned long pfn, unsigned long addr, unsigned long size) -{ - resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; - - /* reserve the whole chunk starting from paddr */ - if (!vma || (addr == vma->vm_start - && size == (vma->vm_end - vma->vm_start))) { - int ret; - - ret = reserve_pfn_range(paddr, size, prot, 0); - if (ret == 0 && vma) - vm_flags_set(vma, VM_PAT); - return ret; - } - - return pfnmap_setup_cachemode(pfn, size, prot); -} - int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot) { resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; @@ -1082,40 +969,6 @@ void pfnmap_untrack(unsigned long pfn, unsigned long size) free_pfn_range(paddr, size); } -/* - * untrack_pfn is called while unmapping a pfnmap for a region. - * untrack can be called for a specific region indicated by pfn and size or - * can be for the entire vma (in which case pfn, size are zero). - */ -void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, - unsigned long size, bool mm_wr_locked) -{ - resource_size_t paddr; - - if (vma && !(vma->vm_flags & VM_PAT)) - return; - - /* free the chunk starting from pfn or the whole chunk */ - paddr = (resource_size_t)pfn << PAGE_SHIFT; - if (!paddr && !size) { - if (get_pat_info(vma, &paddr, NULL)) - return; - size = vma->vm_end - vma->vm_start; - } - free_pfn_range(paddr, size); - if (vma) { - if (mm_wr_locked) - vm_flags_clear(vma, VM_PAT); - else - __vm_flags_mod(vma, 0, VM_PAT); - } -} - -void untrack_pfn_clear(struct vm_area_struct *vma) -{ - vm_flags_clear(vma, VM_PAT); -} - pgprot_t pgprot_writecombine(pgprot_t prot) { pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 90f72cd35839..0b6e1f781d86 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1485,17 +1485,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) * vmf_insert_pfn. */ -/* - * track_pfn_remap is called when a _new_ pfn mapping is being established - * by remap_pfn_range() for physical range indicated by pfn and size. - */ -static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, - unsigned long pfn, unsigned long addr, - unsigned long size) -{ - return 0; -} - static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot) { @@ -1511,55 +1500,7 @@ static inline int pfnmap_track(unsigned long pfn, unsigned long size, static inline void pfnmap_untrack(unsigned long pfn, unsigned long size) { } - -/* - * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page - * tables copied during copy_page_range(). Will store the pfn to be - * passed to untrack_pfn_copy() only if there is something to be untracked. - * Callers should initialize the pfn to 0. - */ -static inline int track_pfn_copy(struct vm_area_struct *dst_vma, - struct vm_area_struct *src_vma, unsigned long *pfn) -{ - return 0; -} - -/* - * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during - * copy_page_range(), but after track_pfn_copy() was already called. Can - * be called even if track_pfn_copy() did not actually track anything: - * handled internally. - */ -static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma, - unsigned long pfn) -{ -} - -/* - * untrack_pfn is called while unmapping a pfnmap for a region. 
- * untrack can be called for a specific region indicated by pfn and size or - * can be for the entire vma (in which case pfn, size are zero). - */ -static inline void untrack_pfn(struct vm_area_struct *vma, - unsigned long pfn, unsigned long size, - bool mm_wr_locked) -{ -} - -/* - * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA: - * - * 1) During mremap() on the src VMA after the page tables were moved. - * 2) During fork() on the dst VMA, immediately after duplicating the src VMA. - */ -static inline void untrack_pfn_clear(struct vm_area_struct *vma) -{ -} #else -extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, - unsigned long pfn, unsigned long addr, - unsigned long size); - /** * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range * @pfn: the start of the pfn range @@ -1614,13 +1555,6 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot); * Untrack a pfn range previously tracked through pfnmap_track(). */ void pfnmap_untrack(unsigned long pfn, unsigned long size); -extern int track_pfn_copy(struct vm_area_struct *dst_vma, - struct vm_area_struct *src_vma, unsigned long *pfn); -extern void untrack_pfn_copy(struct vm_area_struct *dst_vma, - unsigned long pfn); -extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, - unsigned long size, bool mm_wr_locked); -extern void untrack_pfn_clear(struct vm_area_struct *vma); #endif /** From cba4dbeb7bfcf8b69d491288c3bf877ead883214 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:19 +0200 Subject: [PATCH 253/285] mm: remove VM_PAT It's unused, so let's remove it. Link: https://lkml.kernel.org/r/20250512123424.637989-7-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. 
Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm.h | 4 +--- include/trace/events/mmflags.h | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 43748c8f3454..a916ea42cfd5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -357,9 +357,7 @@ extern unsigned int kobjsize(const void *objp); # define VM_SHADOW_STACK VM_NONE #endif -#if defined(CONFIG_X86) -# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ -#elif defined(CONFIG_PPC64) +#if defined(CONFIG_PPC64) # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */ #elif defined(CONFIG_PARISC) # define VM_GROWSUP VM_ARCH_1 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 15aae955a10b..aa441f593e9a 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3) __def_pageflag_names \ ) : "none" -#if defined(CONFIG_X86) -#define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" } -#elif defined(CONFIG_PPC64) +#if defined(CONFIG_PPC64) #define __VM_ARCH_SPECIFIC_1 {VM_SAO, "sao" } #elif defined(CONFIG_PARISC) #define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP, "growsup" } From b3662fb91b98fc9503589ea98634fd48eee19426 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:20 +0200 Subject: [PATCH 254/285] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() Always set to 0, so let's remove it. Link: https://lkml.kernel.org/r/20250512123424.637989-8-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index c88d1cbdc1de..ccc55c00b4c8 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -858,8 +858,7 @@ int memtype_kernel_map_sync(u64 base, unsigned long size, * Reserved non RAM regions only and after successful memtype_reserve, * this func also keeps identity mapping (if any) in sync with this new prot. 
*/ -static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot, - int strict_prot) +static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot) { int is_ram = 0; int ret; @@ -895,8 +894,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot, return ret; if (pcm != want_pcm) { - if (strict_prot || - !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) { + if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) { memtype_free(paddr, paddr + size); pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n", current->comm, current->pid, @@ -906,10 +904,6 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot, cattr_name(pcm)); return -EINVAL; } - /* - * We allow returning different type than the one requested in - * non strict case. - */ pgprot_set_cachemode(vma_prot, pcm); } @@ -959,7 +953,7 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot) { const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; - return reserve_pfn_range(paddr, size, prot, 0); + return reserve_pfn_range(paddr, size, prot); } void pfnmap_untrack(unsigned long pfn, unsigned long size) From 81baf8450165ac57cca52f73b541a1056a2ca676 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:21 +0200 Subject: [PATCH 255/285] x86/mm/pat: remove MEMTYPE_*_MATCH The "memramp() shrinking" scenario no longer applies, so let's remove that now-unnecessary handling. Link: https://lkml.kernel.org/r/20250512123424.637989-9-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype_interval.c | 44 ++++-------------------------- 1 file changed, 6 insertions(+), 38 deletions(-) diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c index 645613d59942..9d03f0dbc471 100644 --- a/arch/x86/mm/pat/memtype_interval.c +++ b/arch/x86/mm/pat/memtype_interval.c @@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end, static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED; -enum { - MEMTYPE_EXACT_MATCH = 0, - MEMTYPE_END_MATCH = 1 -}; - -static struct memtype *memtype_match(u64 start, u64 end, int match_type) +static struct memtype *memtype_match(u64 start, u64 end) { struct memtype *entry_match; entry_match = interval_iter_first(&memtype_rbroot, start, end-1); while (entry_match != NULL && entry_match->start < end) { - if ((match_type == MEMTYPE_EXACT_MATCH) && - (entry_match->start == start) && (entry_match->end == end)) + if (entry_match->start == start && entry_match->end == end) return entry_match; - - if ((match_type == MEMTYPE_END_MATCH) && - (entry_match->start < start) && (entry_match->end == end)) - return entry_match; - entry_match = interval_iter_next(entry_match, start, end-1); } @@ -132,32 +121,11 @@ struct memtype *memtype_erase(u64 start, u64 end) { struct memtype *entry_old; - /* - * Since the memtype_rbroot tree allows overlapping ranges, - * memtype_erase() checks with EXACT_MATCH first, i.e. free - * a whole node for the munmap case. 
If no such entry is found, - * it then checks with END_MATCH, i.e. shrink the size of a node - * from the end for the mremap case. - */ - entry_old = memtype_match(start, end, MEMTYPE_EXACT_MATCH); - if (!entry_old) { - entry_old = memtype_match(start, end, MEMTYPE_END_MATCH); - if (!entry_old) - return ERR_PTR(-EINVAL); - } - - if (entry_old->start == start) { - /* munmap: erase this node */ - interval_remove(entry_old, &memtype_rbroot); - } else { - /* mremap: update the end value of this node */ - interval_remove(entry_old, &memtype_rbroot); - entry_old->end = start; - interval_insert(entry_old, &memtype_rbroot); - - return NULL; - } + entry_old = memtype_match(start, end); + if (!entry_old) + return ERR_PTR(-EINVAL); + interval_remove(entry_old, &memtype_rbroot); return entry_old; } From 99e27b047c4ce17feed4bb5c5cb2f1521470afcd Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:22 +0200 Subject: [PATCH 256/285] x86/mm/pat: inline memtype_match() into memtype_erase() Let's just have it in a single function. The resulting function is certainly small enough and readable. Link: https://lkml.kernel.org/r/20250512123424.637989-10-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype_interval.c | 31 +++++++++--------------------- 1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c index 9d03f0dbc471..e5844ed1311e 100644 --- a/arch/x86/mm/pat/memtype_interval.c +++ b/arch/x86/mm/pat/memtype_interval.c @@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end, static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED; -static struct memtype *memtype_match(u64 start, u64 end) -{ - struct memtype *entry_match; - - entry_match = interval_iter_first(&memtype_rbroot, start, end-1); - - while (entry_match != NULL && entry_match->start < end) { - if (entry_match->start == start && entry_match->end == end) - return entry_match; - entry_match = interval_iter_next(entry_match, start, end-1); - } - - return NULL; /* Returns NULL if there is no match */ -} - static int memtype_check_conflict(u64 start, u64 end, enum page_cache_mode reqtype, enum page_cache_mode *newtype) @@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty struct memtype *memtype_erase(u64 start, u64 end) { - struct memtype *entry_old; + struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1); - entry_old = memtype_match(start, end); - if (!entry_old) - return ERR_PTR(-EINVAL); - - interval_remove(entry_old, &memtype_rbroot); - return entry_old; + while (entry && entry->start < end) { + if (entry->start == start && entry->end == end) { + interval_remove(entry, &memtype_rbroot); + return entry; + } + entry = interval_iter_next(entry, start, end - 1); + } + return ERR_PTR(-EINVAL); } struct memtype *memtype_lookup(u64 addr) From 11c82e71817779fc002d47c5beb6ad9df444289a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:23 +0200 Subject: [PATCH 257/285] drm/i915: track_pfn() -> "pfnmap 
tracking" track_pfn() does not exist, let's simply refer to it as "pfnmap tracking". Link: https://lkml.kernel.org/r/20250512123424.637989-11-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- drivers/gpu/drm/i915/i915_mm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c index 76e2801619f0..c33bd3d83069 100644 --- a/drivers/gpu/drm/i915/i915_mm.c +++ b/drivers/gpu/drm/i915/i915_mm.c @@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma, GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS); - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */ r.mm = vma->vm_mm; r.pfn = pfn; r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | @@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma, }; int err; - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */ GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS); while (offset >= r.sgt.max >> PAGE_SHIFT) { From a624c424d5d3778863ac67f87a374c22b1520c27 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 12 May 2025 14:34:24 +0200 Subject: [PATCH 258/285] mm/io-mapping: track_pfn() -> "pfnmap tracking" track_pfn() does not exist, let's simply refer to it as "pfnmap tracking". Link: https://lkml.kernel.org/r/20250512123424.637989-12-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar [x86 bits] Reviewed-by: Liam R. Howlett Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Airlie Cc: "H. Peter Anvin" Cc: Jani Nikula Cc: Jann Horn Cc: Jonas Lahtinen Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Peter Xu Cc: Peter Zijlstra Cc: Rodrigo Vivi Cc: Steven Rostedt Cc: Thomas Gleinxer Cc: Tvrtko Ursulin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/io-mapping.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/io-mapping.c b/mm/io-mapping.c index f44a6a134712..d3586e95c12c 100644 --- a/mm/io-mapping.c +++ b/mm/io-mapping.c @@ -24,7 +24,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma, pgprot_t remap_prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)); - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */ return remap_pfn_range_notrack(vma, addr, pfn, size, remap_prot); } EXPORT_SYMBOL_GPL(io_mapping_map_user); From 5053383829ab2b17dc4766a832004c848f4af9df Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 May 2025 10:57:11 +0800 Subject: [PATCH 259/285] mm: khugepaged: convert set_huge_pmd() to take a folio We've already gotten the stable locked folio in collapse_pte_mapped_thp(), so just use folio for set_huge_pmd() to set the PMD entry, which is more straightforward. 
Moreover, we will check the folio size in do_set_pmd(), so we can remove the unnecessary VM_BUG_ON() in set_huge_pmd(). While we are at it, we can also remove the PageTransHuge(), as it currently has no callers. Link: https://lkml.kernel.org/r/110c3e1ec5fe7854a0e2c95ffcbc985817180ed7.1747017104.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Acked-by: David Hildenbrand Cc: Dev Jain Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 15 --------------- mm/khugepaged.c | 11 +++++------ 2 files changed, 5 insertions(+), 21 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 37b11f15dbd9..1c1d49554c71 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -907,20 +907,6 @@ FOLIO_FLAG_FALSE(partially_mapped) #define PG_head_mask ((1UL << PG_head)) #ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* - * PageHuge() only returns true for hugetlbfs pages, but not for - * normal or transparent huge pages. - * - * PageTransHuge() returns true for both transparent huge and - * hugetlbfs pages, but not normal pages. PageTransHuge() can only be - * called only in the core VM paths where hugetlbfs pages can't exist. - */ -static inline int PageTransHuge(const struct page *page) -{ - VM_BUG_ON_PAGE(PageTail(page), page); - return PageHead(page); -} - /* * PageTransCompound returns true for both transparent huge pages * and hugetlbfs pages, so it should only be called when it's known @@ -931,7 +917,6 @@ static inline int PageTransCompound(const struct page *page) return PageCompound(page); } #else -TESTPAGEFLAG_FALSE(TransHuge, transhuge) TESTPAGEFLAG_FALSE(TransCompound, transcompound) #endif diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b04b6a770afe..33daea8f667e 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1465,9 +1465,9 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) } #ifdef CONFIG_SHMEM -/* hpage must be locked, and mmap_lock must be held */ +/* folio must be locked, and mmap_lock must be held */ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmdp, struct page *hpage) + pmd_t *pmdp, struct folio *folio, struct page *page) { struct vm_fault vmf = { .vma = vma, @@ -1476,13 +1476,12 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, .pmd = pmdp, }; - VM_BUG_ON(!PageTransHuge(hpage)); mmap_assert_locked(vma->vm_mm); - if (do_set_pmd(&vmf, hpage)) + if (do_set_pmd(&vmf, page)) return SCAN_FAIL; - get_page(hpage); + folio_get(folio); return SCAN_SUCCEED; } @@ -1689,7 +1688,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, maybe_install_pmd: /* step 5: install pmd entry */ result = install_pmd - ? set_huge_pmd(vma, haddr, pmd, &folio->page) + ? set_huge_pmd(vma, haddr, pmd, folio, &folio->page) : SCAN_SUCCEED; goto drop_folio; abort: From 698c0089cdf0b27d37f0b24824b682c23de5f72d Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 May 2025 10:57:12 +0800 Subject: [PATCH 260/285] mm: convert do_set_pmd() to take a folio In do_set_pmd(), we always use the folio->page to build PMD mappings for the entire folio. Since all callers of do_set_pmd() already hold a stable folio, converting do_set_pmd() to take a folio is safe and more straightforward. 
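As a rough sketch of the resulting calling convention (mirroring the mm/filemap.c hunk below; the caller already holds a stable, locked folio):

	struct page *page = folio_file_page(folio, start);
	vm_fault_t ret = do_set_pmd(vmf, folio, page);

	if (!ret) {
		/* The page is mapped successfully, reference consumed. */
		folio_unlock(folio);
	}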
In addition, to ensure the extensibility of do_set_pmd() for supporting larger folios beyond PMD size, we keep the 'page' parameter to specify which page within the folio should be mapped. No functional changes expected. Link: https://lkml.kernel.org/r/9b488f4ecb4d3fd8634e3d448dd0ed6964482480.1747017104.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Zi Yan Acked-by: David Hildenbrand Cc: Dev Jain Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 +- mm/filemap.c | 2 +- mm/khugepaged.c | 2 +- mm/memory.c | 11 +++++------ 4 files changed, 8 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a916ea42cfd5..cd2e513189d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1235,7 +1235,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page); void set_pte_range(struct vm_fault *vmf, struct folio *folio, struct page *page, unsigned int nr, unsigned long addr); diff --git a/mm/filemap.c b/mm/filemap.c index 7b90cbeb4a1a..09d005848f0d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3533,7 +3533,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio, if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) { struct page *page = folio_file_page(folio, start); - vm_fault_t ret = do_set_pmd(vmf, page); + vm_fault_t ret = do_set_pmd(vmf, folio, page); if (!ret) { /* The page is mapped successfully, reference consumed. 
*/ folio_unlock(folio); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 33daea8f667e..ebcd7c8a4b44 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1478,7 +1478,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, mmap_assert_locked(vma->vm_mm); - if (do_set_pmd(&vmf, page)) + if (do_set_pmd(&vmf, folio, page)) return SCAN_FAIL; folio_get(folio); diff --git a/mm/memory.c b/mm/memory.c index 4cf4adb0de26..5cb48f262ab0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5227,9 +5227,8 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page) { - struct folio *folio = page_folio(page); struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; @@ -5302,7 +5301,7 @@ out: return ret; } #else -vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page) { return VM_FAULT_FALLBACK; } @@ -5396,6 +5395,7 @@ fallback: else page = vmf->page; + folio = page_folio(page); /* * check even for read faults because we might have lost our CoWed * page @@ -5407,8 +5407,8 @@ fallback: } if (pmd_none(*vmf->pmd)) { - if (PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); + if (folio_test_pmd_mappable(folio)) { + ret = do_set_pmd(vmf, folio, page); if (ret != VM_FAULT_FALLBACK) return ret; } @@ -5419,7 +5419,6 @@ fallback: return VM_FAULT_OOM; } - folio = page_folio(page); nr_pages = folio_nr_pages(folio); /* From cc0535acd1b439cfc854dfd27618b6f790497661 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 13 May 2025 15:57:06 +0100 Subject: [PATCH 261/285] MAINTAINERS: add kernel/fork.c to relevant sections Currently kernel/fork.c both contains absolutely key logic relating to a number of kernel subsystems and also has absolutely no assignment in MAINTAINERS. Correct this by placing this file in relevant sections - mm core, exec and the scheduler so people know who to contact when making changes here. scripts/get_maintainers.pl can perfectly well handle a file being in multiple sections, so this functions correctly. Intent is that we keep putting changes to kernel/fork.c through Andrew's tree. Link: https://lkml.kernel.org/r/20250513145706.122101-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Peter Zijlstra (Intel) Acked-by: Michal Hocko Acked-by: Vlastimil Babka Reviewed-by: Kees Cook Acked-by: David Hildenbrand Acked-by: Liam R. 
Howlett Cc: Ben Segall Cc: Dietmar Eggemann Cc: Ingo Molnar Cc: Juri Lelli Cc: Mel Gorman Cc: Mike Rapoport Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Valentin Schneider Cc: Vincent Guittot Signed-off-by: Andrew Morton --- MAINTAINERS | 3 +++ 1 file changed, 3 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index ab6b08dbc779..378678192a66 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8840,6 +8840,7 @@ F: include/linux/elf.h F: include/uapi/linux/auxvec.h F: include/uapi/linux/binfmts.h F: include/uapi/linux/elf.h +F: kernel/fork.c F: mm/vma_exec.c F: tools/testing/selftests/exec/ N: asm/elf.h @@ -15540,6 +15541,7 @@ F: include/linux/mm.h F: include/linux/mm_*.h F: include/linux/mmdebug.h F: include/linux/pagewalk.h +F: kernel/fork.c F: mm/Kconfig F: mm/debug.c F: mm/init-mm.c @@ -21770,6 +21772,7 @@ F: include/linux/preempt.h F: include/linux/sched.h F: include/linux/wait.h F: include/uapi/linux/sched.h +F: kernel/fork.c F: kernel/sched/ SCHEDULER - SCHED_EXT From 6669d1aaa0c45a50a4cc5f1756ab03578eaebd18 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Wed, 14 May 2025 09:40:24 +0100 Subject: [PATCH 262/285] mm: remove WARN_ON_ONCE() in file_has_valid_mmap_hooks() Having encountered a trinity report in linux-next (Linked in the 'Closes' tag) it appears that there are legitimate situations where a file-backed mapping can be acquired but no file->f_op->mmap or file->f_op->mmap_prepare is set, at which point do_mmap() should simply error out with -ENODEV. Since previously we did not warn in this scenario and it appears we rely upon this, restore this situation, while retaining a WARN_ON_ONCE() for the case where both are set, which is absolutely incorrect and must be addressed and thus always requires a warning. If further work is required to chase down precisely what is causing this, then we can later restore this, but it makes no sense to hold up this series to do so, as this is existing and apparently expected behaviour. Link: https://lkml.kernel.org/r/20250514084024.29148-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback") Signed-off-by: Lorenzo Stoakes Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202505141434.96ce5e5d-lkp@intel.com Reviewed-by: Vlastimil Babka Reviewed-by: Pedro Falcato Acked-by: David Hildenbrand Cc: Al Viro Cc: Christian Brauner Cc: Jan Kara Cc: Jann Horn Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/fs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index e2721a1ff13d..09c8495dacdb 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2248,7 +2248,7 @@ static inline bool file_has_valid_mmap_hooks(struct file *file) /* Hooks are mutually exclusive. */ if (WARN_ON_ONCE(has_mmap && has_mmap_prepare)) return false; - if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare)) + if (!has_mmap && !has_mmap_prepare) return false; return true; From 5fc4b770fc35c05874dd2dd5bf1f03d354025b62 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Sun, 18 May 2025 15:04:42 +0100 Subject: [PATCH 263/285] selftests/mm: deduplicate second mmap() of 5*PAGE_SIZE at base The map_fixed_noreplace test does two blocks of test starting from a mapping of 5 pages at the base address, logging a test result for each initial mapping. 
These are logged with the same test name, causing test automation software to see two reports for the same test in a single run. Tweak the log message for the second one to deduplicate. Link: https://lkml.kernel.org/r/20250518-selftests-mm-map-fixed-noreplace-dup-v1-1-1a11a62c5e9f@kernel.org Signed-off-by: Mark Brown Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/map_fixed_noreplace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/map_fixed_noreplace.c b/tools/testing/selftests/mm/map_fixed_noreplace.c index d53de2486080..1e9980b8993c 100644 --- a/tools/testing/selftests/mm/map_fixed_noreplace.c +++ b/tools/testing/selftests/mm/map_fixed_noreplace.c @@ -96,7 +96,7 @@ int main(void) ksft_exit_fail_msg("Error:1: mmap() succeeded when it shouldn't have\n"); } ksft_print_msg("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p); - ksft_test_result_pass("mmap() 5*PAGE_SIZE at base\n"); + ksft_test_result_pass("Second mmap() 5*PAGE_SIZE at base\n"); /* * Second mapping contained within first: From 2aad4edf6e1018b28b7000faec56b7b6e585c8e1 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Fri, 16 May 2025 17:34:46 -0700 Subject: [PATCH 264/285] mm: rename try_alloc_pages() to alloc_pages_nolock() The "try_" prefix is confusing, since it made people believe that try_alloc_pages() is analogous to spin_trylock() and NULL return means EAGAIN. This is not the case. If it returns NULL there is no reason to call it again. It will most likely return NULL again. Hence rename it to alloc_pages_nolock() to make it symmetrical to free_pages_nolock() and document that NULL means ENOMEM. Link: https://lkml.kernel.org/r/20250517003446.60260-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov Acked-by: Vlastimil Babka Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Harry Yoo Cc: Andrii Nakryiko Cc: Kumar Kartikeya Dwivedi Cc: Michal Hocko Cc: Peter Zijlstra Cc: Sebastian Andrzej Siewior Cc: Steven Rostedt Signed-off-by: Andrew Morton --- include/linux/gfp.h | 8 ++++---- kernel/bpf/syscall.c | 2 +- mm/page_alloc.c | 15 ++++++++------- mm/page_owner.c | 2 +- 4 files changed, 14 insertions(+), 13 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c9fa6309c903..be160e8d8bcb 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -45,13 +45,13 @@ static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags) * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed. * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd. * All GFP_* flags including GFP_NOWAIT use one or both flags. - * try_alloc_pages() is the only API that doesn't specify either flag. + * alloc_pages_nolock() is the only API that doesn't specify either flag. * * This is stronger than GFP_NOWAIT or GFP_ATOMIC because * those are guaranteed to never block on a sleeping lock. * Here we are enforcing that the allocation doesn't ever spin * on any locks (i.e. only trylocks). There is no high level - * GFP_$FOO flag for this use in try_alloc_pages() as the + * GFP_$FOO flag for this use in alloc_pages_nolock() as the * regular page allocator doesn't fully support this * allocation mode. */ @@ -354,8 +354,8 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp, } #define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__)) -struct page *try_alloc_pages_noprof(int nid, unsigned int order); -#define try_alloc_pages(...) 
alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__)) +struct page *alloc_pages_nolock_noprof(int nid, unsigned int order); +#define alloc_pages_nolock(...) alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__)) extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order); #define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__)) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 64c3393e8270..9cdb4f22640f 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -578,7 +578,7 @@ static bool can_alloc_pages(void) static struct page *__bpf_alloc_page(int nid) { if (!can_alloc_pages()) - return try_alloc_pages(nid, 0); + return alloc_pages_nolock(nid, 0); return alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4ee55edf1ad7..dbafa7c69a6a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5070,7 +5070,7 @@ EXPORT_SYMBOL(__free_pages); /* * Can be called while holding raw_spin_lock or from IRQ and NMI for any - * page type (not only those that came from try_alloc_pages) + * page type (not only those that came from alloc_pages_nolock) */ void free_pages_nolock(struct page *page, unsigned int order) { @@ -7311,20 +7311,21 @@ static bool __free_unaccepted(struct page *page) #endif /* CONFIG_UNACCEPTED_MEMORY */ /** - * try_alloc_pages - opportunistic reentrant allocation from any context + * alloc_pages_nolock - opportunistic reentrant allocation from any context * @nid: node to allocate from * @order: allocation order size * * Allocates pages of a given order from the given node. This is safe to * call from any context (from atomic, NMI, and also reentrant - * allocator -> tracepoint -> try_alloc_pages_noprof). + * allocator -> tracepoint -> alloc_pages_nolock_noprof). * Allocation is best effort and to be expected to fail easily so nobody should * rely on the success. Failures are not reported via warn_alloc(). * See always fail conditions below. * - * Return: allocated page or NULL on failure. + * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. + * It means ENOMEM. There is no reason to call it again and expect !NULL. */ -struct page *try_alloc_pages_noprof(int nid, unsigned int order) +struct page *alloc_pages_nolock_noprof(int nid, unsigned int order) { /* * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed. @@ -7333,7 +7334,7 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order) * * These two are the conditions for gfpflags_allow_spinning() being true. * - * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason + * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason * to warn. Also warn would trigger printk() which is unsafe from * various contexts. We cannot use printk_deferred_enter() to mitigate, * since the running context is unknown. @@ -7343,7 +7344,7 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order) * BPF use cases. * * Though __GFP_NOMEMALLOC is not checked in the code path below, - * specify it here to highlight that try_alloc_pages() + * specify it here to highlight that alloc_pages_nolock() * doesn't want to deplete reserves. 
*/ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC diff --git a/mm/page_owner.c b/mm/page_owner.c index cc4a6916eec6..9928c9ac8c31 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -302,7 +302,7 @@ void __reset_page_owner(struct page *page, unsigned short order) /* * Do not specify GFP_NOWAIT to make gfpflags_allow_spinning() == false * to prevent issues in stack_depot_save(). - * This is similar to try_alloc_pages() gfp flags, but only used + * This is similar to alloc_pages_nolock() gfp flags, but only used * to signal stack_depot to avoid spin_locks. */ handle = save_stack(__GFP_NOWARN); From 591c4c78be06360fe31b31401564bdb2d2b79995 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:10 -0700 Subject: [PATCH 265/285] mm/damon/core: warn and fix nr_accesses[_bp] corruption Patch series "mm/damon: minor fixups and improvements for code, tests, and documents". Yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. This patch (of 6): For a bug such as double aggregation reset[1], ->nr_accesses and/or ->nr_accesses_bp of damon_region could be corrupted. Such corruption can make monitoring results pretty inaccurate, so the root causing bug should be investigated. Meanwhile, the corruption itself can easily be fixed but silently fixing it will hide the bug. Fix the corruption as soon as found, but WARN_ONCE() so that we can be aware of the existence of the bug while keeping the system running in a more sane way. Link: https://lkml.kernel.org/r/20250513002715.40126-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250513002715.40126-2-sj@kernel.org Link: https://lore.kernel.org/20250302214145.356806-1-sj@kernel.org [1] Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- mm/damon/core.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/mm/damon/core.c b/mm/damon/core.c index 587fb9a4fef8..0bb71e2ab713 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1391,6 +1391,19 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control) return 0; } +/* + * Warn and fix corrupted ->nr_accesses[_bp] for investigations and preventing + * the problem being propagated. + */ +static void damon_warn_fix_nr_accesses_corruption(struct damon_region *r) +{ + if (r->nr_accesses_bp == r->nr_accesses * 10000) + return; + WARN_ONCE(true, "invalid nr_accesses_bp at reset: %u %u\n", + r->nr_accesses_bp, r->nr_accesses); + r->nr_accesses_bp = r->nr_accesses * 10000; +} + /* * Reset the aggregated monitoring results ('nr_accesses' of each region). */ @@ -1404,6 +1417,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c) damon_for_each_region(r, t) { trace_damon_aggregated(ti, r, damon_nr_regions(t)); + damon_warn_fix_nr_accesses_corruption(r); r->last_nr_accesses = r->nr_accesses; r->nr_accesses = 0; } From 0bac6b1a1111977c769e1933e0b0fc24e3647d97 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:11 -0700 Subject: [PATCH 266/285] mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs A comment on damos_sysfs_quota_goal_metric_strs is simply wrong, due to a copy-and-paste error. Fix it. 
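For context, the array is indexed by the quota goal metric, so its entries must line up with that enum, roughly (a sketch; only the first two metrics are shown):

	enum damos_quota_goal_metric {
		DAMOS_QUOTA_USER_INPUT,		/* "user_input" */
		DAMOS_QUOTA_SOME_MEM_PSI_US,	/* "some_mem_psi_us" */
		...
	};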
Link: https://lkml.kernel.org/r/20250513002715.40126-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- mm/damon/sysfs-schemes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index c2b8a9cb44ec..0f6c9e1fec0b 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -940,7 +940,7 @@ struct damos_sysfs_quota_goal { int nid; }; -/* This should match with enum damos_action */ +/* This should match with enum damos_quota_goal_metric */ static const char * const damos_sysfs_quota_goal_metric_strs[] = { "user_input", "some_mem_psi_us", From a82cf30010665a3602cc6051d760e19e9174ae3d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:12 -0700 Subject: [PATCH 267/285] mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() Commit c0cb9d91bf297 ("mm/damon/paddr: report filter-passed bytes back for DAMOS_STAT action") added unused variable in damon_pa_stat(), due to a copy-and-paste error. Remove it. Link: https://lkml.kernel.org/r/20250513002715.40126-4-sj@kernel.org Fixes: c0cb9d91bf29 ("mm/damon/paddr: report filter-passed bytes back for DAMOS_STAT action") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- mm/damon/paddr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 1b70d3f36046..e8464f7e0014 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -548,7 +548,6 @@ static unsigned long damon_pa_stat(struct damon_region *r, struct damos *s, unsigned long *sz_filter_passed) { unsigned long addr; - LIST_HEAD(folio_list); struct folio *folio; if (!damon_pa_scheme_has_filter(s)) From 094fb14913c7c92b2fa8b398fe6a4b8f143aec25 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:13 -0700 Subject: [PATCH 268/285] mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() DAMOS filters' default reject behavior is not very simple. Actually there was a mistake[1] during the development. Add a kunit test for validating the behavior. Link: https://lkml.kernel.org/r/20250513002715.40126-5-sj@kernel.org Link: https://lore.kernel.org/20250227002913.19359-1-sj@kernel.org [1] Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- mm/damon/tests/core-kunit.h | 70 +++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h index be0fea9ee5fc..298c67557fae 100644 --- a/mm/damon/tests/core-kunit.h +++ b/mm/damon/tests/core-kunit.h @@ -510,6 +510,75 @@ static void damon_test_feed_loop_next_input(struct kunit *test) damon_feed_loop_next_input(last_input, 2000)); } +static void damon_test_set_filters_default_reject(struct kunit *test) +{ + struct damos scheme; + struct damos_filter *target_filter, *anon_filter; + + INIT_LIST_HEAD(&scheme.filters); + INIT_LIST_HEAD(&scheme.ops_filters); + + damos_set_filters_default_reject(&scheme); + /* + * No filter is installed. Allow by default on both core and ops layer + * filtering stages, since there are no filters at all. 
+ */ + KUNIT_EXPECT_EQ(test, scheme.core_filters_default_reject, false); + KUNIT_EXPECT_EQ(test, scheme.ops_filters_default_reject, false); + + target_filter = damos_new_filter(DAMOS_FILTER_TYPE_TARGET, true, true); + damos_add_filter(&scheme, target_filter); + damos_set_filters_default_reject(&scheme); + /* + * A core-handled allow-filter is installed. + * Rejct by default on core layer filtering stage due to the last + * core-layer-filter's behavior. + * Allow by default on ops layer filtering stage due to the absence of + * ops layer filters. + */ + KUNIT_EXPECT_EQ(test, scheme.core_filters_default_reject, true); + KUNIT_EXPECT_EQ(test, scheme.ops_filters_default_reject, false); + + target_filter->allow = false; + damos_set_filters_default_reject(&scheme); + /* + * A core-handled reject-filter is installed. + * Allow by default on core layer filtering stage due to the last + * core-layer-filter's behavior. + * Allow by default on ops layer filtering stage due to the absence of + * ops layer filters. + */ + KUNIT_EXPECT_EQ(test, scheme.core_filters_default_reject, false); + KUNIT_EXPECT_EQ(test, scheme.ops_filters_default_reject, false); + + anon_filter = damos_new_filter(DAMOS_FILTER_TYPE_ANON, true, true); + damos_add_filter(&scheme, anon_filter); + + damos_set_filters_default_reject(&scheme); + /* + * A core-handled reject-filter and ops-handled allow-filter are installed. + * Allow by default on core layer filtering stage due to the existence + * of the ops-handled filter. + * Reject by default on ops layer filtering stage due to the last + * ops-layer-filter's behavior. + */ + KUNIT_EXPECT_EQ(test, scheme.core_filters_default_reject, false); + KUNIT_EXPECT_EQ(test, scheme.ops_filters_default_reject, true); + + target_filter->allow = true; + damos_set_filters_default_reject(&scheme); + /* + * A core-handled allow-filter and ops-handled allow-filter are + * installed. + * Allow by default on core layer filtering stage due to the existence + * of the ops-handled filter. + * Reject by default on ops layer filtering stage due to the last + * ops-layer-filter's behavior. + */ + KUNIT_EXPECT_EQ(test, scheme.core_filters_default_reject, false); + KUNIT_EXPECT_EQ(test, scheme.ops_filters_default_reject, true); +} + static struct kunit_case damon_test_cases[] = { KUNIT_CASE(damon_test_target), KUNIT_CASE(damon_test_regions), @@ -527,6 +596,7 @@ static struct kunit_case damon_test_cases[] = { KUNIT_CASE(damos_test_new_filter), KUNIT_CASE(damos_test_filter_out), KUNIT_CASE(damon_test_feed_loop_next_input), + KUNIT_CASE(damon_test_set_filters_default_reject), {}, }; From 03f83209e8e72df7eb6ed0afc60adca492844082 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:14 -0700 Subject: [PATCH 269/285] selftests/damon/_damon_sysfs: read tried regions directories in order Kdamond.update_schemes_tried_regions() reads and stores tried regions information out of address order. It makes debugging a test failure difficult. Change the behavior to do the reading and writing in the address order. 
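(The tried_regions subdirectory names are integer region indices, so they must be compared numerically: sorting them as strings would yield '0', '1', '10', '2', ..., while sorting the parsed integers yields the intended 0, 1, 2, ..., 10, matching the address order of the regions.)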
Link: https://lkml.kernel.org/r/20250513002715.40126-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/_damon_sysfs.py | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 6e136dc3df19..1e587e0b1a39 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -420,11 +420,16 @@ class Kdamond: tried_regions = [] tried_regions_dir = os.path.join( scheme.sysfs_dir(), 'tried_regions') + region_indices = [] for filename in os.listdir( os.path.join(scheme.sysfs_dir(), 'tried_regions')): tried_region_dir = os.path.join(tried_regions_dir, filename) if not os.path.isdir(tried_region_dir): continue + region_indices.append(int(filename)) + for region_idx in sorted(region_indices): + tried_region_dir = os.path.join(tried_regions_dir, + '%d' % region_idx) region_values = [] for f in ['start', 'end', 'nr_accesses', 'age']: content, err = read_file( From 6a4b3551ba1079e8831fb2821765a49b7fd33dbd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 12 May 2025 17:27:15 -0700 Subject: [PATCH 270/285] Docs/damon: update titles and brief introductions to explain DAMOS DAMON was initially developed only for data access monitoring, and was then extended to cover not only access monitoring but also access-aware system operations (DAMOS). But the documents still have the old titles and brief introductions that describe only the monitoring part. Update the titles and the brief introductions to explain the DAMOS part together. Link: https://lkml.kernel.org/r/20250513002715.40126-7-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/index.rst | 11 +++++------ Documentation/mm/damon/index.rst | 6 +++--- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..bc7e976120e0 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON ` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON ` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 5a3359704cce..31c1fa955b3d 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -1,8 +1,8 @@ ..
SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ DAMON is a Linux kernel subsystem that provides a framework for data access monitoring and the monitoring results based system operations. The core From 780138b123816d717dbc0771d4c87e9a8a01963d Mon Sep 17 00:00:00 2001 From: Casey Chen Date: Tue, 13 May 2025 12:26:02 -0600 Subject: [PATCH 271/285] alloc_tag: check mem_profiling_support in alloc_tag_init If mem_profiling_support is false, for example by sysctl.vm.mem_profiling=never, alloc_tag_init should skip module tags allocation, codetag type registration and procfs init. Link: https://lkml.kernel.org/r/20250513182602.121843-1-cachen@purestorage.com Signed-off-by: Casey Chen Reviewed-by: Yuanyuan Zhong Acked-by: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/alloc_tag.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 25ecc1334b67..3baf19879c8e 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -244,17 +244,6 @@ static void shutdown_mem_profiling(bool remove_file) mem_profiling_support = false; } -static void __init procfs_init(void) -{ - if (!mem_profiling_support) - return; - - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) { - pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME); - shutdown_mem_profiling(false); - } -} - void __init alloc_tag_sec_init(void) { struct alloc_tag *last_codetag; @@ -762,19 +751,34 @@ static int __init alloc_tag_init(void) }; int res; + sysctl_init(); + + if (!mem_profiling_support) { + pr_info("Memory allocation profiling is not supported!\n"); + return 0; + } + + if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) { + pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME); + shutdown_mem_profiling(false); + return -ENOMEM; + } + res = alloc_mod_tags_mem(); - if (res) + if (res) { + pr_err("Failed to reserve address space for module tags, errno = %d\n", res); + shutdown_mem_profiling(true); return res; + } alloc_tag_cttype = codetag_register_type(&desc); if (IS_ERR(alloc_tag_cttype)) { + pr_err("Allocation tags registration failed, errno = %ld\n", PTR_ERR(alloc_tag_cttype)); free_mod_tags_mem(); + shutdown_mem_profiling(true); return PTR_ERR(alloc_tag_cttype); } - sysctl_init(); - procfs_init(); - return 0; } module_init(alloc_tag_init); From 19e0713bbe4a682b651dfd3773d2a06a9d61c47b Mon Sep 17 00:00:00 2001 From: Ryan Chung Date: Tue, 13 May 2025 16:44:11 +0900 Subject: [PATCH 272/285] selftests/eventfd: correct test name and improve messages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rename test from eventfd_chek_flag_cloexec_and_nonblock to eventfd_check_flag_cloexec_and_nonblock. 
- Make the RDWR‐flag comment declarative: “The kernel automatically adds the O_RDWR flag.” - Update semaphore‐flag failure message to: “eventfd semaphore flag check failed: …” Link: https://lkml.kernel.org/r/20250513074411.6965-1-seokwoo.chung130@gmail.com Signed-off-by: Ryan Chung Reviewed-by: Wen Yang Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/filesystems/eventfd/eventfd_test.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/filesystems/eventfd/eventfd_test.c b/tools/testing/selftests/filesystems/eventfd/eventfd_test.c index 85acb4e3ef00..72d51ad0ee0e 100644 --- a/tools/testing/selftests/filesystems/eventfd/eventfd_test.c +++ b/tools/testing/selftests/filesystems/eventfd/eventfd_test.c @@ -50,7 +50,7 @@ TEST(eventfd_check_flag_rdwr) ASSERT_GE(fd, 0); flags = fcntl(fd, F_GETFL); - // since the kernel automatically added O_RDWR. + // The kernel automatically adds the O_RDWR flag. EXPECT_EQ(flags, O_RDWR); close(fd); @@ -85,7 +85,7 @@ TEST(eventfd_check_flag_nonblock) close(fd); } -TEST(eventfd_chek_flag_cloexec_and_nonblock) +TEST(eventfd_check_flag_cloexec_and_nonblock) { int fd, flags; @@ -178,8 +178,7 @@ TEST(eventfd_check_flag_semaphore) // The semaphore could only be obtained from fdinfo. ret = verify_fdinfo(fd, &err, "eventfd-semaphore: ", 19, "1\n"); if (ret != 0) - ksft_print_msg("eventfd-semaphore check failed, msg: %s\n", - err.msg); + ksft_print_msg("eventfd semaphore flag check failed: %s\n", err.msg); EXPECT_EQ(ret, 0); close(fd); From cc79061b8fc119807111b615aa791562374b15b2 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 13 May 2025 14:56:35 +0800 Subject: [PATCH 273/285] mm: khugepaged: decouple SHMEM and file folios' collapse Originally, file page collapse was intended for tmpfs/shmem, to merge its pages into THPs in the background. However, tmpfs/shmem is no longer the only filesystem that supports large folios: some other file systems (such as XFS, erofs ...) support them as well. Therefore, it is time to decouple the support for file folio collapse from SHMEM.
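The gating change at the heart of the patch can be summarized by the following before/after sketch (an illustration only; both affected call sites in the diff below follow this pattern):

    /* before: file collapse was attempted only with SHMEM built in */
    if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { ... }

    /* after: any file-backed VMA qualifies */
    if (!vma_is_anonymous(vma)) { ... }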
Link: https://lkml.kernel.org/r/ce5c2314e0368cf34bda26f9bacf01c982d4da17.1747119309.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Acked-by: David Hildenbrand Acked-by: Zi Yan Cc: Dev Jain Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/khugepaged.h | 8 -------- mm/Kconfig | 2 +- mm/khugepaged.c | 13 ++----------- 3 files changed, 3 insertions(+), 20 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index 1f46046080f5..b8d69cfbb58b 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -15,16 +15,8 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags); extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); -#ifdef CONFIG_SHMEM extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, bool install_pmd); -#else -static inline int collapse_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr, bool install_pmd) -{ - return 0; -} -#endif static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) { diff --git a/mm/Kconfig b/mm/Kconfig index 60ea9eba4814..bd08e151fa1b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -886,7 +886,7 @@ config THP_SWAP config READ_ONLY_THP_FOR_FS bool "Read-only THP for filesystems (EXPERIMENTAL)" - depends on TRANSPARENT_HUGEPAGE && SHMEM + depends on TRANSPARENT_HUGEPAGE help Allow khugepaged to put read-only file-backed pages in THP. diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ebcd7c8a4b44..cdf5a581368b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1464,7 +1464,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) } } -#ifdef CONFIG_SHMEM /* folio must be locked, and mmap_lock must be held */ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct folio *folio, struct page *page) @@ -2353,14 +2352,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, trace_mm_khugepaged_scan_file(mm, folio, file, present, swap, result); return result; } -#else -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, - struct file *file, pgoff_t start, - struct collapse_control *cc) -{ - BUILD_BUG(); -} -#endif static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, struct collapse_control *cc) @@ -2436,7 +2427,7 @@ skip: VM_BUG_ON(khugepaged_scan.address < hstart || khugepaged_scan.address + HPAGE_PMD_SIZE > hend); - if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { + if (!vma_is_anonymous(vma)) { struct file *file = get_file(vma->vm_file); pgoff_t pgoff = linear_page_index(vma, khugepaged_scan.address); @@ -2782,7 +2773,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_assert_locked(mm); memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); - if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { + if (!vma_is_anonymous(vma)) { struct file *file = get_file(vma->vm_file); pgoff_t pgoff = linear_page_index(vma, addr); From 8a4b42b955286c122c4402b458acbc1d92953fbd Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:52 -0700 Subject: [PATCH 274/285] memcg: memcg_rstat_updated re-entrant safe against irqs Patch series "memcg: make memcg stats irq safe", v2. This series converts memcg stats to be irq safe i.e. 
memcg stats can be updated in any context (task, softirq or hardirq) without disabling irqs. This is still not nmi-safe on all architectures, but after this series, converting memcg charging and stats to be nmi-safe will be easier. This patch (of 7): memcg_rstat_updated() is used to track the memcg stats updates for optimizing the flushes. At the moment, it is not re-entrant safe, and the callers disable irqs before calling it. However, to achieve the goal of updating memcg stats without disabling irqs, memcg_rstat_updated() needs to be re-entrant safe against irqs. This patch makes memcg_rstat_updated() re-entrant safe using this_cpu_* ops. On archs with CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS, this patch also makes memcg_rstat_updated() nmi safe. [lorenzo.stoakes@oracle.com: fix build] Link: https://lkml.kernel.org/r/22f69e6e-7908-4e92-96ca-5c70d535c439@lucifer.local Link: https://lkml.kernel.org/r/20250514184158.3471331-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250514184158.3471331-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Tested-by: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2b6768f0e916..01560e22dd06 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -503,8 +503,8 @@ struct memcg_vmstats_percpu { unsigned int stats_updates; /* Cached pointers for fast iteration in memcg_rstat_updated() */ - struct memcg_vmstats_percpu *parent; - struct memcg_vmstats *vmstats; + struct memcg_vmstats_percpu __percpu *parent_pcpu; + struct memcg_vmstats *vmstats; /* The above should fit a single cacheline for memcg_rstat_updated() */ @@ -586,16 +586,21 @@ static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats) static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) { + struct memcg_vmstats_percpu __percpu *statc_pcpu; struct memcg_vmstats_percpu *statc; - int cpu = smp_processor_id(); + int cpu; unsigned int stats_updates; if (!val) return; + /* Don't assume callers have preemption disabled.
*/ + cpu = get_cpu(); + cgroup_rstat_updated(memcg->css.cgroup, cpu); - statc = this_cpu_ptr(memcg->vmstats_percpu); - for (; statc; statc = statc->parent) { + statc_pcpu = memcg->vmstats_percpu; + for (; statc_pcpu; statc_pcpu = statc->parent_pcpu) { + statc = this_cpu_ptr(statc_pcpu); /* * If @memcg is already flushable then all its ancestors are * flushable as well and also there is no need to increase @@ -604,14 +609,15 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) if (memcg_vmstats_needs_flush(statc->vmstats)) break; - stats_updates = READ_ONCE(statc->stats_updates) + abs(val); - WRITE_ONCE(statc->stats_updates, stats_updates); + stats_updates = this_cpu_add_return(statc_pcpu->stats_updates, + abs(val)); if (stats_updates < MEMCG_CHARGE_BATCH) continue; + stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0); atomic64_add(stats_updates, &statc->vmstats->stats_updates); - WRITE_ONCE(statc->stats_updates, 0); } + put_cpu(); } static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force) @@ -3689,7 +3695,8 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) { - struct memcg_vmstats_percpu *statc, *pstatc; + struct memcg_vmstats_percpu *statc; + struct memcg_vmstats_percpu __percpu *pstatc_pcpu; struct mem_cgroup *memcg; int node, cpu; int __maybe_unused i; @@ -3720,9 +3727,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) for_each_possible_cpu(cpu) { if (parent) - pstatc = per_cpu_ptr(parent->vmstats_percpu, cpu); + pstatc_pcpu = parent->vmstats_percpu; statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); - statc->parent = parent ? pstatc : NULL; + statc->parent_pcpu = parent ? pstatc_pcpu : NULL; statc->vmstats = memcg->vmstats; } From c7163535cdaf7305aac14745483abb54684a43d5 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:53 -0700 Subject: [PATCH 275/285] memcg: move preempt disable to callers of memcg_rstat_updated Let's move the explicit preempt disable code to the callers of memcg_rstat_updated, and also remove memcg_stats_lock and the related functions, which ensured that the callers of the stats update functions had disabled preemption; now the stats update functions disable preemption explicitly themselves. Link: https://lkml.kernel.org/r/20250514184158.3471331-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 74 +++++++++++++------------------------------------ 1 file changed, 19 insertions(+), 55 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 01560e22dd06..82b51e1b36f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -555,48 +555,22 @@ static u64 flush_last_time; #define FLUSH_TIME (2UL*HZ) -/* - * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can - * not rely on this as part of an acquired spinlock_t lock. These functions are - * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion - * is sufficient.
- */ -static void memcg_stats_lock(void) -{ - preempt_disable_nested(); - VM_WARN_ON_IRQS_ENABLED(); -} - -static void __memcg_stats_lock(void) -{ - preempt_disable_nested(); -} - -static void memcg_stats_unlock(void) -{ - preempt_enable_nested(); -} - - static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats) { return atomic64_read(&vmstats->stats_updates) > MEMCG_CHARGE_BATCH * num_online_cpus(); } -static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) +static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val, + int cpu) { struct memcg_vmstats_percpu __percpu *statc_pcpu; struct memcg_vmstats_percpu *statc; - int cpu; unsigned int stats_updates; if (!val) return; - /* Don't assume callers have preemption disabled. */ - cpu = get_cpu(); - cgroup_rstat_updated(memcg->css.cgroup, cpu); statc_pcpu = memcg->vmstats_percpu; for (; statc_pcpu; statc_pcpu = statc->parent_pcpu) { @@ -617,7 +591,6 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0); atomic64_add(stats_updates, &statc->vmstats->stats_updates); } - put_cpu(); } static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force) @@ -715,6 +688,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val) { int i = memcg_stats_index(idx); + int cpu; if (mem_cgroup_disabled()) return; @@ -722,12 +696,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; - memcg_stats_lock(); + cpu = get_cpu(); + __this_cpu_add(memcg->vmstats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); - memcg_rstat_updated(memcg, val); + memcg_rstat_updated(memcg, val, cpu); trace_mod_memcg_state(memcg, idx, val); - memcg_stats_unlock(); + + put_cpu(); } #ifdef CONFIG_MEMCG_V1 @@ -756,6 +732,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec, struct mem_cgroup_per_node *pn; struct mem_cgroup *memcg; int i = memcg_stats_index(idx); + int cpu; if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; @@ -763,24 +740,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec, pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); memcg = pn->memcg; - /* - * The caller from rmap relies on disabled preemption because they never - * update their counter from in-interrupt context. For these two - * counters we check that the update is never performed from an - * interrupt context while other caller need to have disabled interrupt. 
- */ - __memcg_stats_lock(); - if (IS_ENABLED(CONFIG_DEBUG_VM)) { - switch (idx) { - case NR_ANON_MAPPED: - case NR_FILE_MAPPED: - case NR_ANON_THPS: - WARN_ON_ONCE(!in_task()); - break; - default: - VM_WARN_ON_IRQS_ENABLED(); - } - } + cpu = get_cpu(); /* Update memcg */ __this_cpu_add(memcg->vmstats_percpu->state[i], val); @@ -789,9 +749,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec, __this_cpu_add(pn->lruvec_stats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); - memcg_rstat_updated(memcg, val); + memcg_rstat_updated(memcg, val, cpu); trace_mod_memcg_lruvec_state(memcg, idx, val); - memcg_stats_unlock(); + + put_cpu(); } /** @@ -871,6 +832,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) { int i = memcg_events_index(idx); + int cpu; if (mem_cgroup_disabled()) return; @@ -878,11 +840,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; - memcg_stats_lock(); + cpu = get_cpu(); + __this_cpu_add(memcg->vmstats_percpu->events[i], count); - memcg_rstat_updated(memcg, count); + memcg_rstat_updated(memcg, count, cpu); trace_count_memcg_events(memcg, idx, count); - memcg_stats_unlock(); + + put_cpu(); } unsigned long memcg_events(struct mem_cgroup *memcg, int event) From 8814e3b8692b31e0150a894dd70c14ca0b7b746a Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:54 -0700 Subject: [PATCH 276/285] memcg: make mod_memcg_state re-entrant safe against irqs Let's make mod_memcg_state re-entrant safe against irqs. The only thing needed is to convert the usage of __this_cpu_add() to this_cpu_add(). In addition, with re-entrant safety, there is no need to disable irqs. mod_memcg_state() is not safe against nmi, so let's add a warning if someone tries to call it in nmi context.
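For illustration, the property the conversion relies on can be sketched as follows (hypothetical caller, not code from the patch):

    unsigned long flags;

    /* __this_cpu_add() is not re-entrant safe; the caller itself must
     * exclude irqs (and preemption) around it: */
    local_irq_save(flags);
    __this_cpu_add(memcg->vmstats_percpu->state[i], val);
    local_irq_restore(flags);

    /* this_cpu_add() is a single irq-safe read-modify-write on the
     * local CPU, so no irq disabling is needed around it: */
    this_cpu_add(memcg->vmstats_percpu->state[i], val);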
Link: https://lkml.kernel.org/r/20250514184158.3471331-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 20 ++------------------ mm/memcontrol.c | 8 ++++---- 2 files changed, 6 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9ed75f82b858..92861ff3c43f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -903,19 +903,9 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim, struct mem_cgroup *oom_domain); void mem_cgroup_print_oom_group(struct mem_cgroup *memcg); -void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, - int val); - /* idx can be of type enum memcg_stat_item or node_stat_item */ -static inline void mod_memcg_state(struct mem_cgroup *memcg, - enum memcg_stat_item idx, int val) -{ - unsigned long flags; - - local_irq_save(flags); - __mod_memcg_state(memcg, idx, val); - local_irq_restore(flags); -} +void mod_memcg_state(struct mem_cgroup *memcg, + enum memcg_stat_item idx, int val); static inline void mod_memcg_page_state(struct page *page, enum memcg_stat_item idx, int val) @@ -1375,12 +1365,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) { } -static inline void __mod_memcg_state(struct mem_cgroup *memcg, - enum memcg_stat_item idx, - int nr) -{ -} - static inline void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int nr) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 82b51e1b36f0..25d5b198c8af 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -679,12 +679,12 @@ static int memcg_state_val_in_pages(int idx, int val) } /** - * __mod_memcg_state - update cgroup memory statistics + * mod_memcg_state - update cgroup memory statistics * @memcg: the memory cgroup * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item * @val: delta to add to the counter, can be negative */ -void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, +void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val) { int i = memcg_stats_index(idx); @@ -698,7 +698,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, cpu = get_cpu(); - __this_cpu_add(memcg->vmstats_percpu->state[i], val); + this_cpu_add(memcg->vmstats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); memcg_rstat_updated(memcg, val, cpu); trace_mod_memcg_state(memcg, idx, val); @@ -2918,7 +2918,7 @@ static void drain_obj_stock(struct obj_stock_pcp *stock) memcg = get_mem_cgroup_from_objcg(old); - __mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); memcg1_account_kmem(memcg, -nr_pages); if (!mem_cgroup_is_root(memcg)) memcg_uncharge(memcg, nr_pages); From e52401e7247bb36cabba389ae32fb75a12a6e94b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:55 -0700 Subject: [PATCH 277/285] memcg: make count_memcg_events re-entrant safe against irqs Let's make count_memcg_events re-entrant safe against irqs. The only thing needed is to convert the usage of __this_cpu_add() to this_cpu_add(). In addition, with re-entrant safety, there is no need to disable irqs. Also add warnings for in_nmi() as it is not safe against nmi context. 
Link: https://lkml.kernel.org/r/20250514184158.3471331-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 21 ++------------------- mm/memcontrol-v1.c | 6 +++--- mm/memcontrol.c | 6 +++--- mm/swap.c | 8 ++++---- mm/vmscan.c | 14 +++++++------- 5 files changed, 19 insertions(+), 36 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 92861ff3c43f..f7848f73f41c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -942,19 +942,8 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, local_irq_restore(flags); } -void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, - unsigned long count); - -static inline void count_memcg_events(struct mem_cgroup *memcg, - enum vm_event_item idx, - unsigned long count) -{ - unsigned long flags; - - local_irq_save(flags); - __count_memcg_events(memcg, idx, count); - local_irq_restore(flags); -} +void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, + unsigned long count); static inline void count_memcg_folio_events(struct folio *folio, enum vm_event_item idx, unsigned long nr) @@ -1418,12 +1407,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, } static inline void count_memcg_events(struct mem_cgroup *memcg, - enum vm_event_item idx, - unsigned long count) -{ -} - -static inline void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) { diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 54c49cbfc968..4b94731305b9 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -512,9 +512,9 @@ static void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) { /* pagein of a big page is an event. 
So, ignore page size */ if (nr_pages > 0) - __count_memcg_events(memcg, PGPGIN, 1); + count_memcg_events(memcg, PGPGIN, 1); else { - __count_memcg_events(memcg, PGPGOUT, 1); + count_memcg_events(memcg, PGPGOUT, 1); nr_pages = -nr_pages; /* for event */ } @@ -689,7 +689,7 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, unsigned long flags; local_irq_save(flags); - __count_memcg_events(memcg, PGPGOUT, pgpgout); + count_memcg_events(memcg, PGPGOUT, pgpgout); __this_cpu_add(memcg->events_percpu->nr_page_events, nr_memory); memcg1_check_events(memcg, nid); local_irq_restore(flags); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25d5b198c8af..0ef7db12605b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -823,12 +823,12 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) } /** - * __count_memcg_events - account VM events in a cgroup + * count_memcg_events - account VM events in a cgroup * @memcg: the memory cgroup * @idx: the event item * @count: the number of events that occurred */ -void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, +void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) { int i = memcg_events_index(idx); @@ -842,7 +842,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, cpu = get_cpu(); - __this_cpu_add(memcg->vmstats_percpu->events[i], count); + this_cpu_add(memcg->vmstats_percpu->events[i], count); memcg_rstat_updated(memcg, count, cpu); trace_count_memcg_events(memcg, idx, count); diff --git a/mm/swap.c b/mm/swap.c index 77b2d5997873..4fc322f7111a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -309,7 +309,7 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio) trace_mm_lru_activate(folio); __count_vm_events(PGACTIVATE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages); + count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages); } #ifdef CONFIG_SMP @@ -581,7 +581,7 @@ static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio) if (active) { __count_vm_events(PGDEACTIVATE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, + count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages); } } @@ -599,7 +599,7 @@ static void lru_deactivate(struct lruvec *lruvec, struct folio *folio) lruvec_add_folio(lruvec, folio); __count_vm_events(PGDEACTIVATE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages); + count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages); } static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio) @@ -625,7 +625,7 @@ static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio) lruvec_add_folio(lruvec, folio); __count_vm_events(PGLAZYFREE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages); + count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages); } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 7d6d1ce3921e..07c51fa03434 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2028,7 +2028,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, item = PGSCAN_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_scanned); - __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); + count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); __count_vm_events(PGSCAN_ANON + file, nr_scanned); spin_unlock_irq(&lruvec->lru_lock); @@ -2048,7 +2048,7 @@ static unsigned long 
shrink_inactive_list(unsigned long nr_to_scan, item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); - __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); + count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); spin_unlock_irq(&lruvec->lru_lock); @@ -2138,7 +2138,7 @@ static void shrink_active_list(unsigned long nr_to_scan, if (!cgroup_reclaim(sc)) __count_vm_events(PGREFILL, nr_scanned); - __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); + count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); spin_unlock_irq(&lruvec->lru_lock); @@ -2195,7 +2195,7 @@ static void shrink_active_list(unsigned long nr_to_scan, nr_deactivate = move_folios_to_lru(lruvec, &l_inactive); __count_vm_events(PGDEACTIVATE, nr_deactivate); - __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); + count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); spin_unlock_irq(&lruvec->lru_lock); @@ -4612,8 +4612,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, __count_vm_events(item, isolated); __count_vm_events(PGREFILL, sorted); } - __count_memcg_events(memcg, item, isolated); - __count_memcg_events(memcg, PGREFILL, sorted); + count_memcg_events(memcg, item, isolated); + count_memcg_events(memcg, PGREFILL, sorted); __count_vm_events(PGSCAN_ANON + type, isolated); trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH, scanned, skipped, isolated, @@ -4763,7 +4763,7 @@ retry: item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, reclaimed); - __count_memcg_events(memcg, item, reclaimed); + count_memcg_events(memcg, item, reclaimed); __count_vm_events(PGSTEAL_ANON + type, reclaimed); spin_unlock_irq(&lruvec->lru_lock); From eee8a1778cab9efd5ba0eee1c081b4a4b396375b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:56 -0700 Subject: [PATCH 278/285] memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs Let's make __mod_memcg_lruvec_state re-entrant safe and name it mod_memcg_lruvec_state(). The only thing needed is to convert the usage of __this_cpu_add() to this_cpu_add(). There are two callers of mod_memcg_lruvec_state(), and one of them, i.e. __mod_objcg_mlstate(), will be re-entrant safe as well, so rename it to mod_objcg_mlstate(). The other caller, __mod_lruvec_state(), still calls __mod_node_page_state(), which is not re-entrant safe yet, so keep it as is.
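The resulting naming convention can be sketched as follows (an illustrative summary, not code from the patch): the remaining double-underscore variant still requires its callers to exclude irqs, while the renamed functions do not:

    mod_memcg_lruvec_state(lruvec, idx, val); /* re-entrant safe against irqs */
    mod_objcg_mlstate(objcg, pgdat, idx, nr); /* re-entrant safe against irqs */
    __mod_lruvec_state(lruvec, idx, val);     /* caller still disables irqs, since it
                                                 also calls __mod_node_page_state() */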
Link: https://lkml.kernel.org/r/20250514184158.3471331-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0ef7db12605b..f66cacb6f2a0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -725,7 +725,7 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) } #endif -static void __mod_memcg_lruvec_state(struct lruvec *lruvec, +static void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) { @@ -743,10 +743,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec, cpu = get_cpu(); /* Update memcg */ - __this_cpu_add(memcg->vmstats_percpu->state[i], val); + this_cpu_add(memcg->vmstats_percpu->state[i], val); /* Update lruvec */ - __this_cpu_add(pn->lruvec_stats_percpu->state[i], val); + this_cpu_add(pn->lruvec_stats_percpu->state[i], val); val = memcg_state_val_in_pages(idx, val); memcg_rstat_updated(memcg, val, cpu); @@ -773,7 +773,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg and lruvec */ if (!mem_cgroup_disabled()) - __mod_memcg_lruvec_state(lruvec, idx, val); + mod_memcg_lruvec_state(lruvec, idx, val); } void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, @@ -2525,7 +2525,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) folio->memcg_data = (unsigned long)memcg; } -static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg, +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) { @@ -2535,7 +2535,7 @@ static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } @@ -2845,12 +2845,12 @@ static void __account_obj_stock(struct obj_cgroup *objcg, struct pglist_data *oldpg = stock->cached_pgdat; if (stock->nr_slab_reclaimable_b) { - __mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B, + mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B, stock->nr_slab_reclaimable_b); stock->nr_slab_reclaimable_b = 0; } if (stock->nr_slab_unreclaimable_b) { - __mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B, + mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B, stock->nr_slab_unreclaimable_b); stock->nr_slab_unreclaimable_b = 0; } @@ -2876,7 +2876,7 @@ static void __account_obj_stock(struct obj_cgroup *objcg, } } if (nr) - __mod_objcg_mlstate(objcg, pgdat, idx, nr); + mod_objcg_mlstate(objcg, pgdat, idx, nr); } static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, @@ -2945,13 +2945,13 @@ static void drain_obj_stock(struct obj_stock_pcp *stock) */ if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) { if (stock->nr_slab_reclaimable_b) { - __mod_objcg_mlstate(old, stock->cached_pgdat, + mod_objcg_mlstate(old, stock->cached_pgdat, NR_SLAB_RECLAIMABLE_B, stock->nr_slab_reclaimable_b); stock->nr_slab_reclaimable_b = 0; } if (stock->nr_slab_unreclaimable_b) { - __mod_objcg_mlstate(old, stock->cached_pgdat, + mod_objcg_mlstate(old, stock->cached_pgdat, NR_SLAB_UNRECLAIMABLE_B, 
stock->nr_slab_unreclaimable_b); stock->nr_slab_unreclaimable_b = 0; From 0ccf1806d44f7c567c73d6ef076a8f6c8c16c74b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:57 -0700 Subject: [PATCH 279/285] memcg: no stock lock for cpu hot-unplug Previously, on cpu hot-unplug the kernel would call drain_obj_stock() with the objcg local lock held. The local lock was not needed, as the stock being accessed belongs to a dead cpu, but it was kept to disable irqs, since drain_obj_stock() could call mod_objcg_mlstate(), which required irqs to be disabled. However, there is no longer any need to disable irqs for mod_objcg_mlstate(), so we can remove the local lock altogether from the cpu hot-unplug path. Link: https://lkml.kernel.org/r/20250514184158.3471331-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f66cacb6f2a0..d8508b57d0fa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2023,17 +2023,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { - struct obj_stock_pcp *obj_st; - unsigned long flags; - - obj_st = &per_cpu(obj_stock, cpu); - - /* drain_obj_stock requires objstock.lock */ - local_lock_irqsave(&obj_stock.lock, flags); - drain_obj_stock(obj_st); - local_unlock_irqrestore(&obj_stock.lock, flags); - /* no need for the local lock */ + drain_obj_stock(&per_cpu(obj_stock, cpu)); drain_stock_fully(&per_cpu(memcg_stock, cpu)); return 0; From 200577f69f29a58c90c67c83a0df6d12850e1d09 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 May 2025 11:41:58 -0700 Subject: [PATCH 280/285] memcg: objcg stock trylock without irq disabling There is no need to disable irqs to use the objcg per-cpu stock, so let's just not do that; consume_obj_stock() and refill_obj_stock() will need to use trylock instead, to avoid deadlocking against an irq. One consequence of this change is that a charge request from irq context may take the slowpath more often, but it should be rare.
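The resulting fastpath pattern is roughly the following sketch (an illustration only; the real consume/refill code is in the diff below):

    if (!local_trylock(&obj_stock.lock)) {
        /* The per-cpu stock lock is already held on this CPU,
         * e.g. an irq arrived while a task was using the stock.
         * Spinning here would deadlock, so fall back instead. */
        return false;           /* consume: caller charges via the slowpath */
    }
    /* ... consume from or refill the per-cpu objcg stock ... */
    local_unlock(&obj_stock.lock);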
Link: https://lkml.kernel.org/r/20250514184158.3471331-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Vlastimil Babka Cc: Alexei Starovoitov Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- mm/memcontrol.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d8508b57d0fa..35db91fddd1f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1880,18 +1880,17 @@ static void drain_local_memcg_stock(struct work_struct *dummy) static void drain_local_obj_stock(struct work_struct *dummy) { struct obj_stock_pcp *stock; - unsigned long flags; if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&obj_stock.lock, flags); + local_lock(&obj_stock.lock); stock = this_cpu_ptr(&obj_stock); drain_obj_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&obj_stock.lock, flags); + local_unlock(&obj_stock.lock); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -2874,10 +2873,10 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, struct pglist_data *pgdat, enum node_stat_item idx) { struct obj_stock_pcp *stock; - unsigned long flags; bool ret = false; - local_lock_irqsave(&obj_stock.lock, flags); + if (!local_trylock(&obj_stock.lock)) + return ret; stock = this_cpu_ptr(&obj_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { @@ -2888,7 +2887,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&obj_stock.lock, flags); + local_unlock(&obj_stock.lock); return ret; } @@ -2977,10 +2976,16 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, enum node_stat_item idx) { struct obj_stock_pcp *stock; - unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&obj_stock.lock, flags); + if (!local_trylock(&obj_stock.lock)) { + if (pgdat) + mod_objcg_mlstate(objcg, pgdat, idx, nr_bytes); + nr_pages = nr_bytes >> PAGE_SHIFT; + nr_bytes = nr_bytes & (PAGE_SIZE - 1); + atomic_add(nr_bytes, &objcg->nr_charged_bytes); + goto out; + } stock = this_cpu_ptr(&obj_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ @@ -3002,8 +3007,8 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&obj_stock.lock, flags); - + local_unlock(&obj_stock.lock); +out: if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); } From b0752f1a709740fd6ae518e969a25f90ec6a100f Mon Sep 17 00:00:00 2001 From: Fan Ni Date: Mon, 5 May 2025 11:22:41 -0700 Subject: [PATCH 281/285] mm/hugetlb: pass folio instead of page to unmap_ref_private() Patch series "Let unmap_hugepage_range() and several related functions take folio instead of page", v4. This patch (of 4): unmap_ref_private() has only a single user, which passes in &folio->page. Let it take the folio directly. 
Link: https://lkml.kernel.org/r/20250505182345.506888-2-nifan.cxl@gmail.com Link: https://lkml.kernel.org/r/20250505182345.506888-3-nifan.cxl@gmail.com Signed-off-by: Fan Ni Reviewed-by: Muchun Song Reviewed-by: Sidhartha Kumar Reviewed-by: Oscar Salvador Reviewed-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: "Vishal Moola (Oracle)" Signed-off-by: Andrew Morton --- mm/hugetlb.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0057d1f1dc9a..0c2b264a7ab8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6071,7 +6071,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, * same region. */ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, - struct page *page, unsigned long address) + struct folio *folio, unsigned long address) { struct hstate *h = hstate_vma(vma); struct vm_area_struct *iter_vma; @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, */ if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER)) unmap_hugepage_range(iter_vma, address, - address + huge_page_size(h), page, 0); + address + huge_page_size(h), + &folio->page, 0); } i_mmap_unlock_write(mapping); } @@ -6238,8 +6239,7 @@ retry_avoidcopy: hugetlb_vma_unlock_read(vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); - unmap_ref_private(mm, vma, &old_folio->page, - vmf->address); + unmap_ref_private(mm, vma, old_folio, vmf->address); mutex_lock(&hugetlb_fault_mutex_table[hash]); hugetlb_vma_lock_read(vma); From 81edb1ba3232afd45ae7f3f492a91019571b18c9 Mon Sep 17 00:00:00 2001 From: Fan Ni Date: Mon, 5 May 2025 11:22:42 -0700 Subject: [PATCH 282/285] mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page The function unmap_hugepage_range() has two kinds of users: 1) unmap_ref_private(), which passes in the head page of a folio. Since unmap_ref_private() already takes a folio and there are no other uses of the folio struct in the function, it is natural for unmap_hugepage_range() to take a folio also. 2) All other uses, which pass in a NULL pointer. In both cases, we can pass in a folio. Refactor unmap_hugepage_range() to take a folio.
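Concretely, the two kinds of call sites look roughly like this after the change (an illustrative sketch; the exact sites are in the diff below):

    /* 1) unmap_ref_private(): a specific folio is being unmapped */
    unmap_hugepage_range(iter_vma, address,
                         address + huge_page_size(h), folio, 0);

    /* 2) everyone else: unmap a range, with no specific folio */
    unmap_hugepage_range(vma, start, end, NULL, zap_flags);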
Link: https://lkml.kernel.org/r/20250505182345.506888-4-nifan.cxl@gmail.com Signed-off-by: Fan Ni Reviewed-by: Muchun Song Reviewed-by: Sidhartha Kumar Reviewed-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: "Vishal Moola (Oracle)" Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 4 ++-- mm/hugetlb.c | 7 ++++--- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 23ebf49c5d6a..f6d5f24e793c 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -129,8 +129,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *); void unmap_hugepage_range(struct vm_area_struct *, - unsigned long, unsigned long, struct page *, - zap_flags_t); + unsigned long start, unsigned long end, + struct folio *, zap_flags_t); void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start, unsigned long end, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0c2b264a7ab8..c339ffe05556 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6046,7 +6046,7 @@ void __hugetlb_zap_end(struct vm_area_struct *vma, } void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end, struct page *ref_page, + unsigned long end, struct folio *folio, zap_flags_t zap_flags) { struct mmu_notifier_range range; @@ -6058,7 +6058,8 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, mmu_notifier_invalidate_range_start(&range); tlb_gather_mmu(&tlb, vma->vm_mm); - __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags); + __unmap_hugepage_range(&tlb, vma, start, end, + &folio->page, zap_flags); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); @@ -6116,7 +6117,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER)) unmap_hugepage_range(iter_vma, address, address + huge_page_size(h), - &folio->page, 0); + folio, 0); } i_mmap_unlock_write(mapping); } From 7f4b6065d9a842721a04632fc219aa453d1b2f5c Mon Sep 17 00:00:00 2001 From: Fan Ni Date: Mon, 5 May 2025 11:22:43 -0700 Subject: [PATCH 283/285] mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page The function __unmap_hugepage_range() has two kinds of users: 1) unmap_hugepage_range(), which passes in the head page of a folio. Since unmap_hugepage_range() already takes a folio and there are no other uses of the folio struct in the function, it is natural for __unmap_hugepage_range() to take a folio also. 2) All other uses, which pass in a NULL pointer. In both cases, we can pass in a folio. Refactor __unmap_hugepage_range() to take a folio.
Link: https://lkml.kernel.org/r/20250505182345.506888-5-nifan.cxl@gmail.com Signed-off-by: Fan Ni Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Sidhartha Kumar Cc: "Vishal Moola (Oracle)" Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 4 ++-- mm/hugetlb.c | 18 +++++++++--------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index f6d5f24e793c..eb21619206af 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -134,7 +134,7 @@ void unmap_hugepage_range(struct vm_area_struct *, void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start, unsigned long end, - struct page *ref_page, zap_flags_t zap_flags); + struct folio *, zap_flags_t zap_flags); void hugetlb_report_meminfo(struct seq_file *); int hugetlb_report_node_meminfo(char *buf, int len, int nid); void hugetlb_show_meminfo_node(int nid); @@ -455,7 +455,7 @@ static inline long hugetlb_change_protection( static inline void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start, - unsigned long end, struct page *ref_page, + unsigned long end, struct folio *folio, zap_flags_t zap_flags) { BUG(); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c339ffe05556..443b75e116cf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5840,7 +5840,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start, unsigned long end, - struct page *ref_page, zap_flags_t zap_flags) + struct folio *folio, zap_flags_t zap_flags) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -5913,12 +5913,12 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, page = pte_page(pte); /* - * If a reference page is supplied, it is because a specific - * page is being unmapped, not a range. Ensure the page we - * are about to unmap is the actual page of interest. + * If a folio is supplied, it is because a specific + * folio is being unmapped, not a range. Ensure the folio we + * are about to unmap is the actual folio of interest. */ - if (ref_page) { - if (page != ref_page) { + if (folio) { + if (page_folio(page) != folio) { spin_unlock(ptl); continue; } @@ -5982,9 +5982,9 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, tlb_remove_page_size(tlb, page, huge_page_size(h)); /* - * Bail out after unmapping reference page if supplied + * If we were instructed to unmap a specific folio, we're done. */ - if (ref_page) + if (folio) break; } tlb_end_vma(tlb, vma); @@ -6059,7 +6059,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, tlb_gather_mmu(&tlb, vma->vm_mm); __unmap_hugepage_range(&tlb, vma, start, end, - &folio->page, zap_flags); + folio, zap_flags); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); From 05275594a311cd7427a64b8bcc6097645c0344af Mon Sep 17 00:00:00 2001 From: Fan Ni Date: Mon, 5 May 2025 11:22:44 -0700 Subject: [PATCH 284/285] mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() In __unmap_hugepage_range(), the "page" pointer always points to the first page of a huge page, which guarantees there is a folio associated with it. Convert the "page" pointer to use a folio.
Link: https://lkml.kernel.org/r/20250505182345.506888-6-nifan.cxl@gmail.com Signed-off-by: Fan Ni Reviewed-by: Oscar Salvador Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Sidhartha Kumar Cc: "Vishal Moola (Oracle)" Signed-off-by: Andrew Morton --- mm/hugetlb.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 443b75e116cf..d53caf96a4b2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5843,11 +5843,11 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, struct folio *folio, zap_flags_t zap_flags) { struct mm_struct *mm = vma->vm_mm; + const bool folio_provided = !!folio; unsigned long address; pte_t *ptep; pte_t pte; spinlock_t *ptl; - struct page *page; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); bool adjust_reservation = false; @@ -5911,14 +5911,13 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, continue; } - page = pte_page(pte); /* * If a folio is supplied, it is because a specific * folio is being unmapped, not a range. Ensure the folio we * are about to unmap is the actual folio of interest. */ - if (folio) { - if (page_folio(page) != folio) { + if (folio_provided) { + if (folio != page_folio(pte_page(pte))) { spin_unlock(ptl); continue; } @@ -5928,12 +5927,14 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, * looking like data was lost */ set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED); + } else { + folio = page_folio(pte_page(pte)); } pte = huge_ptep_get_and_clear(mm, address, ptep, sz); tlb_remove_huge_tlb_entry(h, tlb, ptep, address); if (huge_pte_dirty(pte)) - set_page_dirty(page); + folio_mark_dirty(folio); /* Leave a uffd-wp pte marker if needed */ if (huge_pte_uffd_wp(pte) && !(zap_flags & ZAP_FLAG_DROP_MARKER)) @@ -5941,7 +5942,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, make_pte_marker(PTE_MARKER_UFFD_WP), sz); hugetlb_count_sub(pages_per_huge_page(h), mm); - hugetlb_remove_rmap(page_folio(page)); + hugetlb_remove_rmap(folio); /* * Restore the reservation for anonymous page, otherwise the @@ -5950,8 +5951,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, * reservation bit. */ if (!h->surplus_huge_pages && __vma_private_lock(vma) && - folio_test_anon(page_folio(page))) { - folio_set_hugetlb_restore_reserve(page_folio(page)); + folio_test_anon(folio)) { + folio_set_hugetlb_restore_reserve(folio); /* Reservation to be adjusted after the spin lock */ adjust_reservation = true; } @@ -5975,16 +5976,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, * count will not be incremented by free_huge_folio. * Act as if we consumed the reservation. */ - folio_clear_hugetlb_restore_reserve(page_folio(page)); + folio_clear_hugetlb_restore_reserve(folio); else if (rc) vma_add_reservation(h, vma, address); } - tlb_remove_page_size(tlb, page, huge_page_size(h)); + tlb_remove_page_size(tlb, folio_page(folio, 0), + folio_size(folio)); /* * If we were instructed to unmap a specific folio, we're done. 
*/ - if (folio) + if (folio_provided) break; } tlb_end_vma(tlb, vma); From c544a952ba61b1a025455098033c17e0573ab085 Mon Sep 17 00:00:00 2001 From: Nikhil Dhama Date: Mon, 7 Apr 2025 16:22:19 +0530 Subject: [PATCH 285/285] mm: pcp: increase pcp->free_count threshold to trigger free_high

In the old pcp design, pcp->free_factor was incremented in nr_pcp_free(), which is invoked by free_pcppages_bulk(). So it used to increase free_factor by 1 only when we tried to reduce the size of the pcp list, and free_high used to trigger only for order > 0, order < costly_order, and pcp->free_factor > 0.

For iperf3, I noticed that with the older design in kernel v6.6, the pcp list was drained mostly when pcp->count > high (more often when count goes above 530), and most of the time pcp->free_factor was 0, triggering very few high order flushes.

But this changed in the current design, introduced in commit 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing"), where pcp->free_factor was replaced by pcp->free_count to keep track of the number of pages freed contiguously. In this design, pcp->free_count is incremented on every deallocation, irrespective of whether the pcp list was reduced or not. The logic to trigger free_high is that pcp->free_count goes above batch (which is 63) and there are two contiguous page frees without any allocation.

With this design, for iperf3, the pcp list is flushed more frequently because the free_high heuristic is triggered more often now. I observed that the high order pcp list is drained as soon as both count and free_count go above 63. Due to this more aggressive high order flushing, applications doing contiguous high order allocations will have to go to the global list more frequently. On a 2-node AMD machine with 384 vCPUs on each node, connected via Mellanox ConnectX-7, I am seeing a ~30% performance reduction if we scale the number of iperf3 client/server pairs from 32 to 64.

Though the new design reduced the time to detect consecutive high order page freeing, for applications which allocate high order pages more frequently it may flush the high order list prematurely. This motivates tuning how late or early we should flush high order lists.

So, in this patch, we increased the pcp->free_count threshold to trigger free_high from "batch" to "batch + pcp->high_min / 2", as suggested by Ying [1]. In the original pcp->free_factor solution, free_high is triggered for contiguous freeing with size ranging from "batch" to "pcp->high + batch", so the average value is "batch + pcp->high / 2". In the pcp->free_count solution, free_high will be triggered for contiguous freeing with size "batch". So, to restore the original behavior, we can use the threshold "batch + pcp->high_min / 2". This new threshold keeps high order pages in the pcp list for a longer duration, which can help applications doing high order allocations frequently.
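As an illustrative worked example (the pcp->high_min value here is assumed, not taken from the report): with batch = 63 and pcp->high_min = 512 on a given zone, the bare pcp->free_count check could consider free_high after only 63 contiguously freed pages, whereas the new threshold requires 63 + 512 / 2 = 319 contiguously freed pages, keeping high order pages on the pcp list considerably longer.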
With this patch, performance of iperf3 is restored, and scores for other benchmarks on the same machine are as follows:

                          iperf3   lmbench3    netperf             kbuild
                                   (AF_UNIX)   (SCTP_STREAM_MANY)
                          -------  ---------   -----------------   ------
    v6.6  vanilla (base)  100      100         100                 100
    v6.12 vanilla          69      113          98.5                98.8
    v6.12 + this patch    100      110.3       100.2                99.3

netperf-tcp:

                  6.12                   6.12
                  vanilla                this_patch
    Hmean  64     732.14 (   0.00%)      730.45 (  -0.23%)
    Hmean  128    1417.46 (   0.00%)     1419.44 (   0.14%)
    Hmean  256    2679.67 (   0.00%)     2676.45 (  -0.12%)
    Hmean  1024   8328.52 (   0.00%)     8339.34 (   0.13%)
    Hmean  2048   12716.98 (   0.00%)    12743.68 (   0.21%)
    Hmean  3312   15787.79 (   0.00%)    15887.25 (   0.63%)
    Hmean  4096   17311.91 (   0.00%)    17332.68 (   0.12%)
    Hmean  8192   20310.73 (   0.00%)    20465.09 (   0.76%)

Link: https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/ [1] Link: https://lkml.kernel.org/r/20250407105219.55351-1-nikhil.dhama@amd.com Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing") Signed-off-by: Nikhil Dhama Suggested-by: Huang Ying Reviewed-by: Huang Ying Cc: Raghavendra K T Cc: Mel Gorman Cc: Bharata B Rao Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dbafa7c69a6a..d578c6f7ee29 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2661,7 +2661,7 @@ static void free_frozen_page_commit(struct zone *zone, * stops will be drained from vmstat refresh context. */ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { - free_high = (pcp->free_count >= batch && + free_high = (pcp->free_count >= (batch + pcp->high_min / 2) && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || pcp->count >= batch));