Commit 36976159 authored by Kairui Song, committed by Andrew Morton

mm, swap: cleanup swap entry management workflow

The current swap entry allocation/freeing workflow has never had a clear
definition.  This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries are
allocated and freed.  Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks.  It also makes more
optimizations possible.

Swap entries will mostly be allocated and freed with a folio bound.  The
folio lock will be useful for resolving many swap related races.

Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of contiguous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap counts start at zero.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio got unmapped (and swap entries got installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio got
  installed back and the swap entry got uninstalled.

  This won't remove the folio from the swap cache or free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.
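
The lifecycle described by the three folio helpers can be sketched as a
toy userspace model. This is not kernel code: every toy_* name, the
fixed slot array, and the boolean "locked" flag are assumptions made
for illustration, and the model changes the count for a whole folio at
once rather than per page.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
#define TOY_NR_SLOTS 8

struct toy_folio {
	bool locked;        /* stands in for the folio lock */
	bool in_swap_cache; /* true once slots are bound to the folio */
	int first_slot;     /* first slot bound to this folio */
	int nr_pages;       /* number of contiguous slots needed */
};

static bool toy_slot_used[TOY_NR_SLOTS];
static int toy_swap_count[TOY_NR_SLOTS];

/* Bind nr_pages contiguous free slots to the folio; counts start at 0. */
static int toy_folio_alloc_swap(struct toy_folio *folio)
{
	assert(folio->locked && !folio->in_swap_cache);
	for (int start = 0; start + folio->nr_pages <= TOY_NR_SLOTS; start++) {
		int i;

		for (i = 0; i < folio->nr_pages; i++)
			if (toy_slot_used[start + i])
				break;
		if (i < folio->nr_pages)
			continue; /* range not free, try the next start */
		for (i = 0; i < folio->nr_pages; i++) {
			toy_slot_used[start + i] = true;
			toy_swap_count[start + i] = 0;
		}
		folio->first_slot = start;
		folio->in_swap_cache = true;
		return 0;
	}
	return -1; /* no contiguous range available */
}

/* Unmap path: each installed swap entry takes one reference. */
static void toy_folio_dup_swap(struct toy_folio *folio)
{
	assert(folio->locked && folio->in_swap_cache);
	for (int i = 0; i < folio->nr_pages; i++)
		toy_swap_count[folio->first_slot + i]++;
}

/* Swapin path: drop one reference; the slots stay bound to the folio. */
static void toy_folio_put_swap(struct toy_folio *folio)
{
	assert(folio->locked && folio->in_swap_cache);
	for (int i = 0; i < folio->nr_pages; i++) {
		assert(toy_swap_count[folio->first_slot + i] > 0);
		toy_swap_count[folio->first_slot + i]--;
	}
}

/* Cache reclaim: succeeds only when no slot is still referenced. */
static bool toy_folio_free_swap(struct toy_folio *folio)
{
	assert(folio->locked && folio->in_swap_cache);
	for (int i = 0; i < folio->nr_pages; i++)
		if (toy_swap_count[folio->first_slot + i] != 0)
			return false;
	for (int i = 0; i < folio->nr_pages; i++)
		toy_slot_used[folio->first_slot + i] = false;
	folio->in_swap_cache = false;
	return true;
}
```

In this model a put that drops the count to zero leaves the slots
bound to the cached folio, mirroring the lazy freeing described above;
only the explicit free step releases them.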

The above locking constraints could be further relaxed when the swap table
is fully implemented.  Currently dup still needs the caller to lock the
swap entry container (e.g.  PTL), or a concurrent zap may underflow the
swap count.

Some swap users need to interact with swap count without involving folio
(e.g.  forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.
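
A minimal userspace sketch of the two direct helpers, assuming a toy
page table whose "locked" flag stands in for the PTL. All demo_* names
are hypothetical, and the sketch simply treats "already referenced" as
a nonzero count; it only illustrates how fork and zap move the count
without involving a folio.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
#define DEMO_NR_SLOTS 8
#define DEMO_NO_ENTRY (-1)

static int demo_swap_count[DEMO_NR_SLOTS];

struct demo_page_table {
	bool locked;              /* stands in for the page table lock */
	int entry[DEMO_NR_SLOTS]; /* slot index per PTE, or DEMO_NO_ENTRY */
};

/* Take one more reference on an entry that is already referenced. */
static void demo_dup_entry_direct(int slot)
{
	assert(demo_swap_count[slot] > 0);
	demo_swap_count[slot]++;
}

/* Drop one reference on each of nr contiguous entries. */
static void demo_put_entries_direct(int slot, int nr)
{
	for (int i = 0; i < nr; i++) {
		assert(demo_swap_count[slot + i] > 0); /* underflow check */
		demo_swap_count[slot + i]--;
	}
}

/* Fork: copy each swap entry to the child and take a reference for it. */
static void demo_fork(struct demo_page_table *parent,
		      struct demo_page_table *child)
{
	assert(parent->locked && child->locked);
	for (int i = 0; i < DEMO_NR_SLOTS; i++) {
		child->entry[i] = parent->entry[i];
		if (parent->entry[i] != DEMO_NO_ENTRY)
			demo_dup_entry_direct(parent->entry[i]);
	}
}

/* Zap: clear each swap entry in the table and drop its reference. */
static void demo_zap(struct demo_page_table *pt)
{
	assert(pt->locked);
	for (int i = 0; i < DEMO_NR_SLOTS; i++) {
		if (pt->entry[i] == DEMO_NO_ENTRY)
			continue;
		demo_put_entries_direct(pt->entry[i], 1);
		pt->entry[i] = DEMO_NO_ENTRY;
	}
}
```

Because both paths assert that the table is locked, a concurrent zap
cannot run between the check and the increment in this model, which is
the race the locking rule is meant to exclude.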

The hibernation subsystem is a bit different, so two special wrappers are here:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
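
The exclusivity rule can be illustrated with a small userspace sketch.
The hib_* names and the three-state slot array are assumptions for
illustration only; the one detail taken from the change itself is that
an offset of 0 signals allocation failure and that extents are freed
one slot at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
#define HIB_NR_SLOTS 8

enum hib_slot_state { HIB_FREE, HIB_ORDINARY, HIB_EXCLUSIVE };

static enum hib_slot_state hib_slots[HIB_NR_SLOTS];

/*
 * Allocate one slot for hibernation use only.
 * Returns the slot offset, or 0 on failure (offset 0 is never handed out).
 */
static int hib_alloc_slot(void)
{
	for (int off = 1; off < HIB_NR_SLOTS; off++) {
		if (hib_slots[off] == HIB_FREE) {
			hib_slots[off] = HIB_EXCLUSIVE;
			return off;
		}
	}
	return 0;
}

/* Free one slot; it must have come from hib_alloc_slot(). */
static void hib_free_slot(int off)
{
	assert(off > 0 && off < HIB_NR_SLOTS);
	assert(hib_slots[off] == HIB_EXCLUSIVE);
	hib_slots[off] = HIB_FREE;
}

/* Free an inclusive extent of hibernation slots, one slot at a time. */
static void hib_free_extent(int start, int end)
{
	for (int off = start; off <= end; off++)
		hib_free_slot(off);
}
```

Keeping a separate state for hibernation slots lets the free path
assert that it never touches a slot owned by the ordinary swap
workflow.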

By separating the workflows, it will be possible to bind folios more
tightly to the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change.

[kasong@tencent.com: fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi]
  Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
  Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com


Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent de85024b
+1 −1
@@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
		dec_mm_counter(mm, MM_SWAPENTS);
	else if (softleaf_is_migration(entry))
		dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry)));
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
}

/**
+1 −1
@@ -682,7 +682,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)

		dec_mm_counter(mm, mm_counter(folio));
	}
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
}

void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
+25 −33
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
}

extern void si_swapinfo(struct sysinfo *);
int folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern int swap_duplicate_nr(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -471,6 +465,29 @@ struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);

+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects
+ * the swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
static inline void put_swap_device(struct swap_info_struct *si)
{
	percpu_ref_put(&si->users);
@@ -498,10 +515,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
	release_pages((pages), (nr));

-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
static inline void free_swap_cache(struct folio *folio)
{
}
@@ -511,12 +524,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
	return 0;
}

-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
{
	return 0;
}

-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
{
}

@@ -539,11 +552,6 @@ static inline int swp_swapcount(swp_entry_t entry)
	return 0;
}

-static inline int folio_alloc_swap(struct folio *folio)
-{
-	return -EINVAL;
-}
-
static inline bool folio_free_swap(struct folio *folio)
{
	return false;
@@ -556,22 +564,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
	return -EINVAL;
}
#endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
-	return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
-	free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
-	swap_free_nr(entry, 1);
-}
-
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
+6 −4
@@ -174,10 +174,10 @@ sector_t alloc_swapdev_block(int swap)
	 * Allocate a swap page and register that it has been allocated, so that
	 * it can be freed in case of an error.
	 */
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(swap_alloc_hibernation_slot(swap));
	if (offset) {
		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_hibernation_slot(swp_entry(swap, offset));
		else
			return swapdev_block(swap, offset);
	}
@@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)

void free_all_swap_pages(int swap)
{
+	unsigned long offset;
	struct rb_node *node;

	/*
@@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)

		ext = rb_entry(node, struct swsusp_extent, node);
		rb_erase(node, &swsusp_extents);
-		swap_free_nr(swp_entry(swap, ext->start),
-			     ext->end - ext->start + 1);
+
+		for (offset = ext->start; offset <= ext->end; offset++)
+			swap_free_hibernation_slot(swp_entry(swap, offset));

		kfree(ext);
	}
+1 −1
@@ -692,7 +692,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
				max_nr = (end - addr) / PAGE_SIZE;
				nr = swap_pte_batch(pte, max_nr, ptent);
				nr_swap -= nr;
-				free_swap_and_cache_nr(entry, nr);
+				swap_put_entries_direct(entry, nr);
				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
			} else if (softleaf_is_hwpoison(entry) ||
				   softleaf_is_poison_marker(entry)) {