Commit b9120619 authored by Vlastimil Babka

Merge series "SLUB percpu sheaves"

This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). The plan is therefore that it will be later enabled for
all kmem caches and replace the complicated cpu (partial) slabs code.

Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays magazines like the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".

This caching may seem similar to the arrays we had in SLAB, but there
are some important differences:

- deals differently with NUMA locality of freed objects, thus there are
  no per-node "shared" arrays (with possible lock contention) and no
  "alien" arrays that would need periodical flushing
  - instead, freeing remote objects (which is rare) bypasses the sheaves
  - percpu sheaves thus contain only local objects (modulo rare races
    and local node exhaustion)
  - NUMA restricted allocations and strict_numa mode are still honoured
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)
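
The NUMA handling of frees described in the list above can be sketched as
a small userspace model. This is only an illustration under stated
assumptions: model_obj, model_sheaf, model_free and MODEL_CAPACITY are
hypothetical names that do not appear in the series, and the model
ignores the spare sheaf and the barn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAPACITY 4	/* hypothetical sheaf capacity */

/* Hypothetical stand-in for an object that knows its home NUMA node. */
struct model_obj {
	int node;
};

/* Hypothetical percpu sheaf: a bounded array of object pointers. */
struct model_sheaf {
	size_t size;
	struct model_obj *objs[MODEL_CAPACITY];
};

/*
 * Returns true if the object was cached in the local sheaf, false if the
 * caller has to free it directly to its slab (remote object or full sheaf).
 */
bool model_free(struct model_sheaf *sheaf, int local_node,
		struct model_obj *obj)
{
	assert(sheaf->size <= MODEL_CAPACITY);

	if (obj->node != local_node)
		return false;	/* remote object: bypass the sheaf */
	if (sheaf->size == MODEL_CAPACITY)
		return false;	/* full: this model has no spare sheaf or barn */

	sheaf->objs[sheaf->size++] = obj;
	return true;
}
```

In this model the sheaf only ever holds objects of the local node, which
is why no "alien" array needs periodic flushing.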

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. Thanks to local_trylock(), the allocation fast
  path becomes a preempt_disable() with no atomic operations, instead
  of a local double cmpxchg. The same applies to freeing, which
  otherwise needs a local double cmpxchg only for short term
  allocations (where the same slab is still active on the same cpu when
  the object is freed) and a more costly locked double cmpxchg
  otherwise.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.
  Note: this path is currently limited to !PREEMPT_RT

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but far fewer
  allocations are actually needed on average, borrowing and returning a
  sheaf is much more efficient than a bulk allocation for the worst case
  followed by a bulk free of the many unused objects. Maple tree write
  operations should benefit from this.

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. The same
  thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
  reusing the (inefficient) paths for requests exceeding the cache's
  sheaf_capacity. This is in line with the existing approach where
  debugging bypasses the fast paths and SLUB_TINY prefers memory
  savings over performance.
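
The kfree_rcu() batching from the list above can be modelled in
userspace roughly as follows. This is a sketch, not the kernel code:
model_rcu_sheaf, model_stats, model_kfree_rcu and MODEL_CAPACITY are
hypothetical names, and incrementing a counter stands in for submitting
the whole sheaf to call_rcu().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CAPACITY 3	/* hypothetical sheaf capacity */

/* Hypothetical sheaf collecting objects that await a grace period. */
struct model_rcu_sheaf {
	size_t size;
	void *objs[MODEL_CAPACITY];
};

/* Counts what was handed over to the modelled call_rcu(). */
struct model_stats {
	unsigned int submitted_sheaves;
	unsigned int submitted_objects;
};

/*
 * Queue one object. Only when the sheaf becomes full is the whole batch
 * submitted at once, standing in for a single call_rcu() on the sheaf.
 */
void model_kfree_rcu(struct model_rcu_sheaf *sheaf, struct model_stats *st,
		     void *obj)
{
	assert(sheaf->size < MODEL_CAPACITY);

	sheaf->objs[sheaf->size++] = obj;
	if (sheaf->size == MODEL_CAPACITY) {
		st->submitted_sheaves++;
		st->submitted_objects += (unsigned int)sheaf->size;
		sheaf->size = 0;	/* model: a fresh empty sheaf takes over */
	}
}
```

With a capacity of 3, seven queued objects result in two submissions
and one object still waiting in the sheaf.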

The above is adapted from the cover letter [1], which also contains
in-kernel microbenchmark results showing the lower overhead of sheaves.

Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark
also show improvements.

Results from Sudarsan Mahendran [3] using will-it-scale show both
benefits and regressions, probably due to overall noisiness of those
tests.

Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1]
Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2]
Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
parents f7381b91 719a42e5
+6 −3
@@ -17,7 +17,10 @@ typedef struct {

/* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
typedef struct {
-	local_lock_t	llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
	u8		acquired;
} local_trylock_t;

@@ -31,7 +34,7 @@ typedef struct {
	.owner = NULL,

# define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
-	.llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+	LOCAL_LOCK_DEBUG_INIT(lockname)

static inline void local_lock_acquire(local_lock_t *l)
{
@@ -81,7 +84,7 @@ do { \
	local_lock_debug_init(lock);				\
} while (0)

-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)

#define __spinlock_nested_bh_init(lock)				\
do {								\
+5 −1
@@ -442,7 +442,9 @@ struct ma_state {
	struct maple_enode *node;	/* The node containing this entry */
	unsigned long min;		/* The minimum index of this node - implied pivot min */
	unsigned long max;		/* The maximum index of this node - implied pivot max */
-	struct maple_alloc *alloc;	/* Allocated nodes for this operation */
+	struct slab_sheaf *sheaf;	/* Allocated nodes for this operation */
+	struct maple_node *alloc;	/* A single allocated node for fast path writes */
+	unsigned long node_request;	/* The number of nodes to allocate for this operation */
	enum maple_status status;	/* The status of the state (active, start, none, etc) */
	unsigned char depth;		/* depth of tree descent during write */
	unsigned char offset;
@@ -490,7 +492,9 @@ struct ma_wr_state {
		.status = ma_start,					\
		.min = 0,						\
		.max = ULONG_MAX,					\
+		.sheaf = NULL,						\
		.alloc = NULL,						\
+		.node_request = 0,					\
		.mas_flags = 0,						\
		.store_type = wr_invalid,				\
	}
+47 −0
@@ -335,6 +335,37 @@ struct kmem_cache_args {
	 * %NULL means no constructor.
	 */
	void (*ctor)(void *);
	/**
	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
	 *
	 * With a non-zero value, allocations from the cache go through caching
	 * arrays called sheaves. Each cpu has a main sheaf that's always
	 * present, and a spare sheaf that may not be present. When both become
	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
	 * from the per-node barn.
	 *
	 * When no full sheaf is available, and gfp flags allow blocking, a
	 * sheaf is allocated and filled from slab(s) using bulk allocation.
	 * Otherwise the allocation falls back to the normal operation
	 * allocating a single object from a slab.
	 *
	 * Analogously, when freeing and both percpu sheaves are full, the barn
	 * may replace one of them with an empty sheaf, unless it's over
	 * capacity. In that case a sheaf is bulk freed to slab pages.
	 *
	 * The sheaves do not enforce NUMA placement of objects, so allocations
	 * via kmem_cache_alloc_node() with a node specified other than
	 * NUMA_NO_NODE will bypass them.
	 *
	 * Bulk allocation and free operations also try to use the cpu sheaves
	 * and barn, but fall back to using slab pages directly.
	 *
	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
	 * is ignored.
	 *
	 * %0 means no sheaves will be created.
	 */
	unsigned int sheaf_capacity;
};

struct kmem_cache *__kmem_cache_create_args(const char *name,
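
The allocation flow documented for @sheaf_capacity above (main sheaf,
optional spare sheaf, per-node barn, then slab fallback) can be sketched
as a userspace model. All names (model_sheaf, model_barn, model_cpu,
model_alloc) are hypothetical, objects are plain integers, and swapping
main and spare when only the main sheaf is empty is an assumption of
this model rather than something stated in the comment.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CAPACITY 2	/* hypothetical sheaf capacity */
#define MODEL_BARN_MAX 4	/* hypothetical barn limit per sheaf kind */

struct model_sheaf {
	size_t size;
	int objs[MODEL_CAPACITY];	/* object ids instead of pointers */
};

/* Hypothetical per-node barn holding full and empty sheaves. */
struct model_barn {
	size_t nr_full;
	struct model_sheaf *full[MODEL_BARN_MAX];
	size_t nr_empty;
	struct model_sheaf *empty[MODEL_BARN_MAX];
};

struct model_cpu {
	struct model_sheaf *main_sheaf;		/* always present */
	struct model_sheaf *spare_sheaf;	/* may be NULL */
};

/* Returns an object id, or -1 when the caller must fall back to a slab. */
int model_alloc(struct model_cpu *cpu, struct model_barn *barn)
{
	assert(cpu->main_sheaf != NULL);

	if (cpu->main_sheaf->size == 0) {
		if (cpu->spare_sheaf && cpu->spare_sheaf->size > 0) {
			/* Main is empty but spare is not: swap them. */
			struct model_sheaf *tmp = cpu->main_sheaf;

			cpu->main_sheaf = cpu->spare_sheaf;
			cpu->spare_sheaf = tmp;
		} else if (barn->nr_full > 0 &&
			   barn->nr_empty < MODEL_BARN_MAX) {
			/* Both empty: trade the empty main for a full sheaf. */
			barn->empty[barn->nr_empty++] = cpu->main_sheaf;
			cpu->main_sheaf = barn->full[--barn->nr_full];
		} else {
			return -1;	/* no full sheaf: slab fallback */
		}
	}
	return cpu->main_sheaf->objs[--cpu->main_sheaf->size];
}
```

Only the barn exchange touches shared per-node state in this model. The
common case is a decrement of the main sheaf's size.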
@@ -798,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
				   int node) __assume_slab_alignment __malloc;
#define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))

struct slab_sheaf *
kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);

int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
		struct slab_sheaf **sheafp, unsigned int size);

void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
				       struct slab_sheaf *sheaf);

void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
#define kmem_cache_alloc_from_sheaf(...)	\
			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))

unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);

/*
 * These macros allow declaring a kmem_buckets * parameter alongside size, which
 * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
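
The borrow-and-return pattern behind the prefilled sheaf declarations
above can be illustrated with a userspace mock. The model_* functions
are hypothetical stand-ins that loosely mirror
kmem_cache_prefill_sheaf(), kmem_cache_alloc_from_sheaf(),
kmem_cache_sheaf_size() and kmem_cache_return_sheaf(). They use
malloc()/free() and take simplified parameters, so this shows the
calling pattern only, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct slab_sheaf. */
struct model_sheaf {
	unsigned int size;	/* objects still available */
	void **objs;
};

/* Mock of returning a sheaf: unused objects are released in one step. */
void model_return_sheaf(struct model_sheaf *sheaf)
{
	assert(sheaf != NULL);

	while (sheaf->size > 0)
		free(sheaf->objs[--sheaf->size]);
	free(sheaf->objs);
	free(sheaf);
}

/* Mock of prefilling: the only step that may fail, done up front. */
struct model_sheaf *model_prefill_sheaf(unsigned int count, size_t obj_size)
{
	struct model_sheaf *sheaf = malloc(sizeof(*sheaf));

	if (!sheaf)
		return NULL;
	sheaf->size = 0;
	sheaf->objs = calloc(count ? count : 1, sizeof(*sheaf->objs));
	if (!sheaf->objs) {
		free(sheaf);
		return NULL;
	}
	while (sheaf->size < count) {
		void *obj = malloc(obj_size);

		if (!obj) {
			model_return_sheaf(sheaf);
			return NULL;
		}
		sheaf->objs[sheaf->size++] = obj;
	}
	return sheaf;
}

unsigned int model_sheaf_size(const struct model_sheaf *sheaf)
{
	return sheaf->size;
}

/* Mock of allocating from the sheaf: no blocking, NULL once empty. */
void *model_alloc_from_sheaf(struct model_sheaf *sheaf)
{
	if (sheaf->size == 0)
		return NULL;
	return sheaf->objs[--sheaf->size];
}
```

A caller would prefill for the worst case before entering the
restricted context, take only the objects it needs, and return the
sheaf afterwards, so the unused objects are handed back in one call
rather than bulk freed one by one.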
+115 −552

File changed.

Preview size limit exceeded, changes collapsed.

+0 −137
@@ -2746,139 +2746,6 @@ static noinline void __init check_fuzzer(struct maple_tree *mt)
	mtree_test_erase(mt, ULONG_MAX - 10);
}

/* duplicate the tree with a specific gap */
static noinline void __init check_dup_gaps(struct maple_tree *mt,
				    unsigned long nr_entries, bool zero_start,
				    unsigned long gap)
{
	unsigned long i = 0;
	struct maple_tree newmt;
	int ret;
	void *tmp;
	MA_STATE(mas, mt, 0, 0);
	MA_STATE(newmas, &newmt, 0, 0);
	struct rw_semaphore newmt_lock;

	init_rwsem(&newmt_lock);
	mt_set_external_lock(&newmt, &newmt_lock);

	if (!zero_start)
		i = 1;

	mt_zero_nr_tallocated();
	for (; i <= nr_entries; i++)
		mtree_store_range(mt, i*10, (i+1)*10 - gap,
				  xa_mk_value(i), GFP_KERNEL);

	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
	mt_set_non_kernel(99999);
	down_write(&newmt_lock);
	ret = mas_expected_entries(&newmas, nr_entries);
	mt_set_non_kernel(0);
	MT_BUG_ON(mt, ret != 0);

	rcu_read_lock();
	mas_for_each(&mas, tmp, ULONG_MAX) {
		newmas.index = mas.index;
		newmas.last = mas.last;
		mas_store(&newmas, tmp);
	}
	rcu_read_unlock();
	mas_destroy(&newmas);

	__mt_destroy(&newmt);
	up_write(&newmt_lock);
}

/* Duplicate many sizes of trees.  Mainly to test expected entry values */
static noinline void __init check_dup(struct maple_tree *mt)
{
	int i;
	int big_start = 100010;

	/* Check with a value at zero */
	for (i = 10; i < 1000; i++) {
		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
		check_dup_gaps(mt, i, true, 5);
		mtree_destroy(mt);
		rcu_barrier();
	}

	cond_resched();
	mt_cache_shrink();
	/* Check with a value at zero, no gap */
	for (i = 1000; i < 2000; i++) {
		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
		check_dup_gaps(mt, i, true, 0);
		mtree_destroy(mt);
		rcu_barrier();
	}

	cond_resched();
	mt_cache_shrink();
	/* Check with a value at zero and unreasonably large */
	for (i = big_start; i < big_start + 10; i++) {
		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
		check_dup_gaps(mt, i, true, 5);
		mtree_destroy(mt);
		rcu_barrier();
	}

	cond_resched();
	mt_cache_shrink();
	/* Small to medium size not starting at zero*/
	for (i = 200; i < 1000; i++) {
		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
		check_dup_gaps(mt, i, false, 5);
		mtree_destroy(mt);
		rcu_barrier();
	}

	cond_resched();
	mt_cache_shrink();
	/* Unreasonably large not starting at zero*/
	for (i = big_start; i < big_start + 10; i++) {
		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
		check_dup_gaps(mt, i, false, 5);
		mtree_destroy(mt);
		rcu_barrier();
		cond_resched();
		mt_cache_shrink();
	}

	/* Check non-allocation tree not starting at zero */
	for (i = 1500; i < 3000; i++) {
		mt_init_flags(mt, 0);
		check_dup_gaps(mt, i, false, 5);
		mtree_destroy(mt);
		rcu_barrier();
		cond_resched();
		if (i % 2 == 0)
			mt_cache_shrink();
	}

	mt_cache_shrink();
	/* Check non-allocation tree starting at zero */
	for (i = 200; i < 1000; i++) {
		mt_init_flags(mt, 0);
		check_dup_gaps(mt, i, true, 5);
		mtree_destroy(mt);
		rcu_barrier();
		cond_resched();
	}

	mt_cache_shrink();
	/* Unreasonably large */
	for (i = big_start + 5; i < big_start + 10; i++) {
		mt_init_flags(mt, 0);
		check_dup_gaps(mt, i, true, 5);
		mtree_destroy(mt);
		rcu_barrier();
		mt_cache_shrink();
		cond_resched();
	}
}

static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
{
	int i = 50;
@@ -4077,10 +3944,6 @@ static int __init maple_tree_seed(void)
	check_fuzzer(&tree);
	mtree_destroy(&tree);

	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
	check_dup(&tree);
	mtree_destroy(&tree);

	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
	check_bnode_min_spanning(&tree);
	mtree_destroy(&tree);