Commit 7d709f49 authored by Gregory Price, committed by Andrew Morton

vmscan,cgroup: apply mems_effective to reclaim

It is possible for a reclaimer to cause demotions of an lruvec belonging
to a cgroup with cpuset.mems set to exclude some nodes.  Attempt to apply
this limitation based on the lruvec's memcg and prevent demotion.

Notably, this may still allow demotion of shared libraries or any memory
first instantiated in another cgroup.  This means cpusets still cannot
guarantee complete isolation when demotion is enabled, and the docs
have been updated to reflect this.

This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently - with the noted exceptions.
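
The check described above can be modeled in userspace roughly as follows.
This is only an illustrative sketch: `struct toy_memcg`, `toy_node_allowed()`
and `toy_can_demote()` are hypothetical names, and a plain 64-bit mask stands
in for the cpuset's effective memory nodes.  It is not the kernel's reclaim
code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memcg whose cgroup has a cpuset attached. */
struct toy_memcg {
	uint64_t effective_mems;	/* bit n set => node n is allowed */
};

/* Permissive default: no memcg means no restriction to apply. */
static bool toy_node_allowed(const struct toy_memcg *memcg, int nid)
{
	if (!memcg)
		return true;
	if (nid < 0 || nid >= 64)
		return false;
	return (memcg->effective_mems >> nid) & 1u;
}

/*
 * The reclaimer asks about the memcg that owns the lruvec, not about
 * the task that happens to be using the memory right now.
 */
static bool toy_can_demote(const struct toy_memcg *lruvec_owner, int target_nid)
{
	return toy_node_allowed(lruvec_owner, target_nid);
}
```

With node 0 as the fast tier and node 1 as the slow tier, a cgroup limited
to node 0 is not demoted to node 1 in this model.  Memory charged to a
different cgroup that allows node 1 (a shared library, for instance) is
still checked against that other cgroup's mask, which is the exception
noted above.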

Note on locking:

The cgroup_get_e_css reference protects the css that contains
cs->effective_mems, but callers of this interface remain subject to the
same race conditions as any non-atomic access to cs->effective_mems.

This interface therefore cannot make strong guarantees of correctness,
but in exchange it avoids taking a global lock or rcu_read_lock, which
helps performance.
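
The trade-off in this note can be illustrated with a small userspace
analogy built on C11 atomics.  Everything here is an assumption made for
illustration: `struct toy_cpuset`, `toy_tryget()`, `toy_put()` and
`toy_cs_node_allowed()` are hypothetical and do not reproduce the kernel's
css reference counting.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object: a refcount keeps it alive, the mask may change. */
struct toy_cpuset {
	atomic_int refcnt;
	atomic_ulong effective_mems;	/* bit n set => node n allowed */
};

/* Take a reference only if the object is still live (refcnt > 0). */
static bool toy_tryget(struct toy_cpuset *cs)
{
	int old = atomic_load(&cs->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&cs->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_put(struct toy_cpuset *cs)
{
	atomic_fetch_sub(&cs->refcnt, 1);
}

/*
 * Lockless query: the reference pins the object and one atomic load
 * reads the mask.  A writer may change the mask right after the load,
 * so the answer can be stale by the time the caller acts on it.
 */
static bool toy_cs_node_allowed(struct toy_cpuset *cs, unsigned int nid)
{
	bool allowed;

	if (nid >= sizeof(unsigned long) * CHAR_BIT)
		return false;
	if (!toy_tryget(cs))
		return true;	/* nothing to consult: stay permissive */

	allowed = (atomic_load(&cs->effective_mems) >> nid) & 1UL;
	toy_put(cs);
	return allowed;
}
```

The reference guarantees the object is not freed during the read, but it
does not serialize against updates of the mask.  That is the sense in which
the result is cheap to obtain yet only best-effort.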

Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net


Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent 8adce085
+10 −6
@@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim
		Allowing page migration during reclaim enables these
		systems to migrate pages from fast tiers to slow tiers
		when the fast tier is under pressure.  This migration
-		is performed before swap.  It may move data to a NUMA
-		node that does not fall into the cpuset of the
-		allocating process which might be construed to violate
-		the guarantees of cpusets.  This should not be enabled
-		on systems which need strict cpuset location
-		guarantees.
+		is performed before swap if an eligible NUMA node is
+		present in cpuset.mems for the cgroup (or if cpuset v1
+		is being used). If cpuset.mems changes at runtime, it
+		may move data to a NUMA node that does not fall into the
+		cpuset of the new cpuset.mems, which might be construed
+		to violate the guarantees of cpusets.  Shared memory,
+		such as libraries, owned by another cgroup may still be
+		demoted and result in memory use on a node not present
+		in cpuset.mems. This should not be enabled on systems
+		which need strict cpuset location guarantees.
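
To reason about which nodes count as eligible, it can help to test a node
id against a cpuset node list from userspace.  cgroup v2 reports the
effective set in `cpuset.mems.effective` as a list such as `0-1,3`.  The
helper below is a minimal sketch that parses such a string;
`nodelist_contains()` is a hypothetical name, and reading the file itself
is left out.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if @nid appears in @list, a node list such as "0-1,3".
 * An optional trailing newline is tolerated; malformed input yields false.
 */
static bool nodelist_contains(const char *list, long nid)
{
	const char *p = list;

	while (*p && *p != '\n') {
		char *end;
		long lo, hi;

		lo = strtol(p, &end, 10);
		if (end == p)
			return false;	/* no digits where a number was expected */
		hi = lo;
		p = end;
		if (*p == '-') {	/* range "lo-hi" */
			p++;
			hi = strtol(p, &end, 10);
			if (end == p)
				return false;
			p = end;
		}
		if (nid >= lo && nid <= hi)
			return true;
		if (*p == ',')
			p++;
		else if (*p && *p != '\n')
			return false;	/* unexpected separator */
	}
	return false;
}
```

For example, `nodelist_contains("0-1,3", 2)` is false, so under the
behavior described in this hunk node 2 would not be an eligible demotion
target for a cgroup whose effective node list is `0-1,3`.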
+5 −0
@@ -173,6 +173,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
	task_unlock(current);
}

extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
#else /* !CONFIG_CPUSETS */

static inline bool cpusets_enabled(void) { return false; }
@@ -293,6 +294,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
	return false;
}

static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
{
	return true;
}
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
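
The header change above follows a common pattern: declare the real
function when the feature is built in, and supply a permissive inline stub
otherwise, so callers need no `#ifdef`.  A standalone sketch of that
pattern, with `TOY_CPUSETS` as a hypothetical build switch and all function
names invented for illustration:

```c
#include <stdbool.h>

#ifdef TOY_CPUSETS
/* Feature built in: the real implementation lives in another file. */
bool toy_feature_node_allowed(unsigned long mask, int nid);
#else
/* Feature compiled out: allow every node so callers stay unchanged. */
static inline bool toy_feature_node_allowed(unsigned long mask, int nid)
{
	(void)mask;
	(void)nid;
	return true;
}
#endif

/* The caller is identical in both configurations. */
static bool toy_may_place(unsigned long mask, int nid)
{
	return toy_feature_node_allowed(mask, nid);
}
```

Built without `TOY_CPUSETS`, `toy_may_place()` always returns true, which
matches the stub in this hunk returning true when cpusets are not
configured.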
+7 −0
@@ -1736,6 +1736,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
	rcu_read_unlock();
}

bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);

#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1797,6 +1799,11 @@ static inline ino_t page_cgroup_ino(struct page *page)
{
	return 0;
}

static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
{
	return true;
}
#endif /* CONFIG_MEMCG */

#if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
+36 −0
@@ -4237,6 +4237,42 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
	return allowed;
}

bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
{
	struct cgroup_subsys_state *css;
	struct cpuset *cs;
	bool allowed;

	/*
	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
	 * and mems_allowed is likely to be empty even if we could get to it,
	 * so return true to avoid taking a global lock on the empty check.
	 */
	if (!cpuset_v2())
		return true;

	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
	if (!css)
		return true;

	/*
	 * Normally, accessing effective_mems would require the cpuset_mutex
	 * or callback_lock - but node_isset is atomic and the reference
	 * taken via cgroup_get_e_css is sufficient to protect css.
	 *
	 * Since this interface is intended for use by migration paths, we
	 * relax locking here to avoid taking global locks - while accepting
	 * there may be rare scenarios where the result may be inaccurate.
	 *
	 * Reclaim and migration are subject to these same race conditions, and
	 * cannot make strong isolation guarantees, so this is acceptable.
	 */
	cs = container_of(css, struct cpuset, css);
	allowed = node_isset(nid, cs->effective_mems);
	css_put(css);
	return allowed;
}

/**
 * cpuset_spread_node() - On which node to begin search for a page
 * @rotor: round robin rotor
+6 −0
@@ -29,6 +29,7 @@
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/cpuset.h>
#include <linux/sched/mm.h>
#include <linux/shmem_fs.h>
#include <linux/hugetlb.h>
@@ -5523,3 +5524,8 @@ static int __init mem_cgroup_swap_init(void)
subsys_initcall(mem_cgroup_swap_init);

#endif /* CONFIG_SWAP */

bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
{
	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
}