Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "The kthread code provides an infrastructure which manages the
  preferred affinity of unbound kthreads (node or custom cpumask)
  against housekeeping (CPU isolation) constraints and CPU hotplug
  events.

  One crucial missing piece is the handling of cpuset: when an isolated
  partition is created, deleted, or its CPUs updated, all the unbound
  kthreads in the top cpuset become indifferently affine to _all_ the
  non-isolated CPUs, possibly breaking their preferred affinity along
  the way.

  Solve this by moving the kthreads affinity update from cpuset to the
  consolidated kthread affinity code, so that preferred affinities are
  honoured and applied against the updated cpuset isolated partitions.

  The dispatch of the new isolated cpumasks to timers, workqueues and
  kthreads is performed by housekeeping, as per Tejun's nice
  suggestion.

  As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
  from boot defined domain isolation (through isolcpus=) and cpuset
  isolated partitions. Housekeeping cpumasks are now modifiable with a
  specific RCU based synchronization. A big step toward making
  nohz_full= also mutable through cpuset in the future"

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
  doc: Add housekeeping documentation
  kthread: Document kthread_affine_preferred()
  kthread: Comment on the purpose and placement of kthread_affine_node() call
  kthread: Honour kthreads preferred affinity after cpuset changes
  sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
  sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
  kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
  kthread: Include kthreadd to the managed affinity list
  kthread: Include unbound kthreads in the managed affinity list
  kthread: Refine naming of affinity related fields
  PCI: Remove superfluous HK_TYPE_WQ check
  sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
  cpuset: Remove cpuset_cpu_is_isolated()
  timers/migration: Remove superfluous cpuset isolation test
  cpuset: Propagate cpuset isolation update to timers through housekeeping
  cpuset: Propagate cpuset isolation update to workqueue through housekeeping
  PCI: Flush PCI probe workqueue on cpuset isolated partition change
  sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
  sched/isolation: Flush memcg workqueues on cpuset isolated partition change
  cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
  ...
This commit is contained in:
Linus Torvalds
2026-02-09 19:57:30 -08:00
28 changed files with 554 additions and 222 deletions


@@ -0,0 +1,111 @@
======================================
Housekeeping
======================================

CPU isolation moves away kernel work that may otherwise run on any CPU.
The purpose of its related features is to reduce the OS jitter that some
extreme workloads cannot tolerate, such as some DPDK use cases.

The kernel work moved away by CPU isolation is commonly described as
"housekeeping" because it includes ground work that performs cleanups,
statistics maintenance and actions relying on them, memory release,
various deferrals, etc.

Sometimes housekeeping is just some unbound work (unbound workqueues,
unbound timers, ...) that is easily assigned to non-isolated CPUs.
But sometimes housekeeping is tied to a specific CPU and requires
elaborate tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote
scheduler tick, etc.).

Thus, a housekeeping CPU can be considered the reverse of an isolated
CPU. It is simply a CPU that can execute housekeeping work. There must
always be at least one online housekeeping CPU at any time. The CPUs that
are not isolated are automatically assigned as housekeeping.

Housekeeping is currently divided into four features described
by ``enum hk_type``:

1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain
   isolation performed through the ``isolcpus=domain`` boot parameter or
   isolated cpuset partitions in cgroup v2. This includes scheduler
   load balancing, unbound workqueues and timers.

2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation
   performed through the ``nohz_full=`` or ``isolcpus=nohz`` boot
   parameters. This includes remote scheduler tick, vmstat and lockup
   watchdog.

3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed
   IRQ isolation performed through ``isolcpus=managed_irq``.

4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain
   isolation performed through ``isolcpus=domain`` only. It is similar
   to HK_TYPE_DOMAIN except it ignores the isolation performed by
   cpusets.

Housekeeping cpumasks
=================================

Housekeeping cpumasks include the CPUs that can execute the work moved
away by the matching isolation feature. These cpumasks are returned by
the following function::

    const struct cpumask *housekeeping_cpumask(enum hk_type type)

By default, if neither ``nohz_full=``, nor ``isolcpus=``, nor cpuset's
isolated partitions are used, which covers most use cases, this function
returns ``cpu_possible_mask``.

Otherwise the function returns the cpumask complement of the isolation
feature. For example:

With ``isolcpus=domain,7`` the following will return a mask with all
possible CPUs except 7::

    housekeeping_cpumask(HK_TYPE_DOMAIN)

Similarly with ``nohz_full=5,6`` the following will return a mask with all
possible CPUs except 5,6::

    housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
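
The complement relationship described above can be illustrated with a
small userspace model. This is only a sketch and not kernel code: the
``cpu_bits_t`` type and the ``model_*`` helpers are hypothetical names
introduced for illustration, and the model assumes at most 64 CPUs
represented as one bit each.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_bits_t;

/* Bits [0, nr_cpus) set; stands in for cpu_possible_mask in this model. */
static inline cpu_bits_t model_possible_mask(unsigned int nr_cpus)
{
	assert(nr_cpus >= 1 && nr_cpus <= 64);
	return nr_cpus == 64 ? ~(cpu_bits_t)0 : (((cpu_bits_t)1 << nr_cpus) - 1);
}

/* Housekeeping mask = possible CPUs minus the isolated CPUs. */
static inline cpu_bits_t model_housekeeping_mask(cpu_bits_t possible,
						 cpu_bits_t isolated)
{
	return possible & ~isolated;
}

/* Non-zero if @cpu is part of the housekeeping mask @hk. */
static inline int model_cpu_is_housekeeping(cpu_bits_t hk, unsigned int cpu)
{
	return cpu < 64 && ((hk >> cpu) & 1);
}
```

On a modeled 8-CPU system, isolating CPU 7 leaves CPUs 0-6 as
housekeeping, and isolating CPUs 5 and 6 leaves CPUs 0-4 and 7.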

Synchronization against cpusets
=================================

Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating,
modifying or deleting an isolated partition.

The users of the HK_TYPE_DOMAIN cpumask must then synchronize properly
against cpuset in order to make sure that:

1. The cpumask snapshot stays coherent.

2. No housekeeping work is queued on a newly made isolated CPU.

3. Pending housekeeping work that was queued to a non-isolated
   CPU which just turned isolated through cpuset must be flushed
   before the related created/modified isolated partition is made
   available to userspace.

This synchronization is maintained by an RCU-based scheme. The cpuset update
side waits for an RCU grace period after updating the HK_TYPE_DOMAIN
cpumask and before flushing pending work. On the read side, care must be
taken to perform both the housekeeping target selection and the work
enqueue within the same RCU read-side critical section.

A typical layout example would look like this on the update side
(``housekeeping_update()``)::

    rcu_assign_pointer(housekeeping_cpumasks[type], trial);
    synchronize_rcu();
    flush_workqueue(example_workqueue);

And then on the read side::

    rcu_read_lock();
    cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);
    queue_work_on(cpu, example_workqueue, work);
    rcu_read_unlock();
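
The ordering on the update side (publish the new mask, wait, then flush)
can be pictured with a single-threaded userspace model. This is a sketch
under strong assumptions and not kernel code: ``struct hk_model`` and the
``model_*`` functions are hypothetical, there is no real RCU involved, and
the flush step only drains the CPUs that just became isolated, whereas
``flush_workqueue()`` waits for the whole workqueue.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Hypothetical model state: a housekeeping mask and per-CPU pending work. */
struct hk_model {
	uint64_t hk_mask;                    /* one bit per housekeeping CPU */
	unsigned int pending[MODEL_NR_CPUS]; /* queued work items per CPU */
};

/*
 * Read side: select a housekeeping CPU and enqueue on it as one step,
 * mirroring the single RCU read-side critical section in the text above.
 * Returns the chosen CPU, or -1 if the mask is empty.
 */
static inline int model_queue_work(struct hk_model *m)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if ((m->hk_mask >> cpu) & 1) {
			m->pending[cpu]++;
			return cpu;
		}
	}
	return -1;
}

/*
 * Update side: publish the new mask first, then drain the work that was
 * queued on CPUs which just turned isolated. Returns the number of
 * drained work items.
 */
static inline unsigned int model_housekeeping_update(struct hk_model *m,
						     uint64_t new_mask)
{
	uint64_t newly_isolated = m->hk_mask & ~new_mask;
	unsigned int flushed = 0;

	/* At least one housekeeping CPU must always remain. */
	assert(new_mask != 0);

	m->hk_mask = new_mask; /* stands in for rcu_assign_pointer() */
	/* A grace period wait would go here; this model has no concurrent readers. */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if ((newly_isolated >> cpu) & 1) {
			flushed += m->pending[cpu];
			m->pending[cpu] = 0;
		}
	}
	return flushed;
}
```

In this model, work queued on CPU 0 before CPU 0 is isolated gets drained
by the update, and any later enqueue lands on the next housekeeping CPU.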


@@ -25,6 +25,7 @@ it.
symbol-namespaces
asm-annotations
real-time/index
housekeeping.rst
Data structures and low-level utilities
=======================================