Commit 5bdb4078 authored by Linus Torvalds
Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork

   Multiple BPF schedulers can be attached to cgroups and the dispatch
   path is made hierarchical. This involves substantial restructuring of
   the core dispatch, bypass, watchdog, and dump paths to be
   per-scheduler, along with new infrastructure for scheduler ownership
   enforcement, lifecycle management, and cgroup subtree iteration

   The enqueue path is not yet updated and will follow in a later cycle

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs

   Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
   to local DSQs either run immediately or get reenqueued back through
   ops.enqueue(), giving schedulers tighter control over queueing
   latency

   Also useful for opportunistic CPU sharing across sub-schedulers

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from ops.select_cpu()

   Fixed to guarantee exactly one ops.dequeue() call when a task leaves
   BPF scheduler custody

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement

 - Idle SMT sibling prioritization in the idle CPU selection path

 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
parents 7de6b4a2 7e311baf
+187 −18
@@ -93,6 +93,55 @@ scheduler has been loaded):
    # cat /sys/kernel/sched_ext/enable_seq
    1

Each running scheduler also exposes a per-scheduler ``events`` file under
``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic
counters. Each counter occupies one ``name value`` line:

.. code-block:: none

    # cat /sys/kernel/sched_ext/simple/events
    SCX_EV_SELECT_CPU_FALLBACK 0
    SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0
    SCX_EV_DISPATCH_KEEP_LAST 123
    SCX_EV_ENQ_SKIP_EXITING 0
    SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0
    SCX_EV_REENQ_IMMED 0
    SCX_EV_REENQ_LOCAL_REPEAT 0
    SCX_EV_REFILL_SLICE_DFL 456789
    SCX_EV_BYPASS_DURATION 0
    SCX_EV_BYPASS_DISPATCH 0
    SCX_EV_BYPASS_ACTIVATE 0
    SCX_EV_INSERT_NOT_OWNED 0
    SCX_EV_SUB_BYPASS_DISPATCH 0

The counters are described in ``kernel/sched/ext_internal.h``; briefly:

* ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by
  the task and the core scheduler silently picked a fallback CPU.
* ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected
  to the global DSQ because the target CPU went offline.
* ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other
  task was available (only when ``SCX_OPS_ENQ_LAST`` is not set).
* ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ
  directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING`` is not set).
* ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was
  dispatched to its local DSQ directly (only when
  ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set).
* ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was
  re-enqueued because the target CPU was not available for immediate execution.
* ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered
  another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ``
  handling in the BPF scheduler.
* ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the
  default value (``SCX_SLICE_DFL``).
* ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode.
* ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode.
* ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated.
* ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this
  scheduler into a DSQ; such attempts are silently ignored.
* ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass
  DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``).
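
The ``name value`` layout described above is simple to consume from a monitoring tool. The following is a minimal user-space sketch, assuming the file contents have already been read into a NUL-terminated buffer; ``scx_events_lookup()`` is a hypothetical helper name used only for illustration and is not part of the kernel or of any sched_ext tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text laid out as "NAME VALUE" lines, the format
 * shown for the per-scheduler events file. Returns 0 and stores the value
 * in *out on success, -1 if the counter is absent or its value is malformed.
 *
 * scx_events_lookup() is a hypothetical helper, not a kernel or libbpf API.
 */
static int scx_events_lookup(const char *text, const char *name,
			     unsigned long long *out)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		const char *next = strchr(line, '\n');

		/* Require the full counter name followed by one space. */
		if (!strncmp(line, name, name_len) && line[name_len] == ' ') {
			if (sscanf(line + name_len + 1, "%llu", out) == 1)
				return 0;
			return -1;
		}
		/* Advance to the next line, or stop at end of buffer. */
		line = next ? next + 1 : NULL;
	}
	return -1;
}
```

A caller would read ``/sys/kernel/sched_ext/<scheduler-name>/events`` into a buffer and pass it as ``text``. Matching the whole name plus the separating space avoids treating one counter name as a prefix of another.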

``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
detailed information:

@@ -228,16 +277,23 @@ The following briefly shows how a waking task is scheduled and executed.
   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
   ``ops.enqueue()`` callback to be skipped.

   Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.

   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
   by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.

   If the task is inserted into ``SCX_DSQ_LOCAL`` from
   ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
   is returned from ``ops.select_cpu()``. Additionally, inserting directly
   from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
   be skipped.

   Any other attempt to store a task in BPF-internal data structures from
   ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
   invoked. This is discouraged, as it can introduce racy behavior or
   inconsistent state.

2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
   can make one of the following decisions:
@@ -251,6 +307,61 @@ The following briefly shows how a waking task is scheduled and executed.

   * Queue the task on the BPF side.

   **Task State Tracking and ops.dequeue() Semantics**

   A task is in the "BPF scheduler's custody" when the BPF scheduler is
   responsible for managing its lifecycle. A task enters custody when it is
   dispatched to a user DSQ or stored in the BPF scheduler's internal data
   structures. Custody is entered only from ``ops.enqueue()`` for those
   operations. The only exception is dispatching to a user DSQ from
   ``ops.select_cpu()``: although the task is not yet technically in BPF
   scheduler custody at that point, the dispatch has the same semantic
   effect as dispatching from ``ops.enqueue()`` for custody-related
   purposes.

   Once ``ops.enqueue()`` is called, the task may or may not enter custody
   depending on what the scheduler does:

   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
     is done with the task - it either goes straight to a CPU's local run
     queue or to the global DSQ as a fallback. The task never enters (or
     exits) BPF custody, and ``ops.dequeue()`` will not be called.

   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
     BPF scheduler's custody. When the task later leaves BPF custody
     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
     sleep/property changes), ``ops.dequeue()`` will be called exactly
     once.

   * **Stored in BPF data structures** (e.g., internal BPF queues): the
     task is in BPF custody. ``ops.dequeue()`` will be called when it
     leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
     on property change / sleep).

   When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
   The dequeue can happen for different reasons, distinguished by flags:

   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
      execution), ``ops.dequeue()`` is triggered without any special flags.

   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
      core scheduling picks a task for execution while it's still in BPF
      custody, ``ops.dequeue()`` is called with the
      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.

   3. **Scheduling property change**: when a task property changes (via
      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
      priority changes, CPU migrations, etc.) while the task is still in
      BPF custody, ``ops.dequeue()`` is called with the
      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.

   **Important**: Once a task has left BPF custody (e.g., after being
   dispatched to a terminal DSQ), property changes will not trigger
   ``ops.dequeue()``, since the task is no longer managed by the BPF
   scheduler.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If
   empty, it then looks at the global DSQ. If there still isn't a task to
   run, ``ops.dispatch()`` is invoked which can use the following two
@@ -264,9 +375,9 @@ The following briefly shows how a waking task is scheduled and executed.
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
   * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local
     DSQ to the dispatching DSQ. This function cannot be called with any BPF
     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
     locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions
     tasks before trying to move from the specified DSQ.

4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
@@ -297,8 +408,8 @@ for more information.
Task Lifecycle
--------------

The following pseudo-code summarizes the entire lifecycle of a task managed
by a sched_ext scheduler:
The following pseudo-code presents a rough overview of the entire lifecycle
of a task managed by a sched_ext scheduler:

.. code-block:: c

@@ -311,30 +422,69 @@ by a sched_ext scheduler:

        ops.runnable();         /* Task becomes ready to run */

        while (task is runnable) {
            if (task is not in a DSQ && task->scx.slice == 0) {
        while (task_is_runnable(task)) {
            if (task is not in a DSQ || task->scx.slice == 0) {
                ops.enqueue();  /* Task can be added to a DSQ */

                /* Task property change (i.e., affinity, nice, etc.)? */
                if (sched_change(task)) {
                    ops.dequeue(); /* Exiting BPF scheduler custody */
                    ops.quiescent();

                    /* Property change callback, e.g. ops.set_weight() */

                    ops.runnable();
                    continue;
                }

                /* Any usable CPU becomes available */

                ops.dispatch();     /* Task is moved to a local DSQ */
                ops.dequeue();      /* Exiting BPF scheduler custody */
            }

            ops.running();      /* Task starts running on its assigned CPU */
            while (task->scx.slice > 0 && task is runnable)
                ops.tick();     /* Called every 1/HZ seconds */
            ops.stopping();     /* Task stops running (time slice expires or wait) */

            /* Task's CPU becomes available */
            while (task_is_runnable(task) && task->scx.slice > 0) {
                ops.tick();     /* Called every 1/HZ seconds */

                if (task->scx.slice == 0)
                    ops.dispatch(); /* task->scx.slice can be refilled */
            }

            ops.stopping();     /* Task stops running (time slice expires or wait) */
        }

        ops.quiescent();        /* Task releases its assigned CPU (wait) */
    }

    ops.disable();              /* Disable BPF scheduling for the task */
    ops.exit_task();            /* Task is destroyed */

Note that the above pseudo-code does not cover all possible state transitions
and edge cases, to name a few examples:

* ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing
  property change on that task, in which case ``ops.dispatch()`` will be
  retried.

* The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``,
  in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go
  straight to ``ops.running()``.

* Property changes may occur at virtually any point during the task's lifecycle,
  not just when the task is queued and waiting to be dispatched. For example,
  changing a property of a running task will lead to the callback sequence
  ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) ->
  ``ops.runnable()`` -> ``ops.running()``.

* A sched_ext task can be preempted by a task from a higher-priority scheduling
  class, in which case it will exit the tick-dispatch loop even though it is runnable
  and has a non-zero slice.

See the "Scheduling Cycle" section for a more detailed description of how
a freshly woken up task gets on a CPU.
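
The custody rules from the "Scheduling Cycle" section (terminal DSQs never enter custody, while user DSQs and BPF-side storage do, and ``ops.dequeue()`` fires exactly once when a task leaves) can be illustrated with a small user-space model. This is only a sketch of the documented behaviour, not kernel code: every ``model_*`` and ``MODEL_*`` name is hypothetical, and the counter merely stands in for ``ops.dequeue()`` invocations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the places a task can be put. */
enum model_dest {
	MODEL_DSQ_LOCAL,	/* models SCX_DSQ_LOCAL */
	MODEL_DSQ_LOCAL_ON,	/* models SCX_DSQ_LOCAL_ON | cpu */
	MODEL_DSQ_GLOBAL,	/* models SCX_DSQ_GLOBAL */
	MODEL_DSQ_USER,		/* models a user-created DSQ */
	MODEL_BPF_INTERNAL,	/* models a BPF-side data structure */
};

struct model_task {
	bool in_custody;		/* models "in BPF scheduler custody" */
	unsigned int nr_dequeue;	/* modeled ops.dequeue() calls */
};

/* The three terminal DSQs named in the documentation. */
static bool model_dest_is_terminal(enum model_dest dest)
{
	return dest == MODEL_DSQ_LOCAL || dest == MODEL_DSQ_LOCAL_ON ||
	       dest == MODEL_DSQ_GLOBAL;
}

/*
 * The task leaves custody (terminal dispatch, core-sched pick, sleep or
 * property change). The modeled ops.dequeue() runs only if the task was
 * actually in custody, so repeated calls never count twice.
 */
static void model_leave_custody(struct model_task *t)
{
	if (!t->in_custody)
		return;
	t->in_custody = false;
	t->nr_dequeue++;
}

/* Place the task somewhere, as ops.enqueue() or ops.dispatch() would. */
static void model_place(struct model_task *t, enum model_dest dest)
{
	if (model_dest_is_terminal(dest))
		model_leave_custody(t);
	else
		t->in_custody = true;
}
```

In this model a direct dispatch to a terminal DSQ leaves ``nr_dequeue`` untouched, while a task that passes through a user DSQ before reaching a terminal DSQ increments it exactly once.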

Where to Look
=============

@@ -377,6 +527,25 @@ Where to Look
    scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order;
    all others are scheduled in user space by a simple vruntime scheduler.

Module Parameters
=================

sched_ext exposes two module parameters under the ``sched_ext.`` prefix that
control bypass-mode behaviour. These knobs are primarily for debugging; there
is usually no reason to change them during normal operation. They can be read
and written at runtime (mode 0600) via
``/sys/module/sched_ext/parameters/``.

``sched_ext.slice_bypass_us`` (default: 5000 µs)
    The time slice assigned to all tasks when the scheduler is in bypass mode,
    i.e. during BPF scheduler load, unload, and error recovery. Valid range is
    100 µs to 100 ms.

``sched_ext.bypass_lb_intv_us`` (default: 500000 µs)
    The interval at which the bypass-mode load balancer redistributes tasks
    across CPUs. Set to 0 to disable load balancing during bypass mode. Valid
    range is 0 to 10 s.
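
The documented ranges can be expressed in microseconds as simple checks. The sketch below is illustrative user-space code, not the kernel's parameter handling: the ``model_*`` helper names are hypothetical, and treating both ends of each range as inclusive is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define USEC_PER_MSEC	1000ULL
#define USEC_PER_SEC	1000000ULL

/* slice_bypass_us: documented range 100 us to 100 ms (assumed inclusive). */
static bool model_slice_bypass_us_valid(unsigned long long us)
{
	return us >= 100 && us <= 100 * USEC_PER_MSEC;
}

/* bypass_lb_intv_us: documented range 0 to 10 s; 0 disables balancing. */
static bool model_bypass_lb_intv_us_valid(unsigned long long us)
{
	return us <= 10 * USEC_PER_SEC;
}
```

Both documented defaults (5000 µs and 500000 µs) fall inside these ranges.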

ABI Instability
===============

+4 −0
@@ -17,6 +17,7 @@
#include <linux/refcount.h>
#include <linux/percpu-refcount.h>
#include <linux/percpu-rwsem.h>
#include <linux/sched.h>
#include <linux/u64_stats_sync.h>
#include <linux/workqueue.h>
#include <linux/bpf-cgroup-defs.h>
@@ -628,6 +629,9 @@ struct cgroup {
#ifdef CONFIG_BPF_SYSCALL
	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
#endif
#ifdef CONFIG_EXT_SUB_SCHED
	struct scx_sched __rcu *scx_sched;
#endif

	/* All ancestors including self */
	union {
+64 −45
@@ -62,6 +62,16 @@ enum scx_dsq_id_flags {
	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
};

struct scx_deferred_reenq_user {
	struct list_head	node;
	u64			flags;
};

struct scx_dsq_pcpu {
	struct scx_dispatch_q	*dsq;
	struct scx_deferred_reenq_user deferred_reenq_user;
};

/*
 * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered
 * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to
@@ -78,30 +88,58 @@ struct scx_dispatch_q {
	u64			id;
	struct rhash_head	hash_node;
	struct llist_node	free_node;
	struct scx_sched	*sched;
	struct scx_dsq_pcpu __percpu *pcpu;
	struct rcu_head		rcu;
};

/* scx_entity.flags */
/* sched_ext_entity.flags */
enum scx_ent_flags {
	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
	SCX_TASK_IN_CUSTODY	= 1 << 1, /* in custody, needs ops.dequeue() when leaving */
	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
	SCX_TASK_SUB_INIT	= 1 << 4, /* task being initialized for a sub sched */
	SCX_TASK_IMMED		= 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */

	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
	/*
	 * Bits 8 and 9 are used to carry task state:
	 *
	 * NONE		ops.init_task() not called yet
	 * INIT		ops.init_task() succeeded, but task can be cancelled
	 * READY	fully initialized, but not in sched_ext
	 * ENABLED	fully initialized and in sched_ext
	 */
	SCX_TASK_STATE_SHIFT	= 8,	  /* bits 8 and 9 are used to carry task state */
	SCX_TASK_STATE_BITS	= 2,
	SCX_TASK_STATE_MASK	= ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,

	SCX_TASK_CURSOR		= 1 << 31, /* iteration cursor, not a task */
};
	SCX_TASK_NONE		= 0 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_INIT		= 1 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_READY		= 2 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_ENABLED	= 3 << SCX_TASK_STATE_SHIFT,

/* scx_entity.flags & SCX_TASK_STATE_MASK */
enum scx_task_state {
	SCX_TASK_NONE,		/* ops.init_task() not called yet */
	SCX_TASK_INIT,		/* ops.init_task() succeeded, but task can be cancelled */
	SCX_TASK_READY,		/* fully initialized, but not in sched_ext */
	SCX_TASK_ENABLED,	/* fully initialized and in sched_ext */
	/*
	 * Bits 12 and 13 are used to carry reenqueue reason. In addition to
	 * %SCX_ENQ_REENQ flag, ops.enqueue() can also test for
	 * %SCX_TASK_REENQ_REASON_NONE to distinguish reenqueues.
	 *
	 * NONE		not being reenqueued
	 * KFUNC	reenqueued by scx_bpf_dsq_reenq() and friends
	 * IMMED	reenqueued due to failed ENQ_IMMED
	 * PREEMPTED	preempted while running
	 */
	SCX_TASK_REENQ_REASON_SHIFT = 12,
	SCX_TASK_REENQ_REASON_BITS = 2,
	SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT,

	SCX_TASK_REENQ_NONE	= 0 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_KFUNC	= 1 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_IMMED	= 2 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT,

	SCX_TASK_NR_STATES,
	/* iteration cursor, not a task */
	SCX_TASK_CURSOR		= 1 << 31,
};
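
The new layout packs two multi-bit fields into the flags word: the task state in bits 8 and 9, and the reenqueue reason in bits 12 and 13. The following user-space sketch copies the shift, width and value definitions from the enum above to show how the two fields can be extracted and updated independently. The ``model_*`` helpers are hypothetical and are not the accessors used by the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the enum scx_ent_flags definitions in this diff. */
enum {
	SCX_TASK_QUEUED		= 1 << 0,

	SCX_TASK_STATE_SHIFT	= 8,
	SCX_TASK_STATE_BITS	= 2,
	SCX_TASK_STATE_MASK	= ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,

	SCX_TASK_NONE		= 0 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_INIT		= 1 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_READY		= 2 << SCX_TASK_STATE_SHIFT,
	SCX_TASK_ENABLED	= 3 << SCX_TASK_STATE_SHIFT,

	SCX_TASK_REENQ_REASON_SHIFT = 12,
	SCX_TASK_REENQ_REASON_BITS = 2,
	SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT,

	SCX_TASK_REENQ_NONE	= 0 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_KFUNC	= 1 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_IMMED	= 2 << SCX_TASK_REENQ_REASON_SHIFT,
	SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT,
};

/* Hypothetical helper: isolate the state field (still shifted in place). */
static uint32_t model_task_state(uint32_t flags)
{
	return flags & SCX_TASK_STATE_MASK;
}

/* Hypothetical helper: isolate the reenqueue reason field. */
static uint32_t model_reenq_reason(uint32_t flags)
{
	return flags & SCX_TASK_REENQ_REASON_MASK;
}

/* Hypothetical helper: replace the state field, keeping all other bits. */
static uint32_t model_set_task_state(uint32_t flags, uint32_t state)
{
	return (flags & ~(uint32_t)SCX_TASK_STATE_MASK) |
	       (state & SCX_TASK_STATE_MASK);
}
```

Because the state and reason constants are defined pre-shifted, a masked field can be compared directly against them, for example ``model_reenq_reason(flags) == SCX_TASK_REENQ_IMMED``.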

/* scx_entity.dsq_flags */
@@ -109,33 +147,6 @@ enum scx_ent_dsq_flags {
	SCX_TASK_DSQ_ON_PRIQ	= 1 << 0, /* task is queued on the priority queue of a dsq */
};

/*
 * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
 * everywhere and the following bits track which kfunc sets are currently
 * allowed for %current. This simple per-task tracking works because SCX ops
 * nest in a limited way. BPF will likely implement a way to allow and disallow
 * kfuncs depending on the calling context which will replace this manual
 * mechanism. See scx_kf_allow().
 */
enum scx_kf_mask {
	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
	/*
	 * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and
	 * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be
	 * nested inside DISPATCH.
	 */
	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
	SCX_KF_REST		= 1 << 4, /* other rq-locked operations */

	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
};

enum scx_dsq_lnode_flags {
	SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,

@@ -149,19 +160,31 @@ struct scx_dsq_list_node {
	u32			priv;		/* can be used by iter cursor */
};

#define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv)				\
#define INIT_DSQ_LIST_CURSOR(__cursor, __dsq, __flags)				\
	(struct scx_dsq_list_node) {						\
		.node = LIST_HEAD_INIT((__node).node),				\
		.node = LIST_HEAD_INIT((__cursor).node),			\
		.flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags),			\
		.priv = (__priv),						\
		.priv = READ_ONCE((__dsq)->seq),				\
	}

struct scx_sched;

/*
 * The following is embedded in task_struct and contains all fields necessary
 * for a task to be scheduled by SCX.
 */
struct sched_ext_entity {
#ifdef CONFIG_CGROUPS
	/*
	 * Associated scx_sched. Updated either during fork or while holding
	 * both p->pi_lock and rq lock.
	 */
	struct scx_sched __rcu	*sched;
#endif
	struct scx_dispatch_q	*dsq;
	atomic_long_t		ops_state;
	u64			ddsp_dsq_id;
	u64			ddsp_enq_flags;
	struct scx_dsq_list_node dsq_list;	/* dispatch order */
	struct rb_node		dsq_priq;	/* p->scx.dsq_vtime order */
	u32			dsq_seq;
@@ -171,9 +194,7 @@ struct sched_ext_entity {
	s32			sticky_cpu;
	s32			holding_cpu;
	s32			selected_cpu;
	u32			kf_mask;	/* see scx_kf_mask above */
	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
	atomic_long_t		ops_state;

	struct list_head	runnable_node;	/* rq->scx.runnable_list */
	unsigned long		runnable_at;
@@ -181,8 +202,6 @@ struct sched_ext_entity {
#ifdef CONFIG_SCHED_CORE
	u64			core_sched_at;	/* see scx_prio_less() */
#endif
	u64			ddsp_dsq_id;
	u64			ddsp_enq_flags;

	/* BPF scheduler modifiable fields */

+4 −0
@@ -1190,6 +1190,10 @@ config EXT_GROUP_SCHED

endif #CGROUP_SCHED

config EXT_SUB_SCHED
        def_bool y
        depends on SCHED_CLASS_EXT && CGROUPS

config SCHED_MM_CID
	def_bool y
	depends on SMP && RSEQ
+5 −1
@@ -2514,8 +2514,12 @@ __latent_entropy struct task_struct *copy_process(
		fd_install(pidfd, pidfile);

	proc_fork_connector(p);
	sched_post_fork(p);
	/*
	 * sched_ext needs @p to be associated with its cgroup in its post_fork
	 * hook. cgroup_post_fork() should come before sched_post_fork().
	 */
	cgroup_post_fork(p, args);
	sched_post_fork(p);
	perf_event_fork(p);

	trace_task_newtask(p, clone_flags);