Commit 3e816361 authored by Andrii Nakryiko, committed by Ingo Molnar

sched/tracepoints: Move and extend the sched_process_exit() tracepoint



It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization reasons when dealing with short-lived processes that
exit in the middle of a profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm(), which resets
current->mm to NULL, making this tracepoint unsuitable for inspecting
and recording a task's mm_struct-related data when tracing process
lifetimes.

There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a similar kind of
accounting that some applications do with BPF, and so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.
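
To illustrate the ordering problem described above, here is a minimal
user-space sketch. It is a model only, not kernel code: struct mm_model,
struct task_model, exit_hook(), exit_mm_model() and do_exit_model() are
all hypothetical stand-ins invented for this illustration. The hook
plays the role of a tracepoint consumer that wants VMA data at exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; NOT the real definitions. */
struct mm_model {
	int vma_count;
};

struct task_model {
	struct mm_model *mm;
	int vmas_seen_by_hook;	/* what the hook recorded; -1 means "no mm" */
};

/* Models a tracepoint consumer that wants mm-related data at exit. */
void exit_hook(struct task_model *t)
{
	t->vmas_seen_by_hook = t->mm ? t->mm->vma_count : -1;
}

/* Models the effect of exit_mm() that matters here: mm becomes NULL. */
void exit_mm_model(struct task_model *t)
{
	t->mm = NULL;
}

/*
 * Models the relevant slice of do_exit(). With hook_before_teardown set,
 * the hook runs before the mm is dropped (the placement this patch picks);
 * otherwise it runs afterwards (the previous placement).
 */
void do_exit_model(struct task_model *t, bool hook_before_teardown)
{
	if (hook_before_teardown)
		exit_hook(t);

	exit_mm_model(t);

	if (!hook_before_teardown)
		exit_hook(t);
}
```

Calling do_exit_model() with hook_before_teardown set to false leaves
vmas_seen_by_hook at -1, because the hook only ever observes a NULL mm.
With it set to true, the hook sees the VMA count before teardown.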

Also, the existing trace_sched_process_exit() tracepoint is notoriously
missing the `group_dead` flag, which is certainly useful in practice;
some of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.

Note that we can't use sched_process_template anymore, and so we use
a TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in a backwards-compatible way.
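
The backwards-compatibility argument can be checked with a small
user-space sketch. The two structs below are hypothetical mirrors of the
event payload fields only. They omit the common trace entry header that
precedes the fields in a real record, and TASK_COMM_LEN is restated here
as 16 as an assumption. The point is that appending a field leaves the
offsets of all earlier fields unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>	/* pid_t */

#define TASK_COMM_LEN 16	/* assumption: restated for this sketch */

/* Hypothetical mirror of the sched_process_template payload fields. */
struct template_payload {
	char	comm[TASK_COMM_LEN];
	pid_t	pid;
	int	prio;
};

/* Hypothetical mirror of the new sched_process_exit payload fields. */
struct exit_payload {
	char	comm[TASK_COMM_LEN];
	pid_t	pid;
	int	prio;
	bool	group_dead;	/* the one field appended at the end */
};

/*
 * Returns true if every field shared with the template sits at the same
 * offset, and the new field starts only after the last shared field.
 */
bool layouts_compatible(void)
{
	return offsetof(struct exit_payload, comm) ==
	       offsetof(struct template_payload, comm) &&
	       offsetof(struct exit_payload, pid) ==
	       offsetof(struct template_payload, pid) &&
	       offsetof(struct exit_payload, prio) ==
	       offsetof(struct template_payload, prio) &&
	       offsetof(struct exit_payload, group_dead) >=
	       offsetof(struct template_payload, prio) + sizeof(int);
}
```

A consumer that reads comm, pid and prio at the offsets it used before
keeps working under this model, since only the tail of the record grows.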

Document the dependency on sched_process_template anyway.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250402180925.90914-1-andrii@kernel.org
parent a2cc6ff5
+30 −4
@@ -326,11 +326,37 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 	     TP_ARGS(p));
 
 /*
- * Tracepoint for a task exiting:
+ * Tracepoint for a task exiting.
+ * Note, it's a superset of sched_process_template and should be kept
+ * compatible as much as possible. sched_process_exits has an extra
+ * `group_dead` argument, so sched_process_template can't be used,
+ * unfortunately, just like sched_migrate_task above.
  */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
-	     TP_PROTO(struct task_struct *p),
-	     TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+	TP_PROTO(struct task_struct *p, bool group_dead),
+
+	TP_ARGS(p, group_dead),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	prio			)
+		__field(	bool,	group_dead		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
+		__entry->group_dead	= group_dead;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+		  __entry->comm, __entry->pid, __entry->prio,
+		  __entry->group_dead ? "true" : "false"
+	)
+);
 
 /*
  * Tracepoint for waiting on task to unschedule:
+1 −1
@@ -936,12 +936,12 @@ void __noreturn do_exit(long code)

 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	trace_sched_process_exit(tsk, group_dead);
 
 	exit_mm();
 
 	if (group_dead)
 		acct_process();
-	trace_sched_process_exit(tsk);
 
 	exit_sem(tsk);
 	exit_shm(tsk);