Commit 7d7a103d authored by Linus Torvalds

Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:
 "Features:

   - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
     socket option

     SO_PEERPIDFD is a socket option that retrieves a pidfd for the
     process that called connect() or listen(). It is heavily used to
     safely authenticate clients in userspace, avoiding security bugs
     due to pid recycling races (dbus, polkit, systemd, etc.)

     SO_PEERPIDFD currently doesn't support handing out pidfds if the
     sk->sk_peer_pid thread-group leader has already been reaped. In
     this case it currently returns EINVAL. Userspace still wants to get
     a pidfd for a reaped process to have a stable handle it can pass
     on. This is especially useful now that it is possible to retrieve
     exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
     PIDFD_INFO_EXIT flag

     Another summary has been provided by David Rheinsberg:

      > A pidfd can outlive the task it refers to, and thus user-space
      > must already be prepared that the task underlying a pidfd is
      > gone at the time they get their hands on the pidfd. For
      > instance, resolving the pidfd to a PID via the fdinfo must be
      > prepared to read `-1`.
      >
      > Despite user-space knowing that a pidfd might be stale, several
      > kernel APIs currently add another layer that checks for this. In
      > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
      > already reaped, but returns a stale pidfd if the task is reaped
      > immediately after the respective alive-check.
      >
      > This has the unfortunate effect that user-space now has two ways
      > to check for the exact same scenario: A syscall might return
      > EINVAL/ESRCH/... *or* the pidfd might be stale, even though
      > there is no particular reason to distinguish both cases. This
      > also propagates through user-space APIs, which pass on pidfds.
      > They must be prepared to pass on `-1` *or* the pidfd, because
      > there is no guaranteed way to get a stale pidfd from the kernel.
      >
      > Userspace must already deal with a pidfd referring to a reaped
      > task as the task may exit and get reaped at any time while there
      > are still many pidfds referring to it

     In order to hand out pidfds for reaped tasks, SO_PEERPIDFD needs to
     ensure that PIDFD_INFO_EXIT information is available whenever it
     creates a pidfd for a reaped task. The uapi promises that reaped
     pidfds are only handed out if it is guaranteed that the caller sees
     the exit information:

     TEST_F(pidfd_info, success_reaped)
     {
             struct pidfd_info info = {
                     .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
             };

             /*
              * Process has already been reaped and PIDFD_INFO_EXIT been set.
              * Verify that we can retrieve the exit status of the process.
              */
             ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
             ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
             ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
             ASSERT_TRUE(WIFEXITED(info.exit_code));
             ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
     }

     To hand out pidfds for reaped processes we thus allocate a pidfs
     entry for the relevant sk->sk_peer_pid at the time the
     sk->sk_peer_pid is stashed and drop it when the socket is
     destroyed. This guarantees that exit information will always be
     recorded for the sk->sk_peer_pid task and we can hand out pidfds
     for reaped processes

   - Hand a pidfd to the coredump usermode helper process

     Give userspace a way to instruct the kernel to install a pidfd for
     the crashing process into the process started as a usermode helper.
     There are still tricky race windows that userspace cannot close
     easily, or sometimes cannot close at all. There are various
     workarounds, such as comparing process start times to make sure
     that the usermode helper process was started after the crashing
     process, but they are all very brittle and fraught with peril

     The crashed-but-not-reaped process can be killed by userspace
     before coredump processing programs like systemd-coredump have had
     time to manually open a PIDFD from the PID the kernel provides
     them, which means they can be tricked into reading from an
     arbitrary process, and they run with full privileges as they are
     usermode helper processes

     Even if that specific race window didn't exist, letting the kernel
     provide the pidfd directly would still be safer and cleaner than
     requiring userspace to open it manually. In parallel with this
     commit we already have systemd adding support for this in [1]

     When the usermode helper process is forked we install a pidfd as
     file descriptor three in the usermode helper's file descriptor
     table so it's available to the exec'd program

     Since usermode helpers are either children of the system_unbound_wq
     workqueue or kthreadd we know that the file descriptor table is
     empty and can thus always use three as the file descriptor number

     Note that we'll install a pidfd for the thread-group leader even if
     a subthread is calling do_coredump(). We know that task linkage
     hasn't been removed yet, and even if @current isn't the actual
     thread-group leader we know that the thread-group leader cannot be
     reaped until @current has exited

   - Allow telling a task that has not been found apart from finding
     the wrong task when creating a pidfd

     We currently report EINVAL whenever a struct pid has no task
     attached anymore, thereby conflating two concepts:

      (1) The task has already been reaped

      (2) The caller requested a pidfd for a thread-group leader but the
          pid actually references a struct pid that isn't used as a
          thread-group leader

     This is causing issues for non-threaded workloads, which expect
     ESRCH to be reported, not EINVAL

     So allow userspace to reliably distinguish between (1) and (2)

   - Make it possible to detect when a pidfs entry would outlive the
     struct pid it pinned

   - Add a range of new selftests

  Cleanups:

   - Remove unneeded NULL check from pidfd_prepare() for passed struct
     pid

   - Avoid pointless reference count bump during release_task()

  Fixes:

   - Various fixes to the pidfd and coredump selftests

   - Fix error handling for replace_fd() when spawning coredump usermode
     helper"
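
As a rough illustration of the SO_PEERPIDFD behaviour described in the message above, a userspace caller might fetch the peer's pidfd as sketched below. This is not code from the series: peer_pidfd() and peer_pidfd_demo() are hypothetical helper names, and the fallback value 77 for SO_PEERPIDFD is the asm-generic definition, which a few architectures override.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* assumption: asm-generic value, some arches differ */
#endif

/* Return a new pidfd for the peer of @sock, or -errno on failure. */
static int peer_pidfd(int sock)
{
	int pidfd = -1;
	socklen_t len = sizeof(pidfd);

	if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
		return -errno;
	return pidfd;
}

/*
 * Hypothetical demo: both ends of a socketpair record the creating
 * process as their peer, so the pidfd refers to the caller itself.
 * Returns 0 on success or -errno (for example on kernels that do not
 * implement SO_PEERPIDFD).
 */
static int peer_pidfd_demo(void)
{
	int sv[2];
	int pidfd;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -errno;

	pidfd = peer_pidfd(sv[0]);
	close(sv[0]);
	close(sv[1]);
	if (pidfd < 0)
		return pidfd;

	close(pidfd);
	return 0;
}
```

With the change described above, a caller like this would no longer need a separate EINVAL path for an already reaped peer: it gets a pidfd either way and discovers that the task is gone when it uses the pidfd.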

* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: detect refcount bugs
  coredump: hand a pidfd to the usermode coredump helper
  coredump: fix error handling for replace_fd()
  pidfs: move O_RDWR into pidfs_alloc_file()
  selftests: coredump: Raise timeout to 2 minutes
  selftests: coredump: Fix test failure for slow machines
  selftests: coredump: Properly initialize pointer
  net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
  pidfs: get rid of __pidfd_prepare()
  net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
  pidfs: register pid in pidfs
  net, pidfd: report EINVAL for ESRCH
  release_task: kill the no longer needed get/put_pid(thread_pid)
  pidfs: ensure consistent ENOENT/ESRCH reporting
  exit: move wake_up_all() pidfd waiters into __unhash_process()
  selftest/pidfd: add test for thread-group leader pidfd open for thread
  pidfd: improve uapi when task isn't found
  pidfd: remove unneeded NULL check from pidfd_prepare()
  selftests/pidfd: adapt to recent changes
parents 2ca35346 db56723c
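
The error-code split from the "improve uapi when task isn't found" item can be pictured with a small classifier. This is a sketch under an assumption: the message only says that callers expect ESRCH for a missing task and that cases (1) and (2) become distinguishable. That case (2) is reported as ENOENT is inferred from the shortlog entry "pidfs: ensure consistent ENOENT/ESRCH reporting" and is not spelled out on this page. The enum and function names are hypothetical.

```c
#include <errno.h>

enum open_failure {
	OPEN_FAIL_NONE,       /* no error */
	OPEN_FAIL_NO_TASK,    /* ESRCH: no task, e.g. it was already reaped */
	OPEN_FAIL_NOT_LEADER, /* assumed ENOENT: pid is not a thread-group leader */
	OPEN_FAIL_BAD_ARGS,   /* EINVAL: left for genuinely invalid arguments */
	OPEN_FAIL_OTHER,      /* anything else (EMFILE, ENOMEM, ...) */
};

/* Map a positive errno value from pidfd creation to a failure class. */
static enum open_failure classify_pidfd_error(int err)
{
	switch (err) {
	case 0:
		return OPEN_FAIL_NONE;
	case ESRCH:
		return OPEN_FAIL_NO_TASK;
	case ENOENT:
		return OPEN_FAIL_NOT_LEADER;
	case EINVAL:
		return OPEN_FAIL_BAD_ARGS;
	default:
		return OPEN_FAIL_OTHER;
	}
}
```

A caller would pass errno to this function after a failed pidfd_open(). The distinction is inherently racy, since a task can be reaped right after a successful call.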
+59 −6
@@ -43,6 +43,8 @@
 #include <linux/timekeeping.h>
 #include <linux/sysctl.h>
 #include <linux/elf.h>
+#include <linux/pidfs.h>
+#include <uapi/linux/pidfd.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -60,6 +62,12 @@ static void free_vma_snapshot(struct coredump_params *cprm);
 #define CORE_FILE_NOTE_SIZE_DEFAULT (4*1024*1024)
 /* Define a reasonable max cap */
 #define CORE_FILE_NOTE_SIZE_MAX (16*1024*1024)
+/*
+ * File descriptor number for the pidfd for the thread-group leader of
+ * the coredumping task installed into the usermode helper's file
+ * descriptor table.
+ */
+#define COREDUMP_PIDFD_NUMBER 3
 
 static int core_uses_pid;
 static unsigned int core_pipe_limit;
@@ -339,6 +347,27 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 			case 'C':
 				err = cn_printf(cn, "%d", cprm->cpu);
 				break;
+			/* pidfd number */
+			case 'F': {
+				/*
+				 * Installing a pidfd only makes sense if
+				 * we actually spawn a usermode helper.
+				 */
+				if (!ispipe)
+					break;
+
+				/*
+				 * Note that we'll install a pidfd for the
+				 * thread-group leader. We know that task
+				 * linkage hasn't been removed yet and even if
+				 * this @current isn't the actual thread-group
+				 * leader we know that the thread-group leader
+				 * cannot be reaped until @current has exited.
+				 */
+				cprm->pid = task_tgid(current);
+				err = cn_printf(cn, "%d", COREDUMP_PIDFD_NUMBER);
+				break;
+			}
 			default:
 				break;
 			}
@@ -493,7 +522,7 @@ static void wait_for_dump_helpers(struct file *file)
 }
 
 /*
- * umh_pipe_setup
+ * umh_coredump_setup
  * helper function to customize the process used
  * to collect the core in userspace.  Specifically
  * it sets up a pipe and installs it as fd 0 (stdin)
@@ -503,11 +532,32 @@ static void wait_for_dump_helpers(struct file *file)
  * is a special value that we use to trap recursive
  * core dumps
  */
-static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 {
 	struct file *files[2];
 	struct coredump_params *cp = (struct coredump_params *)info->data;
-	int err = create_pipe_files(files, 0);
+	int err;
+
+	if (cp->pid) {
+		struct file *pidfs_file __free(fput) = NULL;
+
+		pidfs_file = pidfs_alloc_file(cp->pid, 0);
+		if (IS_ERR(pidfs_file))
+			return PTR_ERR(pidfs_file);
+
+		/*
+		 * Usermode helpers are childen of either
+		 * system_unbound_wq or of kthreadd. So we know that
+		 * we're starting off with a clean file descriptor
+		 * table. So we should always be able to use
+		 * COREDUMP_PIDFD_NUMBER as our file descriptor value.
+		 */
+		err = replace_fd(COREDUMP_PIDFD_NUMBER, pidfs_file, 0);
+		if (err < 0)
+			return err;
+	}
+
+	err = create_pipe_files(files, 0);
 	if (err)
 		return err;
 
@@ -515,10 +565,13 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 
 	err = replace_fd(0, files[0], 0);
 	fput(files[0]);
+	if (err < 0)
+		return err;
+
 	/* and disallow core files too */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
 
-	return err;
+	return 0;
 }
 
 void do_coredump(const kernel_siginfo_t *siginfo)
@@ -593,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		}
 
 		if (cprm.limit == 1) {
-			/* See umh_pipe_setup() which sets RLIMIT_CORE = 1.
+			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
 			 *
 			 * Normally core limits are irrelevant to pipes, since
 			 * we're not writing to the file system, but we use
@@ -632,7 +685,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		retval = -ENOMEM;
 		sub_info = call_usermodehelper_setup(helper_argv[0],
 						helper_argv, NULL, GFP_KERNEL,
-						umh_pipe_setup, NULL, &cprm);
+						umh_coredump_setup, NULL, &cprm);
 		if (sub_info)
 			retval = call_usermodehelper_exec(sub_info,
 							  UMH_WAIT_EXEC);
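
On the receiving side, the 'F' specifier added above expands to the decimal fd number in the helper's command line, so a helper has to turn that argument back into a descriptor. The following is a minimal sketch of that step. It assumes a core_pattern along the lines of "|/usr/local/bin/core-helper %F" (a hypothetical path) and a hypothetical function name parse_fd_arg(). None of it comes from the series itself.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the argument produced by the 'F' specifier. Accepts only a
 * plain non-negative decimal number that fits in an int. Returns the
 * fd number, or -1 on malformed input.
 */
static int parse_fd_arg(const char *arg)
{
	char *end;
	long val;

	/* Reject NULL, empty strings, signs and leading whitespace. */
	if (!arg || *arg < '0' || *arg > '9')
		return -1;

	errno = 0;
	val = strtol(arg, &end, 10);
	if (errno || *end != '\0' || val > INT_MAX)
		return -1;

	return (int)val;
}
```

A helper would call parse_fd_arg(argv[1]) and then use the resulting pidfd instead of reopening the crashing process by PID, which is the race the message above describes. Parsing the argument rather than hard-coding 3 keeps the helper independent of the value of COREDUMP_PIDFD_NUMBER.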
+73 −9
@@ -746,7 +746,7 @@ static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path,
 {
 	enum pid_type type;
 
-	if (flags & PIDFD_CLONE)
+	if (flags & PIDFD_STALE)
 		return true;
 
 	/*
@@ -755,10 +755,14 @@ static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path,
 	 * pidfd has been allocated perform another check that the pid
 	 * is still alive. If it is exit information is available even
 	 * if the task gets reaped before the pidfd is returned to
-	 * userspace. The only exception is PIDFD_CLONE where no task
-	 * linkage has been established for @pid yet and the kernel is
-	 * in the middle of process creation so there's nothing for
-	 * pidfs to miss.
+	 * userspace. The only exception are indicated by PIDFD_STALE:
+	 *
+	 * (1) The kernel is in the middle of task creation and thus no
+	 *     task linkage has been established yet.
+	 * (2) The caller knows @pid has been registered in pidfs at a
+	 *     time when the task was still alive.
+	 *
+	 * In both cases exit information will have been reported.
 	 */
 	if (flags & PIDFD_THREAD)
 		type = PIDTYPE_PID;
@@ -852,11 +856,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	int ret;
 
 	/*
-	 * Ensure that PIDFD_CLONE can be passed as a flag without
+	 * Ensure that PIDFD_STALE can be passed as a flag without
 	 * overloading other uapi pidfd flags.
 	 */
-	BUILD_BUG_ON(PIDFD_CLONE == PIDFD_THREAD);
-	BUILD_BUG_ON(PIDFD_CLONE == PIDFD_NONBLOCK);
+	BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
+	BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
 	if (ret < 0)
@@ -865,7 +869,8 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	if (!pidfs_pid_valid(pid, &path, flags))
 		return ERR_PTR(-ESRCH);
 
-	flags &= ~PIDFD_CLONE;
+	flags &= ~PIDFD_STALE;
+	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
 	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
 	if (!IS_ERR(pidfd_file))
@@ -874,6 +879,65 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	return pidfd_file;
 }
 
+/**
+ * pidfs_register_pid - register a struct pid in pidfs
+ * @pid: pid to pin
+ *
+ * Register a struct pid in pidfs. Needs to be paired with
+ * pidfs_put_pid() to not risk leaking the pidfs dentry and inode.
+ *
+ * Return: On success zero, on error a negative error code is returned.
+ */
+int pidfs_register_pid(struct pid *pid)
+{
+	struct path path __free(path_put) = {};
+	int ret;
+
+	might_sleep();
+
+	if (!pid)
+		return 0;
+
+	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
+	if (unlikely(ret))
+		return ret;
+	/* Keep the dentry and only put the reference to the mount. */
+	path.dentry = NULL;
+	return 0;
+}
+
+/**
+ * pidfs_get_pid - pin a struct pid through pidfs
+ * @pid: pid to pin
+ *
+ * Similar to pidfs_register_pid() but only valid if the caller knows
+ * there's a reference to the @pid through a dentry already that can't
+ * go away.
+ */
+void pidfs_get_pid(struct pid *pid)
+{
+	if (!pid)
+		return;
+	WARN_ON_ONCE(!stashed_dentry_get(&pid->stashed));
+}
+
+/**
+ * pidfs_put_pid - drop a pidfs reference
+ * @pid: pid to drop
+ *
+ * Drop a reference to @pid via pidfs. This is only safe if the
+ * reference has been taken via pidfs_get_pid().
+ */
+void pidfs_put_pid(struct pid *pid)
+{
+	might_sleep();
+
+	if (!pid)
+		return;
+	VFS_WARN_ON_ONCE(!pid->stashed);
+	dput(pid->stashed);
+}
+
 static void pidfs_inode_init_once(void *data)
 {
 	struct pidfs_inode *pi = data;
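
The "keep the dentry and only put the reference to the mount" step in pidfs_register_pid() relies on scope-based cleanup: the path is released automatically at scope exit, and clearing a member beforehand keeps that one reference alive. Below is a userspace sketch of the same idea using the GCC/Clang cleanup attribute. The SCOPED() macro, struct ref, struct toy_path and toy_register() are all hypothetical stand-ins, not kernel APIs.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's __free(): call fn(&var) at scope exit. */
#define SCOPED(fn) __attribute__((cleanup(fn)))

struct ref {
	int count;
};

static void ref_get(struct ref *r)
{
	r->count++;
}

/* Like dput()/mntput() in spirit: a NULL pointer is silently ignored. */
static void ref_put(struct ref *r)
{
	if (r)
		r->count--;
}

/* Toy analogue of struct path: one mount and one dentry reference. */
struct toy_path {
	struct ref *mnt;
	struct ref *dentry;
};

static void toy_path_put(struct toy_path *p)
{
	ref_put(p->dentry);
	ref_put(p->mnt);
}

/* Take both references, then deliberately keep only the dentry one. */
static void toy_register(struct ref *mnt, struct ref *dentry)
{
	struct toy_path path SCOPED(toy_path_put) = { NULL, NULL };

	ref_get(mnt);
	ref_get(dentry);
	path.mnt = mnt;
	path.dentry = dentry;

	/* Clearing the member hides it from toy_path_put() at scope exit. */
	path.dentry = NULL;
}
```

After toy_register() returns, the mount count is back to its starting value while the dentry count stays raised by one. That retained reference is the one a later put, the counterpart of pidfs_put_pid(), would have to drop.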
+1 −0
@@ -28,6 +28,7 @@ struct coredump_params {
 	int vma_count;
 	size_t vma_data_size;
 	struct core_vma_metadata *vma_meta;
+	struct pid *pid;
 };
 
 extern unsigned int core_file_note_size_limit;
+1 −1
@@ -77,7 +77,7 @@ struct file;
 struct pid *pidfd_pid(const struct file *file);
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags);
 struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags);
-int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret);
+int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret_file);
 void do_notify_pidfd(struct task_struct *task);
 
 static inline struct pid *get_pid(struct pid *pid)
+3 −0
@@ -8,5 +8,8 @@ void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
 void pidfs_exit(struct task_struct *tsk);
 extern const struct dentry_operations pidfs_dentry_operations;
+int pidfs_register_pid(struct pid *pid);
+void pidfs_get_pid(struct pid *pid);
+void pidfs_put_pid(struct pid *pid);
 
 #endif /* _LINUX_PID_FS_H */