Commit a96ef584 authored by Ryan Roberts's avatar Ryan Roberts Committed by Kees Cook
Browse files

randomize_kstack: Unify random source across arches



Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().

There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.

So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:

  - We can remove choose_random_kstack_offset(); That was only there to
    try to make the timestamp counter value a bit harder to influence
    from user space [*].

  - The architecture code is simplified. All it has to do now is call
    add_random_kstack_offset() in the syscall path.

  - The strength of the randomness can be reasoned about independently
    of the architecture.

  - Arches previously using get_random_u16() now have much faster
    syscall paths, see below results.

[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace which were *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.

In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.

There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.

Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.

Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark       | Result Class |  per-cpu-prng |  per-cpu-prng |
|                 |              | arm64 (metal) |   x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid  | mean (ns)    |    (I) -9.50% |   (I) -17.65% |
|                 | p99 (ns)     |   (I) -59.24% |   (I) -24.41% |
|                 | p99.9 (ns)   |   (I) -59.52% |   (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns)    |    (I) -9.52% |   (I) -19.24% |
|                 | p99 (ns)     |   (I) -59.25% |   (I) -25.03% |
|                 | p99.9 (ns)   |   (I) -59.50% |   (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns)    |   (I) -10.31% |   (I) -18.56% |
|                 | p99 (ns)     |   (I) -60.79% |   (I) -20.06% |
|                 | p99.9 (ns)   |   (I) -61.04% |   (I) -25.04% |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.

Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com


Signed-off-by: default avatarKees Cook <kees@kernel.org>
parent 37beb425
Loading
Loading
Loading
Loading
+2 −3
Original line number Diff line number Diff line
@@ -1519,9 +1519,8 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
	def_bool n
	help
	  An arch should select this symbol if it can support kernel stack
	  offset randomization with calls to add_random_kstack_offset()
	  during syscall entry and choose_random_kstack_offset() during
	  syscall exit. Careful removal of -fstack-protector-strong and
	  offset randomization with a call to add_random_kstack_offset()
	  during syscall entry. Careful removal of -fstack-protector-strong and
	  -fstack-protector should also be applied to the entry code and
	  closely examined, as the artificial stack bump looks like an array
	  to the compiler, so it will attempt to add canary checks regardless
+0 −11
Original line number Diff line number Diff line
@@ -52,17 +52,6 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
	}

	syscall_set_return_value(current, regs, 0, ret);

	/*
	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
	 * bits. The actual entropy will be further reduced by the compiler
	 * when applying stack alignment constraints: the AAPCS mandates a
	 * 16-byte aligned SP at function boundaries, which will remove the
	 * 4 low bits from any entropy chosen here.
	 *
	 * The resulting 6 bits of entropy is seen in SP[9:4].
	 */
	choose_random_kstack_offset(get_random_u16());
}

static inline bool has_syscall_work(unsigned long flags)
+0 −11
Original line number Diff line number Diff line
@@ -79,16 +79,5 @@ void noinstr __no_stack_protector do_syscall(struct pt_regs *regs)
					   regs->regs[7], regs->regs[8], regs->regs[9]);
	}

	/*
	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
	 * bits. The actual entropy will be further reduced by the compiler
	 * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
	 * aligned, which will remove the 4 low bits from any entropy chosen
	 * here.
	 *
	 * The resulting 6 bits of entropy is seen in SP[9:4].
	 */
	choose_random_kstack_offset(get_cycles());

	syscall_exit_to_user_mode(regs);
}
+2 −14
Original line number Diff line number Diff line
@@ -20,8 +20,6 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)

	kuap_lock();

	add_random_kstack_offset();

	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
		BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);

@@ -30,6 +28,8 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
	CT_WARN_ON(ct_state() == CT_STATE_KERNEL);
	user_exit_irqoff();

	add_random_kstack_offset();

	BUG_ON(regs_is_unrecoverable(regs));
	BUG_ON(!user_mode(regs));
	BUG_ON(arch_irq_disabled_regs(regs));
@@ -173,17 +173,5 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
	}
#endif

	/*
	 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
	 * so the maximum stack offset is 1k bytes (10 bits).
	 *
	 * The actual entropy will be further reduced by the compiler when
	 * applying stack alignment constraints: the powerpc architecture
	 * may have two kinds of stack alignment (16-bytes and 8-bytes).
	 *
	 * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
	 */
	choose_random_kstack_offset(mftb());

	return ret;
}
+0 −12
Original line number Diff line number Diff line
@@ -344,18 +344,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
			syscall_handler(regs, syscall);
		}

		/*
		 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
		 * so the maximum stack offset is 1k bytes (10 bits).
		 *
		 * The actual entropy will be further reduced by the compiler when
		 * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
		 * for RV32I or RV64I.
		 *
		 * The resulting 6 bits of entropy is seen in SP[9:4].
		 */
		choose_random_kstack_offset(get_random_u16());

		syscall_exit_to_user_mode(regs);
	} else {
		irqentry_state_t state = irqentry_nmi_enter(regs);
Loading