Commit 9f65592b authored by Eric Biggers

lib/crypto: x86/poly1305: Fix performance regression on short messages



Restore the len >= 288 condition on using the AVX implementation, which
was incidentally removed by commit 318c53ae ("crypto: x86/poly1305 -
Add block-only interface").  This check took into account the overhead
in key power computation, kernel-mode "FPU", and tail handling
associated with the AVX code.  Indeed, restoring this check improves
throughput by roughly 10-20% for len < 256 as measured using
poly1305_kunit on an "AMD Ryzen AI 9 365" (Zen 5) CPU:

    Length      Before       After
    ======  ==========  ==========
         1     30 MB/s     36 MB/s
        16    516 MB/s    598 MB/s
        64   1700 MB/s   1882 MB/s
       127   2265 MB/s   2651 MB/s
       128   2457 MB/s   2827 MB/s
       200   2702 MB/s   3238 MB/s
       256   3841 MB/s   3768 MB/s
       511   4580 MB/s   4585 MB/s
       512   5430 MB/s   5398 MB/s
      1024   7268 MB/s   7305 MB/s
      3173   8999 MB/s   8948 MB/s
      4096   9942 MB/s   9921 MB/s
     16384  10557 MB/s  10545 MB/s

While the optimal threshold for this CPU might be slightly lower than
288 (see the len == 256 case), other CPUs would need to be tested too,
and these sorts of benchmarks can underestimate the true cost of
kernel-mode "FPU".  Therefore, for now just restore the 288 threshold.
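
The dispatch decision described above can be modeled in user space. The
following is a minimal sketch, not the kernel code: struct dispatch_model
and use_scalar_path() are hypothetical names introduced for illustration,
and the three booleans stand in for static_branch_likely(&poly1305_use_avx),
irq_fpu_usable(), and ctx->is_base2_26. It assumes POLY1305_BLOCK_SIZE is
16, so that 18 blocks give the 288-byte threshold.

```c
#include <stdbool.h>
#include <stddef.h>

#define POLY1305_BLOCK_SIZE 16
/* 18 blocks * 16 bytes = 288 bytes, the threshold restored by this commit. */
#define AVX_MIN_LEN (POLY1305_BLOCK_SIZE * 18)

/* Hypothetical stand-in for the kernel state consulted by the real check. */
struct dispatch_model {
	bool have_avx;    /* models static_branch_likely(&poly1305_use_avx) */
	bool fpu_usable;  /* models irq_fpu_usable() */
	bool is_base2_26; /* models ctx->is_base2_26 (AVX key setup done) */
};

/* Returns true when the scalar (non-AVX) code path would be taken. */
static bool use_scalar_path(const struct dispatch_model *m, size_t len)
{
	return !m->have_avx ||
	       (len < AVX_MIN_LEN && !m->is_base2_26) ||
	       !m->fpu_usable;
}
```

With AVX available and the FPU usable, a 287-byte call takes the scalar
path and a 288-byte call takes the AVX path. Once the state is already in
base 2^26 form, the length check no longer applies, since the key setup
cost has already been paid.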

Fixes: 318c53ae ("crypto: x86/poly1305 - Add block-only interface")
Cc: stable@vger.kernel.org
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250706231100.176113-6-ebiggers@kernel.org


Signed-off-by: Eric Biggers <ebiggers@kernel.org>
parent 16f2c30e
+8 −0
@@ -98,7 +98,15 @@ void poly1305_blocks_arch(struct poly1305_block_state *state, const u8 *inp,
	BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
		     SZ_4K % POLY1305_BLOCK_SIZE);

	/*
	 * The AVX implementations have significant setup overhead (e.g. key
	 * power computation, kernel FPU enabling) which makes them slower for
	 * short messages.  Fall back to the scalar implementation for messages
	 * shorter than 288 bytes, unless the AVX-specific key setup has already
	 * been performed (indicated by ctx->is_base2_26).
	 */
	if (!static_branch_likely(&poly1305_use_avx) ||
	    (len < POLY1305_BLOCK_SIZE * 18 && !ctx->is_base2_26) ||
	    unlikely(!irq_fpu_usable())) {
		convert_to_base2_64(ctx);
		poly1305_blocks_x86_64(ctx, inp, len, padbit);