Commit b0430f39 authored by Eric Biggers's avatar Eric Biggers
Browse files

lib/crc: simplify the kconfig options for CRC implementations

Make the following simplifications to the kconfig options for choosing
CRC implementations for CRC32 and CRC_T10DIF:

1. Make the option to disable the arch-optimized code be visible only
   when CONFIG_EXPERT=y.
2. Make a single option control the inclusion of the arch-optimized code
   for all enabled CRC variants.
3. Make CRC32_SARWATE (a.k.a. slice-by-1 or byte-by-byte) be the only
   generic CRC32 implementation.

The result is there is now just one option, CRC_OPTIMIZATIONS, which is
default y and can be disabled only when CONFIG_EXPERT=y.

Rationale:

1. Enabling the arch-optimized code is nearly always the right choice.
   However, people trying to build the tiniest kernel possible would
   find some use in disabling it.  Anything we add to CRC32 is de facto
   unconditional, given that CRC32 gets selected by something in nearly
   all kernels.  And unfortunately enabling the arch CRC code does not
   eliminate the need to build the generic CRC code into the kernel too,
   due to CPU feature dependencies.  The size of the arch CRC code will
   also increase slightly over time as more CRC variants get added and
   more implementations targeting different instruction set extensions
   get added.  Thus, it seems worthwhile to still provide an option to
   disable it, but it should be considered an expert-level tweak.

2. Considering the use case described in (1), there doesn't seem to be
   sufficient value in making the arch-optimized CRC code be
   independently configurable for different CRC variants.  Note also
   that multiple variants were already grouped together, e.g.
   CONFIG_CRC32 actually enables three different variants of CRC32.

3. The bit-by-bit implementation is uselessly slow, whereas slice-by-n
   for n=4 and n=8 use tables that are inconveniently large: 4096 bytes
   and 8192 bytes respectively, compared to 1024 bytes for n=1.  Higher
   n gives higher instruction-level parallelism, so higher n easily wins
   on traditional microbenchmarks on most CPUs.  However, the larger
   tables, which are accessed randomly, can be harmful in real-world
   situations where the dcache may be cold or useful data may need be
   evicted from the dcache.  Meanwhile, today most architectures have
   much faster CRC32 implementations using dedicated CRC32 instructions
   or carryless multiplication instructions anyway, which make the
   generic code obsolete in most cases especially on long messages.

   Another reason for going with n=1 is that this is already what is
   used by all the other CRC variants in the kernel.  CRC32 was unique
   in having support for larger tables.  But as per the above this can
   be considered an outdated optimization.

   The standardization on slice-by-1 a.k.a. CRC32_SARWATE makes much of
   the code in lib/crc32.c unused.  A later patch will clean that up.

Link: https://lore.kernel.org/r/20250123212904.118683-2-ebiggers@kernel.org


Reviewed-by: default avatarArd Biesheuvel <ardb@kernel.org>
Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
parent d0d106a2
Loading
Loading
Loading
Loading
+14 −102
Original line number Diff line number Diff line
@@ -164,34 +164,9 @@ config CRC_T10DIF
config ARCH_HAS_CRC_T10DIF
	bool

choice
	prompt "CRC-T10DIF implementation"
	depends on CRC_T10DIF
	default CRC_T10DIF_IMPL_ARCH if ARCH_HAS_CRC_T10DIF
	default CRC_T10DIF_IMPL_GENERIC if !ARCH_HAS_CRC_T10DIF
	help
	  This option allows you to override the default choice of CRC-T10DIF
	  implementation.

config CRC_T10DIF_IMPL_ARCH
	bool "Architecture-optimized" if ARCH_HAS_CRC_T10DIF
	help
	  Use the optimized implementation of CRC-T10DIF for the selected
	  architecture.  It is recommended to keep this enabled, as it can
	  greatly improve CRC-T10DIF performance.

config CRC_T10DIF_IMPL_GENERIC
	bool "Generic implementation"
	help
	  Use the generic table-based implementation of CRC-T10DIF.  Selecting
	  this will reduce code size slightly but can greatly reduce CRC-T10DIF
	  performance.

endchoice

config CRC_T10DIF_ARCH
	tristate
	default CRC_T10DIF if CRC_T10DIF_IMPL_ARCH
	default CRC_T10DIF if ARCH_HAS_CRC_T10DIF && CRC_OPTIMIZATIONS

config CRC64_ROCKSOFT
	tristate "CRC calculation for the Rocksoft model CRC64"
@@ -214,6 +189,7 @@ config CRC32
	tristate "CRC32/CRC32c functions"
	default y
	select BITREVERSE
	select CRC32_SARWATE
	help
	  This option is provided for the case where no in-kernel-tree
	  modules require CRC32/CRC32c functions, but a module built outside
@@ -223,87 +199,12 @@ config CRC32
config ARCH_HAS_CRC32
	bool

choice
	prompt "CRC32 implementation"
	depends on CRC32
	default CRC32_IMPL_ARCH_PLUS_SLICEBY8 if ARCH_HAS_CRC32
	default CRC32_IMPL_SLICEBY8 if !ARCH_HAS_CRC32
	help
	  This option allows you to override the default choice of CRC32
	  implementation.  Choose the default unless you know that you need one
	  of the others.

config CRC32_IMPL_ARCH_PLUS_SLICEBY8
	bool "Arch-optimized, with fallback to slice-by-8" if ARCH_HAS_CRC32
	help
	  Use architecture-optimized implementation of CRC32.  Fall back to
	  slice-by-8 in cases where the arch-optimized implementation cannot be
	  used, e.g. if the CPU lacks support for the needed instructions.

	  This is the default when an arch-optimized implementation exists.

config CRC32_IMPL_ARCH_PLUS_SLICEBY1
	bool "Arch-optimized, with fallback to slice-by-1" if ARCH_HAS_CRC32
	help
	  Use architecture-optimized implementation of CRC32, but fall back to
	  slice-by-1 instead of slice-by-8 in order to reduce the binary size.

config CRC32_IMPL_SLICEBY8
	bool "Slice by 8 bytes"
	help
	  Calculate checksum 8 bytes at a time with a clever slicing algorithm.
	  This is much slower than the architecture-optimized implementation of
	  CRC32 (if the selected arch has one), but it is portable and is the
	  fastest implementation when no arch-optimized implementation is
	  available.  It uses an 8KiB lookup table.  Most modern processors have
	  enough cache to hold this table without thrashing the cache.

config CRC32_IMPL_SLICEBY4
	bool "Slice by 4 bytes"
	help
	  Calculate checksum 4 bytes at a time with a clever slicing algorithm.
	  This is a bit slower than slice by 8, but has a smaller 4KiB lookup
	  table.

	  Only choose this option if you know what you are doing.

config CRC32_IMPL_SLICEBY1
	bool "Slice by 1 byte (Sarwate's algorithm)"
	help
	  Calculate checksum a byte at a time using Sarwate's algorithm.  This
	  is not particularly fast, but has a small 1KiB lookup table.

	  Only choose this option if you know what you are doing.

config CRC32_IMPL_BIT
	bool "Classic Algorithm (one bit at a time)"
	help
	  Calculate checksum one bit at a time.  This is VERY slow, but has
	  no lookup table.  This is provided as a debugging option.

	  Only choose this option if you are debugging crc32.

endchoice

config CRC32_ARCH
	tristate
	default CRC32 if CRC32_IMPL_ARCH_PLUS_SLICEBY8 || CRC32_IMPL_ARCH_PLUS_SLICEBY1

config CRC32_SLICEBY8
	bool
	default y if CRC32_IMPL_SLICEBY8 || CRC32_IMPL_ARCH_PLUS_SLICEBY8

config CRC32_SLICEBY4
	bool
	default y if CRC32_IMPL_SLICEBY4
	default CRC32 if ARCH_HAS_CRC32 && CRC_OPTIMIZATIONS

config CRC32_SARWATE
	bool
	default y if CRC32_IMPL_SLICEBY1 || CRC32_IMPL_ARCH_PLUS_SLICEBY1

config CRC32_BIT
	bool
	default y if CRC32_IMPL_BIT

config CRC64
	tristate "CRC64 functions"
@@ -343,6 +244,17 @@ config CRC8
	  when they need to do cyclic redundancy check according CRC8
	  algorithm. Module will be called crc8.

config CRC_OPTIMIZATIONS
	bool "Enable optimized CRC implementations" if EXPERT
	default y
	help
	  Disabling this option reduces code size slightly by disabling the
	  architecture-optimized implementations of any CRC variants that are
	  enabled.  CRC checksumming performance may get much slower.

	  Keep this enabled unless you're really trying to minimize the size of
	  the kernel.

config XXHASH
	tristate