commit e6e758fa64 ("crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM")
Author: Eric Biggers
Date:   2024-06-07 19:47:58 +08:00

Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
  of three underlying implementations according to the CPU capabilities
  detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% of the
  size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless multiplication
  whereas the AVX2 code did not.  The "AVX2" code did not actually use
  AVX2; the check for AVX2 was really a check for Intel Haswell or later
  to detect support for fast carryless multiplication.  The long source
  length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless multiplication
  by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance (a sketch of
  what precomputed powers buy follows this list).

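To make concrete what precomputed key powers buy, here is a minimal,
purely illustrative C sketch.  It is not code from this patch: it uses the
gcc/clang __int128 extension as the block type, a plain polynomial basis
rather than GHASH's bit-reflected convention, and made-up function names.

typedef unsigned __int128 u128;

/*
 * Toy GF(2^128) multiplication: bit i is the coefficient of x^i, reduced
 * mod x^128 + x^7 + x^2 + x + 1.  Real GHASH uses a bit-reflected
 * convention; that detail is omitted here for clarity.
 */
static u128 gf128_mul(u128 a, u128 b)
{
        u128 r = 0;
        int i;

        for (i = 0; i < 128; i++) {
                int carry = (int)(a >> 127);

                if ((b >> i) & 1)
                        r ^= a;
                /* a *= x, reduced mod the polynomial (0x87 = x^7+x^2+x+1) */
                a <<= 1;
                if (carry)
                        a ^= 0x87;
        }
        return r;
}

/* Done once per key: pow[i] = H^(i+1). */
static void ghash_precompute_powers(u128 h, u128 pow[8])
{
        int i;

        pow[0] = h;
        for (i = 1; i < 8; i++)
                pow[i] = gf128_mul(pow[i - 1], h);
}

/*
 * Fold 8 blocks x[0..7] into the GHASH accumulator.  Done serially this
 * is a chain of 8 dependent multiplications:
 *
 *     acc = (...((acc ^ x0)*H ^ x1)*H ... ^ x7)*H
 *
 * With the powers precomputed it becomes 8 independent multiplications
 * whose results are simply XORed together:
 *
 *     acc = (acc ^ x0)*H^8 ^ x1*H^7 ^ ... ^ x7*H^1
 */
static u128 ghash_update_8blocks(u128 acc, const u128 x[8], const u128 pow[8])
{
        u128 sum = gf128_mul(acc ^ x[0], pow[7]);       /* times H^8 */
        int i;

        for (i = 1; i < 8; i++)
                sum ^= gf128_mul(x[i], pow[7 - i]);     /* times H^(8-i) */
        return sum;
}

The independent multiplications are what let an unrolled assembly loop keep
the pclmulqdq unit busy across all 8 blocks instead of stalling on a serial
dependency chain.
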
This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is made
  feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration
  and uses Karatsuba multiplication to save one pclmulqdq per block (see
  the sketch after this list) but does not use the multiplication-less
  reduction.  This strikes a good balance across the range of CPUs on
  which this code runs.

- Don't spam the kernel log with an informational message on every boot.

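For reference, the "one pclmulqdq per block" saving comes from Karatsuba
reducing a 128 x 128 -> 256-bit carryless multiply from four pclmulqdq
instructions to three, plus a few cheap XORs and byte shifts.  The
following is a minimal, purely illustrative C-intrinsics sketch, not code
from this patch; the function names are made up and it must be built with
-mpclmul.  A 128-bit operand is viewed as a_hi*x^64 + a_lo.

#include <immintrin.h>

/* Schoolbook: four carryless multiplies. */
static void clmul_256_schoolbook(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
        __m128i ll = _mm_clmulepi64_si128(a, b, 0x00);  /* a_lo * b_lo */
        __m128i lh = _mm_clmulepi64_si128(a, b, 0x10);  /* a_lo * b_hi */
        __m128i hl = _mm_clmulepi64_si128(a, b, 0x01);  /* a_hi * b_lo */
        __m128i hh = _mm_clmulepi64_si128(a, b, 0x11);  /* a_hi * b_hi */
        __m128i mid = _mm_xor_si128(lh, hl);

        *lo = _mm_xor_si128(ll, _mm_slli_si128(mid, 8));
        *hi = _mm_xor_si128(hh, _mm_srli_si128(mid, 8));
}

/* Karatsuba: three carryless multiplies. */
static void clmul_256_karatsuba(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
        __m128i ll, hh, af, bf, mid;

        ll = _mm_clmulepi64_si128(a, b, 0x00);          /* a_lo * b_lo */
        hh = _mm_clmulepi64_si128(a, b, 0x11);          /* a_hi * b_hi */
        af = _mm_xor_si128(a, _mm_srli_si128(a, 8));    /* a_lo ^ a_hi */
        bf = _mm_xor_si128(b, _mm_srli_si128(b, 8));    /* b_lo ^ b_hi */
        mid = _mm_clmulepi64_si128(af, bf, 0x00);
        /* (a_lo^a_hi)*(b_lo^b_hi) ^ ll ^ hh = a_lo*b_hi ^ a_hi*b_lo */
        mid = _mm_xor_si128(mid, _mm_xor_si128(ll, hh));

        *lo = _mm_xor_si128(ll, _mm_slli_si128(mid, 8));
        *hi = _mm_xor_si128(hh, _mm_srli_si128(mid, 8));
}

Since GHASH needs one such 128-bit carryless multiply per 16-byte block (by
the appropriate power of H), using the Karatsuba form instead of the
schoolbook form saves one pclmulqdq per block.
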
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes, which is
representative because AES-GCM is almost always used with short associated
data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>