
hpke-ng: Faster, Smaller, Harder HPKE for Rust

· 26 min read · #HPKE #Rust #Post-Quantum

Today we’re releasing hpke-ng, a clean-slate Rust implementation of HPKE (RFC 9180). It is published under Apache-2.0 OR MIT, and you can install it now with cargo add hpke-ng.

We benchmark hpke-ng against the two major Rust HPKE libraries: hpke-rs, Cryspen’s library and currently the most widely deployed, and rust-hpke, the older type-state-based implementation by Michael Rosenberg. Across 126 head-to-head measurements (76 against hpke-rs and 50 against rust-hpke, which doesn’t support every ciphersuite that hpke-ng and hpke-rs do), hpke-ng wins 103, ties 19, and loses 4. The wins concentrate in five places: the post-quantum decap path (−54% to −55% on ML-KEM, the largest KEM deltas in the dataset; rust-hpke has no post-quantum support), the classical decap path (−41% on X25519 vs hpke-rs, −31% vs rust-hpke), the single-shot open path (every payload size a clean win against both libraries), the post-quantum and ChaCha20 setup paths, and the export path (a new datapoint added in this round and the largest deltas overall: −72% to −76% across all five output lengths vs hpke-rs). End-to-end roundtrip, one full seal-and-open against the same recipient, comes in at −30% vs both libraries on a 1 KiB message. The ties cluster on rows where the underlying primitive crate dominates wall time and the framing overhead is too small to matter.

cargo bench · comparative · 126 head-to-head benchmarks · 2 libraries

| Comparison | Benchmarks | Wins | Tied | Losses |
| --- | --- | --- | --- | --- |
| vs hpke-rs | 76 | 62 | 10 | 4 |
| vs rust-hpke | 50 | 41 | 9 | 0 |
| combined | 126 | 103 | 19 | 4 |

Win: hpke-ng faster (>3% delta). Tied: within criterion noise band (±3%). Loss: competitor faster.
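
The win / tie / loss legend can be read as a small classification rule. The sketch below is an illustration of that rule, not code from the hpke-ng harness: the names `delta_pct` and `classify` are hypothetical, and it assumes the tally is taken on the median timings with a symmetric ±3% band.

```rust
/// Outcome of one head-to-head row, from hpke-ng's point of view.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Win,
    Tie,
    Loss,
}

/// Half-width of the noise band in percent (the ±3% from the legend).
pub const NOISE_BAND_PCT: f64 = 3.0;

/// Signed percentage delta of `ours` relative to `theirs`.
/// Negative means hpke-ng took less time.
pub fn delta_pct(theirs: f64, ours: f64) -> f64 {
    (ours - theirs) / theirs * 100.0
}

/// Classify one row: outside the band in our favour is a win,
/// outside it in the competitor's favour is a loss, otherwise a tie.
pub fn classify(theirs: f64, ours: f64) -> Outcome {
    let d = delta_pct(theirs, ours);
    if d < -NOISE_BAND_PCT {
        Outcome::Win
    } else if d > NOISE_BAND_PCT {
        Outcome::Loss
    } else {
        Outcome::Tie
    }
}
```

Fed the X25519 decap medians reported later in the post (37.03 µs for hpke-rs, 21.96 µs for hpke-ng), `delta_pct` gives roughly −40.7%, which the tables round to −41%.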

All three libraries lean on the RustCrypto family of primitive crates — with one exception we’ll flag below: rust-hpke pulls a different P-256 implementation, and you’ll see the gap show up at the primitive layer when we get there. On the rows where the underlying primitive crate matches, the cryptography itself is identical: all three pass the full RFC 9180 known-answer-test set on the vectors we hand-checked, and all three produce byte-identical wire output to each other on every ciphersuite we differentially tested. The wins are gains in framing, monomorphization, allocation behaviour, and dispatch — not in the underlying crypto math. That is the point: the math is a solved problem, and the surrounding library is where the engineering still has slack.

Why we built another HPKE library

Earlier this year we found and reported two security bugs in hpke-rs. The first was a missing RFC 9180 §7.1.4 zero-shared-secret check: a low-order or identity public key forces the X25519 shared secret to all zeros, after which the rest of the key schedule becomes deterministic and predictable to anyone who knows the static recipient key. The fix is a single comparison against zero, which RFC 9180 explicitly requires; it was missing. The second was a u32 sequence-counter that silently wrapped in release builds, reusing nonces after 2³² messages. Nonce reuse in AEAD is catastrophic — for AES-GCM it leaks the authentication key; for ChaCha20-Poly1305 it leaks plaintext via XOR — and the wrap was happening below the type system, in a release build, with no diagnostic. Both are now fixed. We documented the broader pattern of bugs we kept finding in libraries marketed under “high assurance” branding in February.
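
Both bug classes have small, well-known fixes. The sketch below shows the general shape, under stated assumptions: the type and function names (`check_shared_secret`, `SeqCounter`, `Error`) are hypothetical and are not taken from hpke-rs or hpke-ng, and the zero check is written with a plain OR-fold for readability. A production implementation would normally use a vetted constant-time comparison instead.

```rust
#[derive(Debug, PartialEq)]
pub enum Error {
    /// The DH output was all zeros (low-order or identity public key).
    ZeroSharedSecret,
    /// The sequence counter cannot advance without wrapping.
    MessageLimitReached,
}

/// True if every byte is zero. Folds all bytes together instead of
/// returning early on the first non-zero byte.
pub fn is_all_zero(bytes: &[u8]) -> bool {
    let mut acc = 0u8;
    for &b in bytes {
        acc |= b;
    }
    acc == 0
}

/// Reject an all-zero X25519 shared secret before it reaches the key schedule.
pub fn check_shared_secret(dh: &[u8; 32]) -> Result<(), Error> {
    if is_all_zero(dh) {
        Err(Error::ZeroSharedSecret)
    } else {
        Ok(())
    }
}

/// A sequence counter that fails loudly instead of wrapping.
pub struct SeqCounter(u32);

impl SeqCounter {
    /// Start from an arbitrary value (useful for demonstrating the limit).
    pub fn starting_at(value: u32) -> Self {
        SeqCounter(value)
    }

    /// Return the current sequence number and advance by one.
    /// `checked_add` behaves the same in debug and release builds,
    /// so the counter can never silently wrap back to an old value.
    pub fn advance(&mut self) -> Result<u32, Error> {
        let current = self.0;
        self.0 = current.checked_add(1).ok_or(Error::MessageLimitReached)?;
        Ok(current)
    }
}
```

The counter here is deliberately conservative: it refuses to hand out the very last value rather than risk reuse, and once it has failed it keeps failing.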

rust-hpke, by contrast, is not a library we have audit findings against. It is older — its 0.x line predates hpke-rs by years — uses a type-state design similar to the one we settled on for hpke-ng, and has the careful, hand-tuned feel of a personal project that has aged well. We added it to this round of benchmarks for exactly that reason: comparing only against hpke-rs invites the objection that the bar is low. rust-hpke raises it. The deltas you’ll see below are against both libraries; where rust-hpke supports a ciphersuite, we show its row alongside hpke-rs’s. It doesn’t support post-quantum KEMs, X-Wing, or secp256k1, so those rows stay two-way.

That experience — together with the day-to-day frictions of integrating HPKE through libraries that weren’t built to make those bug classes structurally impossible — is what got us thinking about a rewrite. Three frictions in particular kept showing up in hpke-rs.

The provider abstraction. hpke-rs is structured as a generic library over an HpkeCrypto trait, with two backend implementations — RustCrypto and libcrux — shipped as separate crates. The abstraction is real engineering and serves a real purpose: it lets a deployment swap the underlying crypto stack without touching call sites. The cost is that every primitive call goes through a trait dispatch, every Hpke value carries a 344-byte instance struct (most of it a 256-byte ChaCha20 PRNG state), and the workspace is four crates instead of one.

The struct-owned PRNG. Hpke<Crypto>::new constructs and stores a PRNG. That PRNG is reused across operations, which is fine functionally but creates a subtle aliasing hazard: cloning an Hpke does not clone the PRNG state — per the rustdoc — so a careless clone can reset randomness in ways that aren’t visible at the call site. This is the kind of footgun whose damage is invisible until the day it isn’t.

Option<&[u8]> for required-by-mode arguments. hpke.seal(&pk, info, aad, pt, None, None, None) is the canonical Base-mode call. The three Nones are the PSK, the PSK ID, and the sender static key: the first two are required in the Psk and AuthPsk modes, the third in the Auth and AuthPsk modes. The single seal method accepts every mode by making mode-specific arguments optional, which means the type system can’t tell you that you’ve built a Base-mode call with a PSK supplied.

These are not catastrophic problems. They’re the kind of small persistent costs you stop noticing until you build the alternative.

The shape change: enum dispatch becomes type-state

The single biggest design difference between hpke-ng and hpke-rs is what carries the ciphersuite. In hpke-rs, the ciphersuite is four runtime enums (Mode, KemAlgorithm, KdfAlgorithm, AeadAlgorithm) constructed at Hpke::new time. In hpke-ng, the ciphersuite is the type itself — Hpke<DhKemX25519HkdfSha256, HkdfSha256, ChaCha20Poly1305> — and the struct body is PhantomData<(K, F, A)>. Zero bytes at runtime; everything resolved at the call site by the compiler.

Canonical seal · "encrypt one message"
hpke-rs · 7 args · 3× Option<&[u8]>
let mut hpke = Hpke::<HpkeRustCrypto>::new(
    Mode::Base,
    KemAlgorithm::DhKem25519,
    KdfAlgorithm::HkdfSha256,
    AeadAlgorithm::ChaCha20Poly1305,
);
let kp = hpke.generate_key_pair()?;
let (sk, pk) = kp.into_keys();
let (enc, ct) = hpke.seal(
    &pk, info, aad, pt,
    None, None, None, // psk, psk_id, sk_s
)?;
hpke-ng · 5 args · zero placeholders
type Suite = Hpke<
    DhKemX25519HkdfSha256,
    HkdfSha256,
    ChaCha20Poly1305,
>;
let mut os = OsRng;
let mut rng = os.unwrap_mut();
let (sk, pk) = DhKemX25519HkdfSha256::generate(&mut rng)?;
let (enc, ct) = Suite::seal_base(
    &mut rng, &pk, info, aad, pt,
)?;

rust-hpke’s API is closer to hpke-ng’s: it also encodes KEM / KDF / AEAD as type parameters, and exposes setup_sender::<Aead, Kdf, Kem, _>(&OpModeS::Base, &pk, info, &mut rng) as the canonical entry point. The visible consequence is the same as in hpke-ng: no Option placeholders for unused fields. The wider difference between rust-hpke and hpke-ng is in scope, not shape — rust-hpke ships fewer ciphersuites (no secp256k1, no post-quantum KEMs), bundles the AEAD into the type chain rather than letting Context specialize on it, and doesn’t expose raw encap/decap separately from setup. We’ll come back to that last point in the methodology note.

The visible consequence of hpke-ng’s type-state choice is the call site. The invisible consequence is what the compiler can rule out:

| Operation | hpke-rs | hpke-ng |
| --- | --- | --- |
| Hpke::<XWingDraft06, _, _>::seal_auth(...) | runtime: Error::UnsupportedKemOperation | compile: trait bound K: AuthKem not satisfied |
| Hpke::<_, _, ExportOnly>::seal_base(...) | runtime: HpkeError::InvalidConfig | compile: no seal_base on ExportOnly |
| Wrong KemAlgorithm for a private key | runtime: mismatch error at setup_* | compile: key types are KEM-tagged |
| Base-mode call with Some(psk) argument | runtime: HpkeError::UnnecessaryPsk | compile: seal_base has no PSK parameter |

Each row is a runtime error in hpke-rs and a compiler diagnostic in hpke-ng. rust-hpke catches most of the same shapes at compile time too — but hpke-ng goes further on the Auth/Base and ExportOnly rows, because the method-set split there relies on per-KEM trait bounds (K: AuthKem) and per-AEAD method availability that rust-hpke doesn’t model.

This isn’t theoretical. We’ve seen each of those four shapes in production code reviews — typically as a stale match arm catching the runtime error and turning it into a generic 500. Surfacing them as compile errors deletes a class of code path entirely.
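
How a zero-sized, type-parameterised suite can rule those calls out is easiest to see in miniature. The sketch below is a toy model of the technique, not hpke-ng's source: the marker types and the `SealingAead` trait are hypothetical stand-ins, and the method bodies are placeholders. Only the shape (PhantomData plus per-trait-bound impl blocks) is the point.

```rust
use std::marker::PhantomData;

// Marker traits for the three ciphersuite axes (hypothetical stand-ins).
pub trait Kem {}
pub trait AuthKem: Kem {} // KEMs that support the Auth / AuthPsk modes
pub trait Kdf {}
pub trait Aead {}
pub trait SealingAead: Aead {} // AEADs that can actually encrypt

// Marker types standing in for concrete algorithms.
pub struct X25519Kem;
pub struct MlKem768;
pub struct HkdfSha256;
pub struct ChaCha20Poly1305;
pub struct ExportOnly;

impl Kem for X25519Kem {}
impl AuthKem for X25519Kem {}
impl Kem for MlKem768 {} // deliberately no AuthKem impl
impl Kdf for HkdfSha256 {}
impl Aead for ChaCha20Poly1305 {}
impl SealingAead for ChaCha20Poly1305 {}
impl Aead for ExportOnly {} // deliberately no SealingAead impl

/// The suite is carried entirely by the type parameters.
pub struct Hpke<K, F, A>(PhantomData<(K, F, A)>);

// Base-mode sealing exists only for AEADs that can encrypt.
impl<K: Kem, F: Kdf, A: SealingAead> Hpke<K, F, A> {
    pub fn seal_base() -> &'static str {
        "base" // placeholder body
    }
}

// Auth-mode sealing additionally requires an authenticated KEM.
impl<K: AuthKem, F: Kdf, A: SealingAead> Hpke<K, F, A> {
    pub fn seal_auth() -> &'static str {
        "auth" // placeholder body
    }
}

// These two lines would fail to compile if uncommented:
// Hpke::<MlKem768, HkdfSha256, ChaCha20Poly1305>::seal_auth(); // AuthKem bound not satisfied
// Hpke::<X25519Kem, HkdfSha256, ExportOnly>::seal_base();      // SealingAead bound not satisfied
```

In this toy model, `std::mem::size_of::<Hpke<X25519Kem, HkdfSha256, ChaCha20Poly1305>>()` is zero, since a struct holding only PhantomData has no runtime representation.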

Feature parity

Before going further: the obvious skeptical question is what got cut? The answer is one thing, deliberately, and it’s the thing that buys the wins everywhere else.

| Feature | hpke-rs | rust-hpke | hpke-ng |
| --- | --- | --- | --- |
| DH KEMs (X25519, X448, P-256, P-384, P-521, secp256k1) | ✓ all 6 | 4 (no X448, no secp256k1) | ✓ all 6 |
| Post-quantum KEMs (X-Wing draft-06, ML-KEM-768, ML-KEM-1024) | ✓ all 3 (experimental, upstream-flagged unstable) | none | ✓ all 3 (pq feature) |
| KDFs (HKDF-SHA-256 / -384 / -512) | ✓ all 3 | ✓ all 3 | ✓ all 3 |
| AEADs (AES-128-GCM, AES-256-GCM, ChaCha20-Poly1305, ExportOnly) | ✓ all 4 | ✓ all 4 | ✓ all 4 |
| Modes (Base, Psk, Auth, AuthPsk) | ✓ all 4 | ✓ all 4 | ✓ all 4 |
| RFC 9180 KAT conformance | ✓ pass | ✓ pass | ✓ pass |
| Type-state ciphersuite selection | runtime enums | ✓ type-params | ✓ type-params + per-mode method-set |
| Compile-time-rejected operation classes | 0 | ~2 | 4 |
| Raw encap / decap exposed (separable from setup) | no (only via setup_*) | no (only via setup_*) | ✓ yes |
| Pluggable crypto provider (multiple backends) | ✓ RustCrypto + libcrux | ✓ generic over CryptoRng | RustCrypto only |

hpke-ng supports every ciphersuite hpke-rs’s RustCrypto provider supports (it’s a superset of rust-hpke), passes the RFC 9180 KAT vector set, and adds two structural capabilities neither alternative has: a deeper method-set split that catches more configuration errors at compile time, and standalone encap/decap exposed alongside the high-level setup paths.

The one thing hpke-rs has that hpke-ng deliberately doesn’t: a pluggable crypto provider. hpke-rs ships an HpkeCrypto trait with two backend implementations, RustCrypto and libcrux; a deployment can swap one for the other at compile time. hpke-ng ships one provider, RustCrypto, and removes the abstraction. That’s the load-bearing tradeoff. It’s why Hpke<...> is zero-sized, why Context::seal is monomorphized rather than dispatched, why the workspace is one crate, and why the canonical call site is five arguments instead of seven.

The libcrux half of that tradeoff is, in our view, not a feature worth chasing. Our February audit of Cryspen’s “formally verified” libraries found undisclosed silent cryptographic failures in libcrux-ml-dsa (platform-dependent SHA-3 corruption that was patched without a public advisory), entropy-reducing pre-hash clamping in Ed25519, and a denial-of-service panic in libcrux-psq’s AES-GCM decryption path. Cryspen’s public response acknowledged the findings without retracting the “highest level of assurance” marketing language attached to the library; our follow-up analysis and an independent paper have since surfaced further specification deviations in libcrux-ml-dsa. The “formally verified” framing that draws users to libcrux does not, on the evidence we have assembled, describe the engineering you are getting. If you’re using RustCrypto anyway — which is what hpke-ng does by default and what most hpke-rs deployments do in practice — the provider abstraction is paying rent for a backend you would be wise to avoid.

Speed

The full benchmark protocol is in cargo bench --features comparative in the hpke-ng repository: criterion harness, sample sizes 40–60 per group depending on per-iteration cost, 2–3 second measurement windows, RUSTFLAGS="-C target-cpu=native", lto = "thin", codegen-units = 1. Apple Silicon M-series, macOS. The numbers below are the median across two independent bench runs. The post-quantum rows enable hpke-rs-rust-crypto’s experimental feature flag — without it, hpke-rs’s PQ KEMs return UnsupportedKemOperation at runtime; the upstream comment on the gate is “broken and pre-releases. Disabling them until they are stable.” rust-hpke doesn’t ship post-quantum at all, so the PQ and X-Wing rows are two-way (hpke-ng vs hpke-rs only).

Methodology note: the _via_setup_* suffix. hpke-ng exposes raw encap and decap as separable KEM operations. hpke-rs and rust-hpke do not — the nearest public API is setup_sender / setup_receiver, which performs the KEM operation plus the HPKE key schedule plus Context construction. Where the benchmark name carries a _via_setup_* suffix, it is measuring that fuller path because there is no narrower one available; rows so labeled are explicitly not apples-to-apples with hpke-ng’s bare-operation rows, and that is the honest framing. We still include them — the gap reflects an API granularity difference that is itself meaningful to anyone trying to batch KEM operations or pre-compute encap material. We also seed hpke-rs’s deterministic PRNG once per benchmark group, in setup, never inside a timed b.iter() closure; the bare seed() cost (~58 ns) is deliberately hoisted out so it does not bias any hpke-rs measurement.

The wins concentrate in five places: the post-quantum KEM path (the largest KEM deltas in the dataset, peaking at −55% on ML-KEM-1024 decap), the classical KEM decap path, the single-shot open path (every payload size, both AEADs, against both competitors), the post-quantum and ChaCha20 setup paths, and the brand-new export path (−72% to −76% across all five output lengths, the largest deltas overall). Start with the KEM operations on X25519:

KEM operations · X25519

| Operation | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| generate | 8.00 µs | 9.23 µs | 8.27 µs | +3% | −10% |
| derive_key_pair | 8.80 µs | 8.94 µs | 7.93 µs | −10% | −11% |
| encap | 37.97 µs | 40.45 µs | 30.41 µs | −20% | −25% |
| decap | 37.03 µs | 32.00 µs | 21.96 µs | −41% | −31% |

Lower is faster. On encap/decap: hpke-ng measures the raw KEM op; hpke-rs and rust-hpke have no such API, so their figures are for setup_sender/setup_receiver (KEM op + key schedule + Context construction). See the methodology note.

The decap row tells the load-bearing story for hpke-ng’s design: hpke-ng caches the recipient’s serialized public-key bytes alongside the secret, so decap no longer pays a base-point scalar multiplication just to rebuild the recipient-PK piece of kem_context on every call. The classical libraries each run that scalar mult per setup_receiver. The encap row is the methodology note in action — hpke-ng can isolate encap from the surrounding key schedule, the others can’t, and a use case that wants to precompute encap material or batch ephemeral generation can exploit that gap directly. The generate and derive_key_pair rows are fully comparable across all three libraries; hpke-ng wins by 10% on derive_key_pair (cached-PK monomorphization) and is tied with hpke-rs on generate (the bare X25519 keypair generation, where everyone reaches the same primitive ceiling).
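
The cache-at-construction idea is independent of the actual curve arithmetic, so it can be shown with a toy. In the sketch below, `derive_public` is a deliberately meaningless stand-in for the expensive base-point scalar multiplication (it is not cryptography), and `RecomputingKey` / `CachingKey` are hypothetical names, not types from any of the three libraries. A global call counter makes the cost difference observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how often the "expensive" derivation runs.
pub static DERIVE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Toy stand-in for deriving public-key bytes from a secret.
/// NOT cryptography: it only exists so the call count can be observed.
fn derive_public(secret: &[u8; 32]) -> [u8; 32] {
    DERIVE_CALLS.fetch_add(1, Ordering::Relaxed);
    let mut out = *secret;
    for b in out.iter_mut() {
        *b = b.rotate_left(3) ^ 0xA5;
    }
    out
}

/// Stores only the secret and rebuilds the public bytes on every use.
pub struct RecomputingKey {
    secret: [u8; 32],
}

impl RecomputingKey {
    pub fn new(secret: [u8; 32]) -> Self {
        Self { secret }
    }

    /// Public-key bytes for kem_context: one derivation per call.
    pub fn kem_context_pk(&self) -> [u8; 32] {
        derive_public(&self.secret)
    }
}

/// Pays for the derivation once, at construction, and keeps the result.
pub struct CachingKey {
    #[allow(dead_code)] // a real decap would use the secret; the toy does not
    secret: [u8; 32],
    pk_bytes: [u8; 32],
}

impl CachingKey {
    pub fn new(secret: [u8; 32]) -> Self {
        let pk_bytes = derive_public(&secret);
        Self { secret, pk_bytes }
    }

    /// Public-key bytes for kem_context: a plain field read.
    pub fn kem_context_pk(&self) -> &[u8; 32] {
        &self.pk_bytes
    }
}
```

The trade is visible in the types themselves: the caching variant is 32 bytes larger, and in exchange repeated calls to `kem_context_pk` never re-run the derivation.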

Setup is the combined fixed cost paid every time a sender or receiver context is constructed: KEM op + key schedule + Context allocation. The wins are consistent across ChaCha20, AES-128, and AES-256, and across all three setup paths measured (Base sender, Base receiver, PSK sender):

Setup paths · sender / receiver / PSK · X25519 and P-256

| Suite | Path | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- | --- |
| X25519+ChaCha20 | sender (Base) | 38.69 µs | 43.07 µs | 30.60 µs | −21% | −29% |
| X25519+ChaCha20 | receiver (Base) | 38.31 µs | 34.86 µs | 21.95 µs | −43% | −37% |
| X25519+ChaCha20 | sender (PSK) | 39.87 µs | 40.53 µs | 29.56 µs | −26% | −27% |
| X25519+AES-128 | sender (Base) | 40.60 µs | 41.07 µs | 31.91 µs | −21% | −22% |
| X25519+AES-256 | sender (Base) | 38.16 µs | 41.51 µs | 30.75 µs | −19% | −26% |
| P-256+AES-128 | sender (Base) | 150.16 µs | 220.93 µs | 139.35 µs | −7% | −37% |
| P-256+AES-256 | sender (Base) | 150.66 µs | 220.46 µs | 139.53 µs | −7% | −37% |
| secp256k1+ChaCha20 | sender (Base) | 59.05 µs | unsupported | 47.12 µs | −20% | n/a |

Receiver-side wins are larger than sender-side because both decap and the post-decap key schedule benefit from the cached public-key bytes. rust-hpke's P-256 path is markedly slower than hpke-rs's; the gap there is at the primitive layer (rust-hpke uses a different P-256 crate), and hpke-ng is faster than both regardless.

The post-quantum KEMs add a structural dimension to the comparison that the classical KEMs don’t. hpke-ng stores MlKem* private keys as the 64-byte (d, z) seed plus the materialized FIPS 203 expanded decapsulation key: built once at construction, kept in the PrivateKey wrapper, and reused on every decap. hpke-rs stores only the seed and rebuilds the expanded key on every setup_receiver call by re-running FIPS 203 KeyGen_internal. That single design choice is the load-bearing reason ML-KEM decap shows up at −54% to −55%, the largest KEM deltas in the dataset (only the export rows are larger). rust-hpke doesn’t ship post-quantum at all, so these rows are two-way. ML-KEM-768 first:

KEM operations · ML-KEM-768 (rust-hpke unsupported)

| Operation | hpke-rs | hpke-ng | Δ vs hpke-rs |
| --- | --- | --- | --- |
| generate | 18.85 µs | 17.66 µs | −6% |
| derive_key_pair | 14.54 µs | 14.26 µs | −2% |
| encap | 24.62 µs | 14.70 µs | −40% |
| decap | 40.86 µs | 18.58 µs | −55% |

decap −55%, encap −40%: hpke-ng caches the expanded decapsulation key in the private key and the parsed encapsulation key in the public key; hpke-rs rebuilds both from raw bytes on every call.

ML-KEM-1024 — the higher-security parameter set — shows the same shape, larger absolute numbers, same architectural delta in the same place:

KEM operations · ML-KEM-1024 (rust-hpke unsupported)

| Operation | hpke-rs | hpke-ng | Δ vs hpke-rs |
| --- | --- | --- | --- |
| generate | 27.55 µs | 28.43 µs | +3% |
| derive_key_pair | 23.07 µs | 24.59 µs | +7% |
| encap | 36.97 µs | 22.99 µs | −38% |
| decap | 61.03 µs | 27.24 µs | −55% |

Same pattern as ML-KEM-768 at higher security level: encap −38%, decap −55%. The +7% on derive_key_pair is the cost of cloning the larger ML-KEM-1024 expanded encapsulation key into the public-key wrapper at construction: paid once, recovered on every subsequent decap.

X-Wing, the X25519 + ML-KEM-768 hybrid, picks up the same DecapsulationKey-caching trick we applied to ML-KEM, plus the parsed EncapsulationKey cache on the encap side. The decap delta (−37%) is smaller than the −55% on the pure ML-KEM rows but still far outside the noise band:

KEM operations · X-Wing draft-06 (rust-hpke unsupported)

| Operation | hpke-rs | hpke-ng | Δ vs hpke-rs |
| --- | --- | --- | --- |
| generate | 40.42 µs | 38.15 µs | −6% |
| derive_key_pair | 38.77 µs | 37.63 µs | −3% |
| encap | 66.01 µs | 55.26 µs | −16% |
| decap | 115.75 µs | 72.64 µs | −37% |

decap −37%: the construction-side expand_key that DecapsulationKey::from(seed) runs (SHAKE-256 + ML-KEM-768 keygen) is now amortized to once per private key.

The same pattern propagates to setup paths. setup_sender is dominated by encap; setup_receiver is dominated by decap; the ML-KEM rows widen accordingly. rust-hpke is unsupported across all of these:

Setup paths · post-quantum (HKDF-SHA-256 + ChaCha20-Poly1305) (rust-hpke unsupported)

| KEM | Path | hpke-rs | hpke-ng | Δ vs hpke-rs |
| --- | --- | --- | --- | --- |
| X-Wing | sender (Base) | 67.10 µs | 58.38 µs | −13% |
| X-Wing | receiver (Base) | 125.95 µs | 74.83 µs | −41% |
| ML-KEM-768 | sender (Base) | 23.90 µs | 16.39 µs | −31% |
| ML-KEM-768 | receiver (Base) | 39.74 µs | 19.16 µs | −52% |
| ML-KEM-1024 | sender (Base) | 34.19 µs | 23.77 µs | −31% |
| ML-KEM-1024 | receiver (Base) | 60.28 µs | 27.99 µs | −54% |

All six post-quantum setup rows: 6 wins, 0 ties, 0 losses.

The single-shot open path (setup_receiver + Context::open for one message) is where hpke-ng wins most consistently across payload size. Six rows spanning three orders of magnitude in payload size (64 B to 64 KiB), against both competitors, and every row goes to hpke-ng:

Single-shot open · X25519 + HKDF-SHA-256 + ChaCha20-Poly1305

| Payload | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| 64 B | 38.19 µs | 35.52 µs | 23.21 µs | −39% | −35% |
| 256 B | 39.09 µs | 33.79 µs | 22.85 µs | −42% | −32% |
| 1 KiB | 38.57 µs | 34.08 µs | 23.30 µs | −40% | −32% |
| 4 KiB | 42.89 µs | 38.18 µs | 27.69 µs | −35% | −28% |
| 16 KiB | 61.20 µs | 58.12 µs | 45.32 µs | −26% | −22% |
| 64 KiB | 131.77 µs | 128.58 µs | 116.16 µs | −12% | −10% |

Lower is faster. Open inherits the full setup-receiver win, including the cached pk_bytes shaving a scalar mult off every decap, so the small-payload regime where setup dominates picks up the largest deltas. All 12 cells against both libraries are hpke-ng wins.

The AES-128-GCM seal sweep shows hpke-ng 6–22% ahead of hpke-rs and 7–27% ahead of rust-hpke from 64 B through 64 KiB. The cached AES cipher state (round keys plus the GHash precomputed table, built once at key-schedule time) eliminates the per-call expansion that hpke-rs runs on every seal:

Single-shot seal · X25519 + HKDF-SHA-256 + AES-128-GCM

| Payload | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| 64 B | 39.16 µs | 41.46 µs | 31.87 µs | −19% | −23% |
| 256 B | 39.67 µs | 42.18 µs | 30.91 µs | −22% | −27% |
| 1 KiB | 42.41 µs | 44.71 µs | 33.60 µs | −21% | −25% |
| 4 KiB | 55.20 µs | 56.50 µs | 44.25 µs | −20% | −22% |
| 16 KiB | 99.38 µs | 100.21 µs | 93.00 µs | −6% | −7% |
| 64 KiB | 286.56 µs | 282.39 µs | 260.29 µs | −9% | −8% |

At 64 KiB the hardware-accelerated AES bulk encryption cost dominates and the framing delta is the residual: still a clean ~8% win against both competitors.

The ChaCha20 seal sweep mirrors the AES one but reaches further into the large-payload regime (up to 256 KiB) where the AEAD primitive dominates and all three libraries converge to the same ceiling. The interesting story is what happens before convergence:

Single-shot seal · X25519 + HKDF-SHA-256 + ChaCha20-Poly1305

| Payload | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| 16 B | 44.40 µs | 44.59 µs | 30.46 µs | −31% | −32% |
| 64 B | 40.46 µs | 41.98 µs | 31.84 µs | −21% | −24% |
| 256 B | 38.95 µs | 41.38 µs | 30.03 µs | −23% | −27% |
| 1 KiB | 40.11 µs | 42.50 µs | 31.81 µs | −21% | −25% |
| 4 KiB | 44.70 µs | 47.03 µs | 35.85 µs | −20% | −24% |
| 16 KiB | 62.65 µs | 64.97 µs | 53.70 µs | −14% | −17% |
| 64 KiB | 135.19 µs | 138.98 µs | 126.17 µs | −7% | −9% |
| 256 KiB | 423.89 µs | 420.97 µs | 409.19 µs | −3% | tie |

At 256 KiB the ChaCha20-Poly1305 primitive, which all three libraries call identically, dominates the wall time and the deltas vanish into noise: at the primitive's ceiling, the framing has nothing left to remove.

Once setup is paid, the post-setup hot path is just Context::seal and Context::open: nonce computation + AEAD primitive call, no KEM, no key schedule. We benchmark these in isolation by running sender and receiver contexts in lockstep outside the timed loop, then measuring seal-then-open on every iteration. This is where the framing-vs-primitive divide is sharpest:

Post-setup hot path · Context::seal · X25519+ChaCha20

| Payload | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| 64 B | 252.51 ns | 222.61 ns | 217.77 ns | −14% | tie |
| 1 KiB | 1.63 µs | 1.60 µs | 1.58 µs | tie | tie |
| 16 KiB | 23.46 µs | 25.24 µs | 23.48 µs | tie | −7% |
| 64 KiB | 93.48 µs | 97.70 µs | 97.94 µs | +5% | tie |

At 64 B framing dominates and hpke-ng's stack-array nonce + monomorphized AEAD call wins 14% vs hpke-rs; at 64 KiB the AEAD primitive owns the wall time and hpke-rs's slightly tighter inner loop edges ahead by 5%, the only post-setup loss in the entire suite.

Post-setup hot path · Context::open · X25519+ChaCha20

| Payload | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| 64 B | 507.59 ns | 460.40 ns | 439.10 ns | −14% | −5% |
| 1 KiB | 3.28 µs | 3.21 µs | 3.27 µs | tie | tie |
| 16 KiB | 46.99 µs | 46.74 µs | 46.88 µs | tie | tie |
| 64 KiB | 187.32 µs | 187.28 µs | 187.35 µs | tie | tie |

Open's pattern mirrors seal: framing-dominant at 64 B (hpke-ng −14% vs hpke-rs), primitive-bound everywhere else (all three libraries sit inside the ±3% noise band from 1 KiB up, and within 0.04% of each other at 64 KiB).

The convergence at 64 KiB is the most diagnostic single result in the suite: on open, all three libraries land within 0.04% of each other, and on seal they sit within about 5%, because all three call the same ChaCha20-Poly1305 code and the primitive owns the wall time. hpke-ng reaches that ceiling on open with no measurable overhead.
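
The nonce computation on that hot path is small enough to show in full. RFC 9180 defines the per-message nonce as the base nonce XORed with the sequence number encoded big-endian and left-padded to the nonce length. The sketch below follows that definition for a 12-byte nonce and works entirely on stack arrays; the function name and the choice of a u64 counter are assumptions made for illustration, not hpke-ng's actual signature.

```rust
/// Nonce length in bytes for the AEADs discussed here.
pub const NN: usize = 12;

/// Per-message nonce: base_nonce XOR I2OSP(seq, NN).
/// No heap allocation: input and output are fixed-size arrays.
pub fn compute_nonce(base_nonce: &[u8; NN], seq: u64) -> [u8; NN] {
    let mut nonce = *base_nonce;
    let seq_bytes = seq.to_be_bytes();
    // The big-endian encoding is left-padded with zeros, so a u64
    // counter only ever touches the last 8 of the 12 bytes.
    for (n, s) in nonce[NN - 8..].iter_mut().zip(seq_bytes.iter()) {
        *n ^= *s;
    }
    nonce
}
```

With a sequence number of zero the result is the base nonce itself, which is why a counter that wraps back to zero repeats the very first nonce.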

Export

The HPKE secret-export interface — Context::export(ctx, len) — is a new addition to this round of benchmarks. It is also the largest single delta in the dataset by a wide margin. Five output lengths, ranging from 16 B (typical for keying a downstream symmetric cipher) to 256 B (room for several derived secrets), against hpke-rs only — rust-hpke’s export() writes into a caller-supplied &mut [u8] rather than returning a Vec, so benchmarking it at these lengths would include a Vec allocation in the timed path and bias the comparison; we omit it rather than ship a misleading number.

Export · X25519 + HKDF-SHA-256 + ChaCha20-Poly1305 (rust-hpke API-incompatible)

| Output length | hpke-rs | hpke-ng | Δ vs hpke-rs |
| --- | --- | --- | --- |
| 16 B | 497.37 ns | 117.09 ns | −76% |
| 32 B | 500.27 ns | 131.15 ns | −74% |
| 64 B | 842.48 ns | 223.93 ns | −73% |
| 128 B | 1.52 µs | 406.16 ns | −73% |
| 256 B | 2.87 µs | 806.43 ns | −72% |

All five rows are wins of 72–76%, the largest sustained delta in the suite. Export is a short enough path that the per-call cost is dominated by what the library does *around* the HKDF call.

The −72% to −76% range is consistent across all five sizes, which tells you the win is structural, not size-dependent. hpke-ng’s export is a labeled-HKDF call with a stack-allocated label and a single allocation for the output Vec; hpke-rs’s path goes through the HpkeCrypto trait, allocates intermediate buffers for the LabeledExtract and LabeledExpand arguments, and routes the HKDF call itself through the provider abstraction. That’s overhead that doesn’t appear in hpke-ng because the type system already knows the KDF at the call site.
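
To make "stack-allocated label" concrete: for export, RFC 9180 builds the HKDF-Expand info as the 2-byte output length, the string "HPKE-v1", the 10-byte suite identifier, the label "sec", and then the caller's exporter context. Everything before the exporter context has a fixed size, so it fits in a small array. The sketch below builds only that fixed prefix; the function name is hypothetical, the HKDF call itself is left out, and this is not a copy of hpke-ng's code.

```rust
/// 2 (length) + 7 ("HPKE-v1") + 10 (suite_id) + 3 ("sec")
pub const PREFIX_LEN: usize = 22;

/// Fixed-size prefix of the labeled info used for export.
/// The caller's exporter context would be appended after it.
pub fn export_info_prefix(
    kem_id: u16,
    kdf_id: u16,
    aead_id: u16,
    out_len: u16,
) -> [u8; PREFIX_LEN] {
    let mut buf = [0u8; PREFIX_LEN];
    buf[0..2].copy_from_slice(&out_len.to_be_bytes()); // I2OSP(L, 2)
    buf[2..9].copy_from_slice(b"HPKE-v1"); // version label
    buf[9..13].copy_from_slice(b"HPKE"); // suite_id starts here
    buf[13..15].copy_from_slice(&kem_id.to_be_bytes());
    buf[15..17].copy_from_slice(&kdf_id.to_be_bytes());
    buf[17..19].copy_from_slice(&aead_id.to_be_bytes());
    buf[19..22].copy_from_slice(b"sec"); // export label
    buf
}
```

For the suite benchmarked here the identifiers are 0x0020 (DHKEM X25519), 0x0001 (HKDF-SHA-256), and 0x0003 (ChaCha20-Poly1305). Because the identifiers are constants once the suite is a type, a monomorphized implementation can in principle build this prefix without any runtime lookup.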

End-to-end roundtrip

A single number for the eyes-glaze-over question of “how fast is the whole thing, end to end”: one 1 KiB seal-and-open against the same recipient keypair, full setup on each side every iteration, three libraries:

End-to-end roundtrip · 1 KiB · X25519+ChaCha20

| Operation | hpke-rs | rust-hpke | hpke-ng | Δ vs hpke-rs | Δ vs rust-hpke |
| --- | --- | --- | --- | --- | --- |
| seal + open | 77.31 µs | 77.11 µs | 53.90 µs | −30% | −30% |

hpke-rs and rust-hpke land within 0.3% of each other; the framing differences between them roughly cancel at this granularity. hpke-ng is a clean 30% faster than both.

Memory

Memory has two halves: the configuration and per-context state — where hpke-ng is uniformly smaller — and the per-key footprint, where the post-quantum KEMs introduce a deliberate tradeoff in the other direction. The memory comparison below is against hpke-rs only; rust-hpke’s Context shape varies with its type parameters in ways that aren’t directly comparable to hpke-rs’s enum-keyed instance struct.

| Type | hpke-rs | hpke-ng | Δ |
| --- | --- | --- | --- |
| sizeof(Hpke<K, F, A>) | 344 B | 0 B (PhantomData) | −344 B (zero-sized) |
| sizeof(Context<_, _, ChaCha20Poly1305>) | 400 B | 88 B | −312 B (−78%) |
| sizeof(Context<_, _, Aes128Gcm>) | 400 B | 792 B | +392 B (cached AES round keys + GHash) |
| sizeof(Context<_, _, Aes256Gcm>) | 400 B | 1,048 B | +648 B (cached AES round keys + GHash) |

hpke-ng::Hpke<K, F, A> is PhantomData<(K, F, A)>. There is no runtime presence; it costs zero bytes and cargo expand confirms the compiler optimizes it out completely. hpke-rs::Hpke<Crypto> carries a 256-byte ChaCha20 PRNG plus four enum discriminants and padding (344 bytes measured at the time of writing).

Context is where hpke-ng’s per-AEAD specialization shows up. With ChaCha20Poly1305 the cipher state is just the 32-byte key, so Context is 88 bytes — under a quarter of hpke-rs’s 400, since hpke-rs’s Context carries a per-instance PRNG plus trait-object overhead that hpke-ng’s monomorphized design doesn’t need. With AES-GCM the trade goes the other way: hpke-ng caches the expanded round keys plus the precomputed GHash table inline so that Context::seal doesn’t pay key-schedule expansion on every call. That’s the load-bearing reason AES-128 single-shot seal is 6–22% faster across the sweep. Streaming AES applications get the throughput; ChaCha20 deployments stay at the small footprint.

Practical impact: an application that holds a thousand long-lived ChaCha20 contexts — a server with persistent client sessions, a relay, MLS group state — saves ~310 KB of resident memory over hpke-rs. An AES-128-GCM deployment with the same shape pays ~390 KB of additional context state for the per-call seal-side speedup. Whether that’s a good trade is application-specific, and it’s now an explicit choice the type system makes visible.

The post-quantum and PK-cache tradeoff: hpke-ng private keys are larger across the board

Per-key footprint is where the speed/memory trade lands explicitly. Every KEM private key in hpke-ng now caches material that hpke-rs reconstructs from raw bytes on demand: the recipient’s serialized public key for the DH-KEMs (so decap doesn’t recompute it via base-point scalar multiplication), the expanded x_wing::DecapsulationKey for X-Wing (so decap doesn’t re-run SHAKE-256 + ML-KEM-768 keygen), and the materialized FIPS 203 decapsulation key for ML-KEM (same trick at the parameter-set boundary). The size impact is uniform and easy to reason about:

KEM private key footprint · stack + heap, in bytes

| KEM | hpke-rs | hpke-ng | Δ |
| --- | --- | --- | --- |
| X25519 | 56 B | 88 B | +32 B |
| P-256 | 56 B | 121 B | +65 B |
| X-Wing | 56 B | 1,698 B | +1,642 B |
| ML-KEM-768 | 88 B | 3,266 B | +3,178 B |
| ML-KEM-1024 | 88 B | 4,290 B | +4,202 B |

Every row spends extra memory at private-key construction in exchange for not paying the same work on every subsequent decap. That is the explicit memory cost of the −37% to −55% decap deltas.

So the trade is concrete: an extra 32–65 B per DH private key, ~1.6 KB extra per X-Wing private key, ~3.2 KB extra per ML-KEM-768 private key, and ~4.2 KB extra per ML-KEM-1024 private key. In exchange the recipient skips a base-point scalar mult per DH decap, a SHAKE-256 + ML-KEM-768 keygen per X-Wing decap, and a FIPS 203 KeyGen_internal per ML-KEM decap. For a server pinning a few thousand long-lived receiver keys this is good arithmetic in essentially every setting we can think of. On a microcontroller pinning very few keys it may be a bad trade. Either way, we’d rather you have the numbers than guess.

Public keys grow in the same direction on the post-quantum side — hpke-ng now caches the parsed EncapsulationKey alongside the wire bytes so encap doesn’t re-decode the 1,184/1,568-byte payload on every call. X-Wing public keys are 1,656 B, ML-KEM-768 public keys are 1,624 B, ML-KEM-1024 public keys are 2,136 B (vs hpke-rs’s Vec<u8> of just the wire bytes). DH public keys are unchanged: 32 B for X25519, 96 B for P-256 (uncompressed encoded point form).

Smaller

Project surface area

| Metric | hpke-rs | hpke-ng | Δ |
| --- | --- | --- | --- |
| End-user binary (stripped, release) | 561 KB | 392 KB | −30% |
| Total project code (cloc) | 5,631 | 4,817 | −14% |
| Library source (cloc) | 2,623 | 2,426 | −8% |
| Test code (cloc) | 2,230 | 1,124 | −50% |
| Bench code (cloc) | 759 | ~1,700 | +124% |
| Crates in workspace | 4 | 1 | −75% |
| User-facing Cargo features | 11 | 7 | −36% |

Expanding on the above:

End-user binary, −30%. A minimal application — generate a key, seal a message, open it back, ten lines of Rust — compiled with RUSTFLAGS="-C target-cpu=native", lto="thin", codegen-units=1, strip="symbols". hpke-ng comes in at 392 KB; hpke-rs at 561 KB. 169 KB is not nothing on embedded targets, in WASM bundles, or in CDN-served binaries.

Library code, −8%. hpke-ng is 197 lines smaller than hpke-rs at the library level (2,426 vs 2,623), while implementing strictly more — the full HPKE surface plus the optional post-quantum suite (X-Wing draft-06, ML-KEM-768, ML-KEM-1024) that hpke-rs reaches only via experimental feature flags. The type-state design earns its keep here: ciphersuite selection lives entirely in the type system, so there is no provider trait, no per-primitive enum dispatch, and no glue between the two. Inline #[cfg(test)] modules are kept tight — anything covered by tests/roundtrip.rs’s 59-cell macro matrix is deleted from src/.

Test code, −50%. hpke-rs’s test suite is 167 tests in 2,230 lines; hpke-ng’s is 128 tests in 1,124 lines, with deeper coverage on roundtrips (59 macro-generated (mode, KEM, KDF, AEAD) combinations vs hpke-rs’s 17 hand-written cases). The reduction is structural — the type system carries information that would otherwise be repeated test setup, and the roundtrip! macro generates one test per supported configuration from a single declaration.

Bench code, +124%. This row runs against the section’s grain, and it’s deliberate. hpke-ng’s bench harness has more than doubled — concentrated in a new comparative.rs that loads hpke-rs and rust-hpke as dev-dependencies and benchmarks every supported ciphersuite against both, including the post-quantum suite that hpke-rs ships only behind an experimental feature flag and that rust-hpke doesn’t ship at all. That single-source-of-truth harness is what made the 126-row tally at the top of this post possible. Coverage, not waste.

The one place this chart doesn’t reach is the fuzz harnesses. hpke-ng spends substantially more code on cargo-fuzz targets than hpke-rs does — deliberately so. That’s the next section.

Harder

cargo test · --features pq,kat-internals,differential: 128 / 128 passing

  • 128 tests passing
  • 1.9s total wall time
  • 37× faster than hpke-rs's KAT runner
  • 4 / 1 cargo-fuzz targets (hpke-ng / hpke-rs)

Breakdown by layer:

  • unit (lib): 46/46
  • RFC 9180 KAT: 13/13
  • roundtrip matrix: 59/59
  • differential vs hpke-rs: 8/8
  • doctests: 2/2
  • cargo-fuzz targets: 4 / 4 clean

The 1.9-second figure is the headline number. The full hpke-ng test matrix — 128 tests across library unit tests, RFC 9180 KAT, generative roundtrips across every ciphersuite × mode combination, and byte-by-byte differential vs hpke-rs — runs in about 1.9 seconds. The roundtrip layer alone is 59 macro-generated tests covering every supported (mode, KEM, KDF, AEAD) combination including all four post-quantum and X-Wing/ML-KEM rows. hpke-rs’s KAT runner is structured as a single test that iterates 144 vectors sequentially and takes about 70 seconds. hpke-ng covers the same vectors but splits them across separate tests, so cargo test’s thread pool can run them in parallel.

For day-to-day development the headline number understates the impact. A 70-second feedback loop is one you avoid running until you’re “done”; a 1.9-second feedback loop is one you run after every save.

The fuzz layer is where hpke-ng makes its biggest investment in lines of code. hpke-rs ships one cargo-fuzz target — a seal/open harness for one ciphersuite. hpke-ng ships four:

  • pk_from_bytes — fuzzes public-key parsing for all 9 KEMs (X25519, X448, P-256, P-384, P-521, secp256k1, X-Wing, ML-KEM-768, ML-KEM-1024).
  • enc_from_bytes — fuzzes encapsulated-key parsing for all 9 KEMs.
  • key_schedule — fuzzes the internal key schedule with arbitrary mode bytes (including invalid 0x04..=0xFF mode values), arbitrary PSK / PSK-ID combinations, and arbitrary shared secrets.
  • open — fuzzes Hpke::open_base with arbitrary [encap || ciphertext] byte splits against a fixed receiver keypair.

The shared invariant across all four is that panics are bugs. Authentication failures, decode errors, length mismatches — all expected outcomes that the harness considers a successful run. A panic, a misaligned-pointer fault, or a debug-assertion failure under cargo-fuzz’s instrumentation is a finding. As of release, all four targets run clean.
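The "panics are bugs" invariant can be sketched as a harness body that discards every Result and relies on the fuzzer to notice a crash. This is an illustration under assumptions: toy_pk_from_bytes, DecodeError, and fuzz_one are hypothetical names, and the parser is a toy 32-byte check, not any hpke-ng KEM. In a real cargo-fuzz target the body would sit inside libfuzzer_sys::fuzz_target!(|data: &[u8]| { ... }).

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    WrongLength,
    Invalid,
}

pub const TOY_PK_LEN: usize = 32;

/// Toy public-key parser (hypothetical): every malformed input must come
/// back as Err, never as a panic.
pub fn toy_pk_from_bytes(data: &[u8]) -> Result<[u8; TOY_PK_LEN], DecodeError> {
    if data.len() != TOY_PK_LEN {
        return Err(DecodeError::WrongLength);
    }
    let mut pk = [0u8; TOY_PK_LEN];
    pk.copy_from_slice(data); // lengths are equal here, so this cannot panic
    if pk.iter().all(|&b| b == 0) {
        return Err(DecodeError::Invalid);
    }
    Ok(pk)
}

/// Body of a fuzz target. Ok and Err are both successful runs, so the
/// result is dropped; only a panic or fault counts as a finding.
pub fn fuzz_one(data: &[u8]) {
    let _ = toy_pk_from_bytes(data);
}
```

Because the harness asserts nothing about the return value, it never needs to know what a valid key looks like, which keeps it stable as the parser changes.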

There are also several structural footgun-prevention details that don’t show up in the fuzz output but are worth listing.

Context is not Clone. Cloning an HPKE context lets two callers reuse the same (key, base_nonce, seq) triple and produce a nonce-reuse bug — the kind of bug that’s invisible until it isn’t and unrecoverable when it surfaces. hpke-ng’s Context deliberately doesn’t implement Clone; cloning is a compile error.

Context::seal refuses to encrypt at seq == u64::MAX. A pre-check, before nonce computation. This makes nonce-reuse via counter wraparound structurally impossible regardless of how the caller handles a MessageLimitReached error.
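The two counter rules above (no Clone, and a pre-check at the maximum sequence number) can be sketched as follows. ToyContext, reserve_seq, and with_seq are hypothetical names for illustration; the sketch models only the counter, not keys, nonces, or hpke-ng's actual Context.

```rust
#[derive(Debug, PartialEq)]
pub enum SealError {
    MessageLimitReached,
}

/// Deliberately no #[derive(Clone)]: two copies of this struct would hand
/// out the same sequence number twice.
pub struct ToyContext {
    seq: u64,
}

impl ToyContext {
    /// Hypothetical constructor that lets a test start near the limit.
    pub fn with_seq(seq: u64) -> Self {
        ToyContext { seq }
    }

    /// Returns the sequence number to use for the next message.
    /// The limit check runs first, so nothing is derived from a counter
    /// value that could wrap around.
    pub fn reserve_seq(&mut self) -> Result<u64, SealError> {
        if self.seq == u64::MAX {
            return Err(SealError::MessageLimitReached);
        }
        let current = self.seq;
        self.seq += 1; // cannot overflow: current < u64::MAX
        Ok(current)
    }
}
```

Once the counter reaches u64::MAX the error is sticky: every later call fails the same way, so a caller that ignores the error and retries still cannot get a repeated sequence number.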

All-zero shared-secret rejection (RFC 9180 §7.1.4) uses subtle::ConstantTimeEq for X25519 and X448. Supplying a small-order point and watching for timing variance is one of the more subtle attacks on DHKEM; the constant-time comparison closes it.
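To show the shape of the check without pulling in a dependency, the sketch below swaps subtle::ConstantTimeEq for a hand-rolled OR-fold that touches every byte before deciding. This is an assumption-laden illustration: is_all_zero, check_shared_secret, and KemError are hypothetical names, and plain Rust like this carries no guarantee that the optimizer keeps it constant-time, which is the reason a crate such as subtle exists.

```rust
#[derive(Debug, PartialEq)]
pub enum KemError {
    AllZeroSharedSecret,
}

/// OR every byte into an accumulator and compare once at the end, so there
/// is no early exit that depends on where the first non-zero byte sits.
/// Illustration only: unlike `subtle`, nothing here stops the compiler from
/// introducing data-dependent branches.
pub fn is_all_zero(bytes: &[u8]) -> bool {
    let mut acc = 0u8;
    for &b in bytes {
        acc |= b;
    }
    acc == 0
}

/// Reject the all-zero DH output, the value a small-order input point
/// would produce.
pub fn check_shared_secret(dh: &[u8]) -> Result<(), KemError> {
    if is_all_zero(dh) {
        Err(KemError::AllZeroSharedSecret)
    } else {
        Ok(())
    }
}
```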

AEAD nonce length is enforced at compile time. Context::compute_nonce uses a const assertion that the AEAD’s NONCE_LEN is between 8 and 12 bytes. Any AEAD that violates this is a compile error at the call site that uses it.
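One way to get this kind of compile-time check in stable Rust is an associated constant that is evaluated when the generic function is instantiated. The sketch below assumes that approach; ToyAead, ToyChaCha, and NonceLenCheck are hypothetical names, and hpke-ng's own assertion may be written differently. The nonce itself follows RFC 9180's construction: base_nonce XOR the big-endian sequence number, left-padded to the nonce length.

```rust
use std::marker::PhantomData;

pub trait ToyAead {
    const NONCE_LEN: usize;
}

pub struct ToyChaCha;
impl ToyAead for ToyChaCha {
    const NONCE_LEN: usize = 12;
}

/// Helper whose associated const only evaluates successfully when the
/// AEAD's nonce length is in range.
struct NonceLenCheck<A>(PhantomData<A>);
impl<A: ToyAead> NonceLenCheck<A> {
    const OK: () = assert!(A::NONCE_LEN >= 8 && A::NONCE_LEN <= 12);
}

pub fn compute_nonce<A: ToyAead>(base_nonce: &[u8], seq: u64) -> Vec<u8> {
    // Referencing the const forces its evaluation for this particular A.
    // An AEAD with, say, NONCE_LEN = 16 fails the build at this call site.
    let () = NonceLenCheck::<A>::OK;
    assert_eq!(base_nonce.len(), A::NONCE_LEN);

    let mut nonce = base_nonce.to_vec();
    let offset = A::NONCE_LEN - 8; // safe: the const check guarantees >= 8
    for (n, s) in nonce[offset..].iter_mut().zip(seq.to_be_bytes().iter()) {
        *n ^= *s;
    }
    nonce
}
```

The lower bound is what makes the subtraction safe: a u64 sequence number needs at least 8 nonce bytes to XOR into.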

Interop

We tested interop two ways. The first is RFC 9180 known-answer tests: both libraries are run against the same vendored test vector JSON (8 MB, derived from RFC 9180’s own test vector tooling) and required to produce byte-equal key, base_nonce, exporter_secret, decrypted ciphertexts, and exported values for every vector. The second is byte-by-byte differential testing: a deterministic ChaCha20Rng feeds identical inputs to both libraries; hpke-ng plays sender, hpke-rs plays receiver, and every byte that crosses the wire is asserted equal. Roughly 600 byte-equality assertions per CI run, all passing.

| Ciphersuite | RFC 9180 KAT | Differential vs hpke-rs |
|---|---|---|
| DHKEM(X25519, SHA-256) × ChaCha20-Poly1305 | ✓ Base / Psk / Auth / AuthPsk | ✓ Base + Psk |
| DHKEM(X25519, SHA-256) × AES-128-GCM | ✓ Base / Psk | ✓ Base |
| DHKEM(X25519, SHA-256) × AES-256-GCM | ✓ Base / Psk | ✓ Base |
| DHKEM(X25519, SHA-256) × ExportOnly | ✓ Base / Psk | — no AEAD |
| DHKEM(P-256, SHA-256) × ChaCha20-Poly1305 | ✓ Base / Psk | ✓ via KAT |
| DHKEM(P-256, SHA-256) × AES-128-GCM | ✓ Base / Psk / Auth / AuthPsk | ✓ Base + Psk |
| DHKEM(P-521, SHA-512) × AES-256-GCM | ✓ Base / Psk / Auth / AuthPsk | — hpke-rs/RustCrypto unsupported |
| DHKEM(secp256k1, SHA-256) × ChaCha20-Poly1305 | ✓ Base / Psk | ✓ via KAT |
| DHKEM(X448, SHA-512) × ChaCha20-Poly1305 | ✓ Base / Psk | — hpke-rs/RustCrypto unsupported |
| ML-KEM-768 × ChaCha20-Poly1305 | — no RFC vectors | — seed-derivation differs by design |
| X-Wing draft-06 × ChaCha20-Poly1305 | — no RFC vectors | — seed-derivation differs by design |
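The shape of a byte-by-byte differential run can be sketched with two independently written implementations fed from one seeded generator. This is a toy under stated assumptions: SplitMix64 stands in for the ChaCha20Rng named above to keep the sketch dependency-free, and frame_a / frame_b are hypothetical length-prefix encoders, not HPKE operations from either library.

```rust
/// Small deterministic PRNG (SplitMix64), used here only as a
/// dependency-free stand-in for a seeded ChaCha20Rng.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    pub fn fill(&mut self, buf: &mut [u8]) {
        for b in buf.iter_mut() {
            *b = self.next_u64() as u8;
        }
    }
}

/// Implementation A: u16 big-endian length prefix via to_be_bytes.
pub fn frame_a(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u16).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Implementation B: the same wire format, written with explicit shifts.
pub fn frame_b(payload: &[u8]) -> Vec<u8> {
    let len = payload.len();
    let mut out = Vec::with_capacity(len + 2);
    out.push((len >> 8) as u8);
    out.push(len as u8);
    out.extend_from_slice(payload);
    out
}

/// Feed identical pseudo-random inputs to both implementations and assert
/// byte equality on every case. Returns the number of cases checked.
pub fn differential(seed: u64, cases: usize) -> usize {
    let mut rng = SplitMix64::new(seed);
    let mut checked = 0;
    for _ in 0..cases {
        let len = (rng.next_u64() % 300) as usize; // always fits in a u16
        let mut payload = vec![0u8; len];
        rng.fill(&mut payload);
        assert_eq!(frame_a(&payload), frame_b(&payload), "mismatch at len {}", len);
        checked += 1;
    }
    checked
}
```

Because the seed fixes every input, a failing case reproduces exactly on the next run, which is the property that makes this style of test usable in CI.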

Some gaps worth covering, for full disclosure:

Auth and AuthPsk differential. hpke-rs’s seed() injects raw bytes for the base ephemeral; for Auth modes there is also a sender static keypair derived earlier, before any seed-injection happens. Aligning the two libraries’ state for byte-by-byte Auth-mode differential testing would require a deeper hpke-rs API hook than hpke-test-prng exposes. Auth/AuthPsk-mode interop is verified at the KAT layer instead — both libraries pass the X25519+ChaCha20 Auth/AuthPsk vectors, the P-256+AES-128 vectors, and the P-521+AES-256 vectors.

Post-quantum differential. hpke-rs’s X-Wing and ML-KEM implementations use different SHAKE-256 seeding from hpke-ng’s RFC 9180 §7.1.3-compliant derive_key_pair construction, so the libraries produce different ephemeral keys from the same IKM bytes. The encap wire format itself is determined by the underlying KEM crate (which both use), so they should agree at the wire level — but we haven’t tested it in this repository, and we’d rather call that out than pretend otherwise.

rust-hpke differential. Not currently wired up. rust-hpke’s PRNG is a generic CryptoRng + RngCore parameter and is straightforward to align with a deterministic ChaCha20Rng, so adding a third differential leg is on the to-do list — but as of this release, interop with rust-hpke is verified only at the speed-bench layer (each library opens what hpke-ng sealed, ad hoc, in unit tests).

Migrate today

All three libraries pass the same RFC 9180 KATs against the same primitive crates. If your code uses HPKE through a small wrapper — which most production HPKE code does — switching is mechanical:

// hpke-rs
let mut hpke = Hpke::<HpkeRustCrypto>::new(
    Mode::Base,
    KemAlgorithm::DhKem25519,
    KdfAlgorithm::HkdfSha256,
    AeadAlgorithm::ChaCha20Poly1305,
);
let kp = hpke.generate_key_pair()?;
let (sk, pk) = kp.into_keys();
let (enc, ct) = hpke.seal(&pk, info, aad, pt, None, None, None)?;

// rust-hpke
let (sk, pk) = X25519HkdfSha256::gen_keypair(&mut rng);
let (enc, mut ctx) = hpke::setup_sender::<ChaCha20Poly1305, HkdfSha256, X25519HkdfSha256, _>(
    &OpModeS::Base, &pk, info, &mut rng,
)?;
let ct = ctx.seal(pt, aad)?;

// hpke-ng
type Suite = Hpke<DhKemX25519HkdfSha256, HkdfSha256, ChaCha20Poly1305>;
let mut os = OsRng;
let mut rng = os.unwrap_mut();
let (sk, pk) = DhKemX25519HkdfSha256::generate(&mut rng)?;
let (enc, ct) = Suite::seal_base(&mut rng, &pk, info, aad, pt)?;

The most common migration shape: define a type Suite = Hpke<…, …, …> alias once, change hpke.seal calls to Suite::seal_base (or seal_psk / seal_auth / seal_auth_psk per mode), thread an &mut rng through the call sites that need encap entropy, drop the Option placeholders. The rust-hpke → hpke-ng shape is even closer: keep the type-state ciphersuite alias, swap module paths, and replace the per-iteration setup_sender with the appropriate Suite::seal_* if you only ever encrypt one message per recipient.

Get it

[dependencies]
hpke-ng = "0.1.0-rc.3"

If you find a row in our benchmark suite that’s wrong, an interop gap we haven’t documented, or a footgun we missed, file an issue. We’d rather know.
