NIST’s Known Answer Tests verify final outputs. Formal verification tools verify abstract models. Neither catches the bugs that live in between — the off-by-one rounding, the bounds check compiled to dead code, the inverse NTT quietly omitted from the decryption path.

We know these bugs exist because we’ve found them in production cryptographic code, including code marketed as “formally verified.” So we built a tool to catch them systematically.

Crucible is a conformance testing framework for ML-KEM (FIPS 203) and ML-DSA (FIPS 204) implementations. It encodes the specific bug classes we’ve encountered in real audits as targeted, reusable test batteries that any implementation can be wired up to. Then we pointed it at every ML-KEM and ML-DSA implementation we could find.

The Setup

Crucible communicates with implementations through a simple JSON line protocol over stdin/stdout. A thin harness shim maps Crucible’s function calls to the implementation’s API. We built harnesses for 15 implementations across Rust, Go, C, C++20, and Java:

Implementation Language Who uses it
libcrux Rust Cryspen’s formally verified library
Kyber-K2SO Go Our own implementation
mlkem-native C Ships inside AWS-LC and liboqs
Go stdlib crypto/mlkem Go Every Go 1.26+ binary using TLS 1.3
AWS-LC C AWS KMS, S3, CloudFront
pq-crystals ref C The original reference from the algorithm designers
CIRCL Go Cloudflare’s TLS deployment
liboqs C The most widely used PQC prototyping library
Bouncy Castle Java The dominant JVM crypto library
wolfCrypt C Embedded/IoT, FIPS 140-3 validated
itzmeanjan/ml-kem C++20 Header-only, fully constexpr
pqcrypto Rust Rust FFI bindings to PQClean
PQClean C Clean portable C implementations
Trail of Bits ml-dsa Go Side-channel resistant ML-DSA

The ML-KEM battery has 78 tests across 6 categories (compression arithmetic, NTT correctness, coefficient bounds enforcement, decapsulation robustness, serialization, rejection sampling). The ML-DSA battery has 51 tests across 6 categories (norm checks, arithmetic, signing internals, verification edge cases, serialization, constant-time behavior). Every test is tagged with the bug class it targets, the FIPS spec section it verifies, and — where applicable — the real-world audit finding that motivated it.

The Results

We ran 1,296 test executions. Here’s the ML-KEM scorecard:

Implementation Pass Fail Skip
Reference (Crucible) 78 0 0
libcrux v0.0.8 54 0 24
Kyber-K2SO v1.1.0 54 0 24
mlkem-native 21 0 45
Go stdlib 25 0 45
AWS-LC 33 0 45
CIRCL (Cloudflare) 33 0 45
liboqs 33 0 45
wolfCrypt 33 0 45
itzmeanjan 21 0 45
Bouncy Castle 1.80 25 2 45
pq-crystals ref 19 2 45
PQClean 19 2 45
pqcrypto (rustpq) 21 6 45

(Skipped tests are NTT/sampling internals that implementations don’t expose through their public API. Errors from single-parameter-set harnesses are omitted for clarity.)

Finding 1: Missing Encapsulation Key Modulus Check

Affected: pq-crystals/kyber ref, PQClean, pqcrypto/rustpq Severity: Conformance gap Spec reference: FIPS 203 §7.2, Equation 7.1

FIPS 203 requires that before encapsulation, the encapsulation key must pass a modulus check:

test ← ByteEncode₁₂(ByteDecode₁₂(ek[0 : 384k])). If test ≠ ek[0 : 384k], then input checking failed.

This check ensures that no 12-bit coefficient in the serialized public key encodes a value ≥ q = 3329. All three affected implementations accept a key with coefficient value 3329 without error. None of them provide a separate key validation function.

This is a spec conformance gap, not a vulnerability. The practical impact is limited to interoperability: if two implementations handle an out-of-range coefficient differently (one reduces mod q, another wraps), the same key produces different shared secrets. The pq-crystals codebase predates FIPS 203 — the modulus check was added during standardization.

Finding 2: Bouncy Castle Accepts Wrong-Length Decapsulation Keys

Affected: Bouncy Castle 1.80 (Java) Severity: Input validation gap Spec reference: FIPS 203 §7.3

Bouncy Castle’s MLKEMPrivateKeyParameters constructor accepts byte arrays of any length without validating against the expected size for the parameter set. Crucible sends a decapsulation key that is 1 byte too short (2399 bytes for ML-KEM-768, expected 2400) and Bouncy Castle proceeds with decapsulation.

A truncated dk means the implicit rejection secret z is partially missing, which could weaken the FO transform’s security guarantee. In practice, the dk is private and an attacker wouldn’t normally control it — but a corrupted key file should fail loudly rather than silently produce wrong results.

What About the Other 11 Implementations?

They all passed. Every test. libcrux, mlkem-native, AWS-LC, CIRCL, liboqs, wolfCrypt, itzmeanjan, the Go standard library, and Trail of Bits ml-dsa — all clean.

What This Means

We set out to find the kinds of bugs we’d found in audits: missing inverse NTTs, dead bounds checks, incorrect rounding, timing leaks. We built 129 tests specifically targeting those bug classes. We tested the 15 most deployed post-quantum implementations in the world.

We found two conformance gaps and zero security vulnerabilities.

This is actually the good news. It means the implementations that matter — the ones shipping inside AWS, Cloudflare, the Go standard library, and wolfSSL’s FIPS module — are doing the right things. The FIPS 203 standardization process added the right checks. The formally verified implementations (libcrux) pass. The clean-room implementations (mlkem-native, CIRCL, Trail of Bits) pass.

The bugs we designed Crucible to catch existed in the pre-FIPS era and have largely been fixed. The ecosystem learned from the Cryspen findings, from KyberSlash, from the various timing advisories. The 2026 implementations are better than the 2024 ones.

Crucible Is Open Source

Crucible is available on GitHub. It ships with harness templates for Rust, Go, and C. If you maintain an ML-KEM or ML-DSA implementation, wire it up and run the battery. If you find a bug we missed, it becomes a new test. The framework grows sharper over time.

cargo build --release
cargo run --bin crucible -- ./your-harness
cargo run --bin crucible -- ./your-harness --battery ml-dsa --json

The test batteries will continue to expand as we encounter new bug classes in future audits. The naming isn’t coincidental — like the ancient Jedi vessel where younglings forge their lightsabers, Crucible is where implementations prove they’re truly sound, or reveal where they’re flawed.