Dilithium / ML-DSA Signatures

ML-DSA (FIPS 204), formerly CRYSTALS-Dilithium, is the primary post-quantum digital signature standard. This article covers the implementation-level details: the arithmetic substrate, the key operations, the hint compression scheme, and the side-channel considerations that matter when building a production implementation.

Beginner's Intuition 💡

If Kyber is how two computers securely whisper a secret code to each other, Dilithium (ML-DSA) is how one computer publicly proves its identity without revealing its password.

It works like a highly complex mathematical interrogation. A verifier asks the computer a random, difficult question (a "challenge"). Using its private key, the computer generates a mathematically perfect answer (the "signature") that only the true owner could produce.

Crucially, Dilithium includes a clever safety mechanism ("rejection sampling") that ensures absolutely none of the private key's underlying structure "leaks" into the answer, making it perfectly secure against both classical and quantum eavesdroppers.

The Arithmetic Ring

Dilithium operates in $R_q = \mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 8{,}380{,}417 = 2^{23} - 2^{13} + 1$ . This specific prime was chosen because:

Why q = 8,380,417

$q \equiv 1 \pmod{2 \cdot 256}$ , making a 256-point NTT possible. The primitive 512th root of unity exists modulo $$q$$ . The factored form $2^{23} - 2^{13} + 1$ enables efficient Montgomery reduction using just shifts and additions.

Coefficient ranges

Dilithium uses centered representation: coefficients in $$[-(q-1)/2, (q-1)/2]$$ rather than $$[0, q-1]$$ . This simplifies norm computations and makes the "small" condition for secret/error polynomials equivalent to small absolute value.

Secret key distribution

Secret vectors $s_1 \in R_q^\ell, s_2 \in R_q^k$ have coefficients sampled uniformly from $\{-\eta, \dots, \eta\}$ where $\eta \in \{2, 4\}$ depending on security level. This gives $2\eta+1$ possible values per coefficient.

NTT Implementation

All polynomial multiplications in Dilithium use the NTT. The primitive 256th root of unity modulo $$q$$ is $\zeta = 1{,}753$ (a generator satisfying $\zeta^{256} \equiv -1 \pmod{q}$ ). The NTT algorithm used is a negacyclic NTT — the ring $$R_q$$ factors as a product of 256 degree-1 rings via the Chinese Remainder Theorem, enabling coefficient-wise multiplication in the NTT domain.

NTT Structure in Dilithium

The forward NTT maps $f(x) \in R_q$ to its evaluations at $\zeta^{2i+1}$ for $i = 0, \dots, 255$ :

$\hat{f}_i = f(\zeta^{2i+1}) = \sum_{j=0}^{255} f_j \cdot \zeta^{j(2i+1)} \pmod{q}$

In-place Cooley-Tukey butterfly: 8 stages of 128 butterflies each = 1,024 butterfly operations total. Each butterfly is: $(a, b) \leftarrow (a + \zeta^k b, a - \zeta^k b)$ modulo $$q$$ .

With AVX2: compute 8 butterflies in parallel using 256-bit SIMD registers with Montgomery multiplication for the modular reduction. Full NTT on one polynomial: ≈1μs on modern x86.

Expanding the Matrix A

The public matrix $A \in R_q^{k \times \ell}$ is generated deterministically from a seed $\rho$ using SHAKE-128. It is never stored — always regenerated from the seed when needed.

ExpandA Algorithm

For each entry $$A[i][j]$$ (row $$i$$ , column $$j$$ ):

1. Feed $\rho \| i \| j$ (seed concatenated with indices) to SHAKE-128.
2. Stream output bytes and use rejection sampling to extract coefficients in $$[0, q-1]$$ : discard any 3-byte group whose value $\geq q$ ; accept otherwise.
3. Each coefficient uses 3 bytes; expected ~1.25× overhead from rejection sampling.

The matrix is generated in NTT domain directly — no need to apply NTT afterward. Total cost for ML-DSA-65 ( $6 \times 5 = 30$ ring elements): roughly 30 SHAKE-128 squeezes of 256 coefficients each.

The HighBits / LowBits Decomposition

Dilithium uses a clever bit-decomposition trick to compress signatures. Instead of sending the full commitment $w = Ay \pmod{q}$ , the signer sends only the high bits of $$w$$ , saving significant bandwidth. A small "hint" vector corrects any rounding errors during verification.

Definition 33 — Bit Decomposition

For coefficient $r \in \mathbb{Z}_q$ and parameter $\gamma_2$ :

$r = r_1 \cdot 2\gamma_2 + r_0$
where $r_0 = r \bmod^+ 2\gamma_2$ (centered modulo) and $r_1 = (r - r_0) / 2\gamma_2$ .

$\mathrm{HighBits}(r) = r_1$ , $\mathrm{LowBits}(r) = r_0$ .

In the signature, only $$r_1$$ (high bits) is transmitted. The hint bits $$h$$ in the signature encode corrections to the high bits that occur during the verification computation due to the term $$-cs_2$$ . This keeps the signature compact while allowing exact verification.

Sampling Algorithms

Dilithium needs to sample three types of small polynomials during key generation and signing:

Secret key coefficients (±η)

Use SHAKE-256 in "uniform" mode, rejection-sampling coefficients in $\{0, \dots, 2\eta\}$ and centering. For $\eta = 2$ : each byte provides two coefficients via nibbles (high 4 bits, low 4 bits), rejecting values $$> 14$$ . Rejection rate ~7% — efficient.

Mask vector y (±γ₁)

Coefficients uniform in $[-\gamma_1 + 1, \gamma_1]$ . For $\gamma_1 = 2^{17}$ : each coefficient uses 18 bits; pack 8 coefficients into 18 bytes. For $\gamma_1 = 2^{19}$ : each uses 20 bits; pack 8 coefficients into 20 bytes.

Challenge polynomial c

A sparse polynomial with exactly $\tau \in \{39, 49, 60\}$ nonzero coefficients of value $\pm 1$ out of 256. Generated by SHAKE-256 of the commitment hash: use Fisher-Yates to place $\tau$ nonzero positions, then sign each based on a bit.

Constant-Time Implementation

Dilithium is designed for straightforward constant-time implementation. The main areas of concern:

Side-Channel Considerations

Rejection sampling: The abort-and-restart loop in signing is not secret-dependent — the test is whether $\|z\|_\infty \geq \gamma_1 - \beta$ , which can be computed in constant time. The number of restarts leaks only statistical information (not the secret), and is bounded in expectation.

Coefficient arithmetic: All reductions modulo $$q$$ use Barrett or Montgomery reduction — no secret-dependent branches.

Hint computation: The MakeHint and UseHint operations compare coefficients to a threshold — implemented with branchless conditional selection (e.g., using arithmetic right shift tricks).

Key expansion: Generate $$A$$ from seed — SHAKE-128 output is data-independent. No secret key touches A expansion.

Performance Benchmarks

ML-DSA-65 (AVX2, Skylake)

Key generation: ~50,000 cycles (~17 μs)
Signing: ~110,000 cycles (~37 μs) average
Verification: ~45,000 cycles (~15 μs)

For comparison: ECDSA-256 signing ≈70,000 cycles. Dilithium is comparable in speed with much stronger post-quantum security.

Microcontroller (Cortex-M4)

Key generation: ~2.5M cycles
Signing: ~5.8M cycles
Verification: ~2.2M cycles

At 168 MHz, this is ~15 ms for signing — fast enough for IoT authentication use cases. NTT can be optimized with Cortex-M4's DSP multiply-accumulate instructions.

Memory footprint

Public key: 1,952 bytes (ML-DSA-65)
Secret key: 4,032 bytes
Signature: 3,293 bytes

Stack usage during signing: ~8 KB for intermediate values. Acceptable for constrained embedded systems that typically have ≥16 KB RAM.

Deterministic vs Randomized Signing

ML-DSA supports both randomized signing (fresh random mask $$y$$ each time) and deterministic signing (derive $$y$$ from the secret key and message using a PRF). The NIST standard uses randomized signing by default. Deterministic signing is provided as an option for environments lacking a good entropy source — but it requires careful implementation to avoid fault injection attacks that can recover the secret key if the same mask is used twice.

Fiat-Shamir with Aborts — Why Aborts?

Dilithium's signing algorithm uses a technique called Fiat-Shamir with Aborts, introduced by Lyubashevsky in 2009. The abort mechanism is the core of why the scheme is secure. Understanding it is essential for anyone implementing or auditing the code.

The naive Fiat-Shamir approach for a lattice signature would be:

Commit: sample mask $y$ , compute $w = A y p m o d q$ , send $e x t H i g h B i t s (w)$ .
Challenge: receive sparse polynomial $c in - 1, 0, 1^{256}$ from hash of message and commitment.
Respond: output $z = y + c s_{1}$ .

The problem: $z = y + c s_{1}$ leaks information about the secret key $s_{1}$ . Specifically, the distribution of $z$ is not uniform over short vectors — it is shifted toward $c s_{1}$ . A statistical attack collecting enough signatures can recover $s_{1}$ by averaging, since the mask $y$ is random but $c s_{1}$ is correlated with the secret.

The fix: rejection sampling. Instead of always outputting $z$ , the signer checks whether the distribution of $z$ "looks like" it came from a uniform distribution over short vectors — independent of $s_{1}$ . Specifically, the signer aborts and restarts if:

|z|_infty geq gamma_1 - eta quad ext{where } eta = au cdot eta

If $z$ passes this check, it is output. The security argument relies on the following theorem (Lyubashevsky's rejection sampling lemma): if the mask $y$ is uniform over $[- g amm a_{1} + 1, g amm a_{1}]^{256 k}$ and $c s_{1}$ is small (norm at most $eta$ ), then conditioned on not aborting, the distribution of $z = y + c s_{1}$ is statistically indistinguishable from the uniform distribution over $[-gamma_1+eta, gamma_1-eta]^{256k}$ . The statistical distance is at most $2^{- 80}$ for Dilithium's parameters.

The abort probability is approximately $1 - M^{- 1}$ per iteration, where $M approx e^{eta^2 / (2gamma_1^2) + eta/gamma_1}$ . For ML-DSA-65 parameters, the expected number of restarts is about 4.25 — meaning ~76% of signing attempts succeed on the first try, ~18% require a second try, and so on. In expectation, signing requires fewer than 5 iterations.

The crucial security property: the number of aborts is not secret-dependent in a way that leaks useful information. An attacker seeing the final signature cannot tell how many restarts occurred, and even knowing the restart count reveals nothing about $s_{1}$ beyond what the final signature already reveals (which is provably nothing, by the rejection sampling argument).

Key Sizes — ML-DSA Parameter Sets

ML-DSA comes in three security levels, each a different setting of the module parameters $(k, e l l, e t a, g amm a_{1}, g amm a_{2}, a u)$ . Here are the concrete sizes in bytes.

ML-DSA-44 (NIST Level 2)

Parameters: $k = 4, e l l = 4, e t a = 2, g amm a_{1} = 2^{17}, a u = 39$

Public key: 1,312 bytes
Private key: 2,528 bytes
Signature: 2,420 bytes

Classical security: ~128 bits. Comparable to 128-bit AES / RSA-3072 in terms of NIST level. Smallest parameter set; use when bandwidth is constrained and long-term security (beyond 2040) is not required.

ML-DSA-65 (NIST Level 3)

Parameters: $k = 6, e l l = 5, e t a = 4, g amm a_{1} = 2^{19}, a u = 49$

Public key: 1,952 bytes
Private key: 4,032 bytes
Signature: 3,309 bytes

Classical security: ~192 bits. Recommended default for most applications. Comparable security to AES-192 or P-384 ECDSA.

ML-DSA-87 (NIST Level 5)

Parameters: $k = 8, e l l = 7, e t a = 2, g amm a_{1} = 2^{19}, a u = 60$

Public key: 2,592 bytes
Private key: 4,896 bytes
Signature: 4,627 bytes

Classical security: ~256 bits. Use for long-term data, classified information, or applications requiring the maximum available security margin. Comparable to AES-256 or P-521 ECDSA.

For context: RSA-2048 signature is 256 bytes (but has no quantum security). ECDSA-256 signature is 64 bytes. ML-DSA trades larger signatures for post-quantum security — the 3–4 KB signature size is the main operational cost of the migration.

Signing Algorithm — Full Procedure

Here is the complete signing procedure at the level of the FIPS 204 specification, with each step explained. This is what happens when you call ML-DSA.Sign(sk, message).

Setup. The secret key contains $(s_{1} in R_{q}^{e} l l, s_{2} in R_{q}^{k})$ (secret vectors), the public key seed $h o$ , and the precomputed values $t_{0}$ (low bits of the public key polynomial, used for hint computation) and a hash of the public key.

Step 1 — Compute message hash.

m u = e x t S H A K E - 256 (e x t H (p k) ∣ m)

where $e x t H (p k) = e x t S H A K E - 256 (h o ∣ t_{1})$ is a 32-byte hash of the public key. $m u$ is 64 bytes and acts as the "message representative" — all subsequent operations use $m u$ , not the raw message.

Step 2 — Commit (sample mask). Sample a random 32-byte seed $r_{y}$ (either from CSPRNG for randomized signing, or derived from the secret key and $m u$ for deterministic signing). Use $r_{y}$ to generate the mask vector:

y in R_{q}^{e} l l, q u a d y_{i} e x t u ni f or m o v er [- g amm a_{1} + 1, g amm a_{1}]

Expand the public matrix $A in R_{q}^{k im ese l l}$ from $h o$ (or use the cached NTT-domain version). Compute:

w = A c d o t ha t y p m o d q

Apply inverse NTT to get $w in R_{q}^{k}$ in standard form, then extract high bits:

w_{1} = e x t H i g h B i t s (w) in R_{q}^{k}

Step 3 — Challenge. Compute the challenge polynomial:

c = e x t S am pl e I n B a l l (e x t S H A K E - 256 (m u ∣ w_{1}))

$e x t S am pl e I n B a l l$ produces a polynomial with exactly $a u$ nonzero coefficients of value $p m 1$ , using Fisher-Yates shuffle seeded by the hash output. This is the Fiat-Shamir transform: the challenge is determined by the commitment and the message, making it unpredictable to the signer before the commitment is fixed.

Step 4 — Response with rejection sampling. Compute:

z = y + cc d o t s_{1} p m o d q

Check the rejection condition: if $|z|_infty geq gamma_1 - eta$ or the low-bit check fails, abort and restart from Step 2 with fresh randomness. Otherwise, continue.

Step 5 — Compute hint. Compute:

r_{0} = e x t L o w B i t s (A y - c s_{2}) p m o d q

If $|r_0|_infty geq gamma_2 - eta$ , abort and restart. Otherwise, compute the hint vector $h$ : for each coefficient position where $c t_{0}$ shifts the high bits of $A y - c s_{2}$ , set the corresponding hint bit to 1. If $∣ h ∣_{1} > o m e g a$ (too many hint bits, leaking information), abort and restart.

Step 6 — Output signature. The signature is $(i l d e c, z, h)$ where $i l d e c = e x t S H A K E - 256 (m u ∣ w_{1})$ is the challenge hash (32 bytes), $z in R_{q}^{e} l l$ is the response vector, and $hin 0, 1^{k c d o t 256}$ is the hint vector (at most $o m e g a$ ones, encoded sparsely).

Verification recovers $w_{1}$ from $(A z - c t_{1} c d o t 2^{d}, h)$ and checks that $e x t S H A K E - 256 (m u ∣ w_{1}) = i l d e c$ . The hint $h$ corrects the rounding error introduced by working with $t_{1}$ (compressed public key) instead of the full $t$ .

Dilithium vs FALCON — Trade-offs

NIST standardized two lattice signature schemes: ML-DSA (Dilithium) and FALCON (FIPS 206). They solve the same problem but make very different engineering choices. Choosing between them depends on your constraints.

Signature size

FALCON-512: 666 bytes
FALCON-1024: 1,330 bytes
ML-DSA-44: 2,420 bytes
ML-DSA-65: 3,309 bytes

FALCON produces signatures roughly 3–5x smaller than Dilithium at comparable security levels. For applications with a hard signature-size budget (e.g., blockchain transactions, certificate transparency logs), FALCON is significantly more bandwidth-efficient.

Implementation complexity

Dilithium's arithmetic is over $R_{q}$ with a simple NTT — the same ring as Kyber, with straightforward integer arithmetic throughout. The rejection sampling loop and hint computation are the only non-trivial parts.

FALCON requires discrete Gaussian sampling over the integers — a notoriously difficult operation to implement correctly in constant time. FALCON's key generation involves NTRU lattice trapdoor generation, which requires sampling from a bivariate discrete Gaussian conditioned on a lattice, using the "fast Fourier orthogonalization" algorithm (FFSAMPLING). Getting this right — and in constant time — requires significant expertise.

Side-channel resistance

Dilithium: constant-time implementation is straightforward. The rejection sampling check and NTT operations have data-independent access patterns. NIST's reference implementation is already constant-time on all major platforms.

FALCON: the Gaussian sampler is a known side-channel attack surface. Timing attacks on Gaussian sampling have been demonstrated in practice. Constant-time Gaussian sampling requires careful implementation (e.g., using the FACCT algorithm or rejection sampling from a discrete uniform distribution).

Performance

Signing speed (AVX2, rough order of magnitude):
FALCON-512: ~900 μs (key gen is slow due to NTRU trapdoor)
ML-DSA-65: ~37 μs

Verification speed:
FALCON-512: ~42 μs
ML-DSA-65: ~15 μs

Dilithium is dramatically faster at signing. FALCON key generation is 50–100x slower than Dilithium's. If you generate keys frequently (e.g., ephemeral signing keys), Dilithium wins.

When to use each

Use ML-DSA (Dilithium) when: implementation simplicity and auditability matter; signing speed is critical; you are on an embedded platform without a validated Gaussian sampler; or you are doing a first deployment and want the lowest implementation risk.

Use FALCON when: signature size is the dominant constraint (e.g., blockchains, certificate transparency); you have a vetted, constant-time Gaussian sampler; or you need the smallest possible public key and signature footprint.

The practical consensus

The cryptographic community broadly recommends Dilithium as the safer default for new deployments. NIST itself notes that "ML-DSA is expected to be the more widely used signature scheme." FALCON is valuable for bandwidth-constrained environments where signature size dominates and teams have the expertise to implement Gaussian sampling safely. For most enterprise deployments, ML-DSA-65 is the right choice.

References

Implementation resources

FIPS 204 specification — the authoritative ML-DSA standard document from NIST. doi.org ↗

Reference implementation and test vectors: github.com/pq-crystals/dilithium ↗

Micciancio & Regev survey for the mathematical foundations: PDF ↗