How to prove LLM inference

2026-03-15

Introduction

Verifiable LLM inference is one of the tools in the toolbox for containing powerful AI, and potentially for enforcing international agreements around AI.

By "proving inference" I mean producing a zero-knowledge proof that a given output was produced by a specific model on a given input, without revealing the model's weights. A few AI safety and security use cases that performant proofs of inference would unlock:

I've been working on a project with 0xPARC to build a ZK prover for LLM inference since February 2026. The goal of this article is to transmit some of the intuition I've built since then, with the hope of inspiring more people to work on this problem.

Approach

We want to prove that the output of an LLM inference was produced by a specified model given a known input, without revealing the model's weights. The first step in our approach is to define the computation graph of every operation that happens in a single forward pass, prove each operation individually, and connect the proofs by showing the input to an operation matches the output of the previous operation, without sharing intermediate activations.

The operations we have to prove include linear operations like matrix multiplication, non-linear operations like activation functions, as well as structural operations like splitting or concatenating a matrix for attention heads. Each of these operations requires a different proof setup, but they all use a shared set of tools. To build intuition, we will look at the proof setup for Hadamard multiplication and matrix multiplication in some detail, and provide the main idea for the remaining operations below.

How to prove a Hadamard multiplication

We will start by explaining the approach for proving Hadamard multiplication to introduce the most fundamental concepts, and will expand it to proving matrix multiplication in the next section.

We want to prove AB=CA*B=C without revealing AA, BB, or CC.

One of the most fundamental tools in our toolkit is a protocol called sum check. To keep the focus on the intuition for proving Hadamard, in this section we will treat sum check as a black box API, and will defer the more detailed explanation to a later section.

Sum check allows us to prove that an arbitrary multivariate polynomial sums to a claimed value over all binary inputs, without the verifier having to evaluate the polynomial at all 2n2^n binary inputs. More formally, it allows us to prove statements of the form

x{0,1}nf(x)=c\sum_{x \in \{0,1\}^n} f(x) = c

The result of sum check is a single evaluation f(r)=vf(r)=v, where rr is a non-binary point derived via the protocol.

To apply sum check to prove Hadamard multiplication, we need to reframe AB=CA * B = C as a claim about a polynomial summing to a known value over all binary inputs.

The first step is to realize that we can turn any matrix with nn entries into a multilinear polynomial with log(n)log(n) variables, by constructing a polynomial that, for each binary input, returns the matrix value at that binary index.

Let's make this more concrete with an example. Take the matrix AA.

A=[1234]A = \begin{bmatrix}\: 1 & 2 \\ 3 & 4 \:\end{bmatrix}

We can address each entry with a binary index.

A=[100201310411]A = \begin{bmatrix}\: 1_{00} & 2_{01} \\ 3_{10} & 4_{11} \:\end{bmatrix}

Now we can construct a multilinear polynomial with two variables, x1x_1 and x2x_2, where each corresponds to one bit of the binary index, starting from the most significant bit.

f(x1,x2)=(1x1)(1x2)1+(1x1)x22+x1(1x2)3+x1x24f(x_1, x_2) = (1-x_1)\cdot(1-x_2)\cdot 1 + (1-x_1)\cdot x_2\cdot 2 + x_1\cdot(1-x_2)\cdot 3 + x_1\cdot x_2\cdot 4 f(0,0)=1f(0,1)=2f(1,0)=3f(1,1)=4f(0,0)=1 \qquad f(0,1)=2 \qquad f(1,0)=3 \qquad f(1,1)=4

The polynomial returns the matrix value at all binary inputs, but it can be evaluated at any input. That's why this polynomial is called the Multilinear Extension (MLE) of AA.

A multilinear polynomial in nn variables has 2n2^n coefficients, so it is uniquely defined by 2n2^n datapoints. There is a unique MLE for each matrix, and there is no other multilinear polynomial that agrees on all binary inputs but disagrees on any other input.

The polynomial we used to select entries in the example generalizes to eqˉ\bar{eq}, an equality indicator that returns 1 when its two binary inputs match and 0 otherwise:

eqˉ(ρ,x)={1if ρ=x0otherwisefor ρ,x{0,1}n\bar{eq}(\rho,x) = \begin{cases} 1 \quad \text{if } \rho = x \\ 0 \quad \text{otherwise} \end{cases} \quad \text{for } \rho, x \in \{0,1\}^n

To be able to accept non-binary inputs for ρ\rho, we formally define eqˉ\bar{eq} as

eqˉ(ρ,x)=xi=1ρixi=0(1ρi)\bar{eq}(\rho, x) = \prod_{x_i = 1} \rho_i\cdot\prod_{x_i = 0} (1 - \rho_i)

This form assumes the second argument xx is binary. When xx is itself evaluated at non-binary points, as happens inside sum check, we use the symmetric form eqˉ(ρ,x)=i(ρixi+(1ρi)(1xi))\bar{eq}(\rho, x) = \prod_i \big(\rho_i x_i + (1 - \rho_i)(1 - x_i)\big).

With eqˉ\bar{eq} as a selector, we can construct the MLE for any matrix as a sum over all binary indices.

Aˉ(ρ)=x{0,1}neqˉ(ρ,x)A[x]\bar{A}(\rho) = \sum_{x \in \{0,1\}^n} \bar{eq}(\rho, x) \cdot A[x]

At binary ρ\rho it returns the entry at index ρ\rho, at non-binary ρ\rho it returns a weighted sum of entries.

Using this method, we can turn our matrices AA, BB and CC into multilinear polynomials Aˉ(ρ)\bar{A}(\rho), Bˉ(ρ)\bar{B}(\rho) and Cˉ(ρ)\bar{C}(\rho).

We want to show that f(x)=Aˉ(x)Bˉ(x)Cˉ(x)=0f(x) = \bar{A}(x) \cdot \bar{B}(x) - \bar{C}(x) = 0 for all binary inputs x{0,1}nx \in \{0,1\}^n. This is close to a statement we can prove with sum check, but not quite: sum check proves a sum over binary inputs equals a value, but a sum being zero doesn't imply each term is zero, since non-zero terms can cancel.

A common trick is to reduce our statement to one about the zero polynomial (the polynomial that is zero for all inputs).

f(x)f(x) is not the zero polynomial: the product Aˉ(x)Bˉ(x)\bar{A}(x) \cdot \bar{B}(x) makes f(x)f(x) degree 2 in each variable rather than multilinear, and a non-multilinear polynomial can be zero at every binary input without being the zero polynomial.

But just like we can construct the MLE of any matrix, we can construct the MLE of f(x)f(x) from its values on all binary inputs:

fˉ(ρ)=x{0,1}neqˉ(ρ,x)f(x)\bar{f}(\rho) = \sum_{x \in \{0,1\}^n} \bar{eq}(\rho,x) \cdot f(x)

In the honest prover case, f(x)f(x) is zero on every binary input, and since the MLE is uniquely determined by its values on binary inputs, fˉ(ρ)\bar{f}(\rho) is the zero polynomial.

But how can we prove fˉ(ρ)\bar{f}(\rho) is the zero polynomial? By the Schwartz-Zippel lemma, a non-zero polynomial of degree dd over a field F\mathbb{F} evaluates to zero on at most d/Fd /|\mathbb{F}| fraction of inputs in Fn\mathbb{F}^n, so if fˉ(ρ)\bar{f}(\rho) is zero at a random input, it is the zero polynomial with overwhelming probability.

The verifier picks[^1] a random value rr and we apply sum check to prove

x{0,1}neqˉ(r,x)f(x)=0\sum_{x \in \{0,1\}^n} \bar{eq}(r, x) \cdot f(x) = 0

which by construction is fˉ(r)=0\bar{f}(r) = 0.

To summarize:

fˉ(r)=0 for random r    fˉ is the zero polynomial    f(x)=Aˉ(x)Bˉ(x)Cˉ(x)=0x{0,1}n    AB=C\begin{aligned} \bar{f}(r) = 0 \text{ for random } r \quad & \implies \quad \bar{f} \text{ is the zero polynomial} \\ & \iff \quad f(x) = \bar{A}(x) \cdot \bar{B}(x) - \bar{C}(x) = 0 \quad \forall x \in \{0,1\}^n \\ & \iff \quad A * B = C \end{aligned}

The prover commits to AA, BB, CC as cmAcm_A, cmBcm_B and cmCcm_C, and runs sum check on xeqˉ(r,x)f(x)=0\sum_x \bar{eq}(r, x) \cdot f(x) = 0. The output is the sum check transcript, a final challenge point rr', and the claimed evaluations Aˉ(r)=valA\bar{A}(r') = val_A, Bˉ(r)=valB\bar{B}(r') = val_B, Cˉ(r)=valC\bar{C}(r') = val_C.

The verifier checks the sum check transcript, which in the final step requires evaluating the summand at rr'.

eqˉ(r,r)(valAvalBvalC)\bar{eq}(r, r') \cdot (val_A \cdot val_B - val_C)

The proof is only sound if Aˉ(r)=valA\bar{A}(r') = val_A, Bˉ(r)=valB\bar{B}(r') = val_B, Cˉ(r)=valC\bar{C}(r') = val_C, but Ajtai commitments don't support point openings without revealing the full matrix, so the verifier can't check them directly. Instead, the verifier records each unverified claim as a tuple (cm,r,val)(cm, r', val), meaning the MLE of the matrix behind this commitment evaluates to valval at rr'. Later all accumulated unverified claims are folded into a few final claims via linear folding (covered in a later section). The verifier checks the final claims by opening the folded commitments, which reveals the folded matrices but nothing about the original matrices.

How to prove other operations

Matrix multiplication

The general approach for proving matrix multiplication is the same as for the Hadamard proof: construct a polynomial sum that captures the statement and apply sum check to prove it. The main difference is how to construct the polynomial sum.

Matrix multiplication, like Hadamard, is a pointwise claim:

AB=C    jA[i,j]B[j,k]=C[i,k]for all binary (i,k)AB = C \iff \sum_j A[i,j] \cdot B[j,k] = C[i,k] \quad \text{for all binary } (i,k)

We could follow the Hadamard approach exactly and turn the pointwise claim into an MLE over the full evaluation table via the eqˉ\bar{eq} polynomial. But there is a faster way: instead of proving AB=CAB = C, we pick random vectors ss and tt and prove sABt=sCtsABt = sCt. If ABCAB \neq C, then sABtsCtsABt \neq sCt with overwhelming probability. This idea is known as Freivalds' algorithm.

With MLEs, the "random vectors" are equivalent to evaluating at random points. For random r1,r2r_1, r_2:

So instead of proving AB=CAB=C pointwise, we can prove its Freivalds-projected form:

jAˉ(r1,j)Bˉ(j,r2)=Cˉ(r1,r2)\sum_j \bar{A}(r_1, j) \cdot \bar{B}(j, r_2) = \bar{C}(r_1,r_2)

The rest (picking r1,r2r_1, r_2, running sum check, folding claims) is analogous to the Hadamard proof.

Non-linear operations: Softmax, SiLU, RMSNorm

Both Hadamard and matrix multiplication are linear operations, which can be proved directly, because they fit the linear nature of the proving system very well.

For proving non-linear operations, we have two options: we can either use a generic approach via lookup tables, or we can find specialized approaches to prove an approximate version that exploits certain attributes of the function. While lookup tables can be used to prove any operation, they are usually much less efficient than the specialized approaches. For this project we've started by implementing a lookup table proof to close the gaps, and then replaced it operation by operation with a specialized proof.

For softmax, the main complexity is proving exe^x. We can rewrite exe^x as ex=2log2(ex)=2xlog2(e)e^x = 2^{\log_2(e^x)} = 2^{x \cdot \log_2(e)} (all logarithms here are base 2). For numerical stability, in practice we subtract max(v)max(v) from all entries of a vector vv before applying softmax, so the value for xx in exe^x will always be non-positive. Since large negative values don't contribute significantly to the softmax output, we can clamp xx at about 11-11, which keeps the resulting integer part qq below 16. We split the exponent into a non-negative integer part and a fractional remainder in (1,0  ](-1, 0\;]:

y=xlog2(e)q=yf=y+q2y=2q+f=2q2f=2f>>q\begin{matrix} y = x \cdot \log_2(e) \\ q = ⌊|y|⌋ \\ f = y + q \\ \\ 2^y = 2^{-q+f} = 2^{-q} \cdot 2^{f} = 2^f >> q \end{matrix}

Then we approximate 2f2^f via linear approximation in (1,0](-1,0] with 1+f/21+f/2. This is the secant through the two endpoints, and its error stays under about 0.050.05 on the interval. Finally we prove the bitshift by qq by proving the inverse, that the result multiplied by 2q2^q equals the input, while proving 2q2^q via binary decomposition of the exponent qq:

q=q120+q221+q322+q4230q<16q = q_1 \cdot 2^0 + q_2 \cdot 2^1 + q_3 \cdot 2^2 + q_4 \cdot 2^3 \qquad 0 \le q \lt 16 2q=i=14(1+qi(22i11))2^q = \prod_{i=1}^{4}\left(1 + q_i \cdot (2^{2^{i-1}} - 1)\right)

where each set bit qiq_i contributes a factor of 22i12^{2^{i-1}} (that is, 2, 4, 16, or 256). For SiLU we can reuse the approach for proving exe^x, and for both SiLU and RMSNorm we can reuse the idea of proving a division by proving the inverse multiplication instead. RMSNorm's nonlinearity is actually an inverse square root rather than a plain division, but the same trick applies: instead of proving r=1/sr = 1/\sqrt{s} we prove r2s=1r^2 \cdot s = 1.

Reshape

To connect the individual proofs of each operation in the computation graph, the verifier checks that the input to each operation matches the output of the previous operation. But sometimes adjacent mathematical operations don't operate on the same data shape. For example, each attention head acts on a subset of the QQ matrix, so the input to each attention head does not equal the whole output of the previous QQ projection. With access to the matrix, it's trivial to check the subsegment is part of the matrix, but the verifier doesn't have access to the matrix, it can only see commitments and other proof artifacts.

To connect the commitments of these operations in the computation graph, we insert a reshape proof in between them. Conceptually, a reshape operation takes a matrix as input, and returns many smaller matrices as output, each corresponding to a contiguous subsegment of the larger matrix.

The construction for this proof is simpler than for the mathematical operations. As long as the subsegment is aligned with MLE variables, which it is for attention heads, we can select a subsegment by fixing the corresponding MLE variables to the binary values selecting the subsegment.

Then we derive a random evaluation point for each subsegment. We evaluate the subsegment's MLE at this point, and evaluate the original matrix' MLE at the same point, with the fixed variables spliced in. If the value at this random point matches, the MLEs match at all points with high probability[^2] and therefore the subsegment matches the original matrix. The proof doesn't include a sum check, because there is no mathematical relation to prove, only equality of a subsegment that can be defined by fixing MLE variables.

Addition and scalar multiplication

Ajtai commitments are linearly homomorphic, so we don't need any explicit proofs for addition of matrices and scalar multiplication of matrices.

commit(A)+commit(B)=commit(A+B)commit(A) + commit(B) = commit(A+B) commit(cA)=ccommit(A)commit(c \cdot A) = c \cdot commit(A)

Sum check and Fiat-Shamir

We've previously introduced sum check as a protocol to prove the sum of a function f(x)f(x) over all binary inputs equals a claimed value cc, without the verifier having to evaluate f(x)f(x) at all 2n2^n inputs.

x{0,1}nf(x)=c\sum_{x \in \{0,1\}^n} f(x) = c

Naively, verifying this claim would require evaluating f(x)f(x) at all 2n2^n binary input values. The sum check protocol reduces this to nn rounds of interaction.

In each round, the prover sums out all the other variables to get a univariate polynomial in the round's free variable.

gi(xi)=x{0,1}n1f(xi,x)g_i(x_i) = \sum_{x \in \{0,1\}^{n-1}} f(x_i, x)

The verifier checks that gi(0)+gi(1)g_i(0) + g_i(1) equals the claimed sum from the previous round, then picks a random challenge value for the free variable. The value of gig_i at that challenge becomes the expected sum of the next round. The prover fixes the current variable to the chosen value and proceeds to the next round.

To reconstruct gig_i, the verifier needs one more evaluation than its degree (a degree dd univariate polynomial is determined by d+1d + 1 points), and the degree of gig_i equals the degree of ff in that variable. For the Hadamard zero check, the summand eqˉ(r,x)(Aˉ(x)Bˉ(x)Cˉ(x))\bar{eq}(r, x) \cdot (\bar{A}(x) \cdot \bar{B}(x) - \bar{C}(x)) is degree 3 in each variable, so the prover sends 4 evaluations per round. For matrix multiplication the summand is degree 2, so 3 evaluations suffice.

The result of sum check is a final evaluation point rr and its evaluation value f(r)f(r). It is often used to reduce the proof of mathematical operations to single point evaluations, as described in the Hadamard and matrix multiplication sections.

Sum check is fundamentally an interactive round based protocol. The prover can't know the evaluation points for each round in advance, or otherwise they could forge a polynomial g(x)g(x) that matches f(x)f(x) only in the challenge points.

Fiat-Shamir is a simple approach for turning sum check into a non-interactive protocol. At its core is a challenger, a piece of data that is hashed in each round to create deterministic and reproducible randomness. The prover starts by hashing the committed polynomial and derives the first evaluation point from the hash. In each round, the prover hashes the previous challenger to derive the next random challenge.

The verifier can verify the sum check transcript by hashing the same committed polynomial to initialize the same randomness, and verify each round's random challenge by hashing the challenger again.

Ajtai commitments

Throughout the proofs above we treated commitments as a black box: the prover "commits" to a matrix and gets a short commitment cmcm it can be held to later. Here's the basic idea.

A commitment scheme is binding: once you commit to a matrix, you can't later open it to a different one. We use Ajtai commitments, which are lattice-based. To commit to a matrix, you flatten it into a vector with small entries and multiply it by a fixed random matrix over a finite field, which gives a result much smaller than the input. Binding comes from the hardness of the Short Integer Solution (SIS) problem: finding two different small-norm inputs with the same commitment would mean finding a short solution to a random lattice problem, which is assumed to be hard. On its own this commitment is not hiding, because it is a deterministic function of the input. Hiding is added by blinding the commitment with a random mask, and the proof's zero-knowledge property comes from the folding step described below.

Two properties of Ajtai commitments we care about:

Linear folding

The operation proofs above leave behind a list of unverified evaluation claims (cm,r,val)(cm, r', val), each meaning the matrix behind this commitment evaluates to valval at point rr'. Linear folding is an algorithm that folds all accumulated claims into a small number of final claims, checked by opening the folded commitments. Folding reveals the folded matrices but nothing about the original ones, which is also where the proof gets its zero-knowledge property and most of its compression.

Folding is its own topic and not the focus of this article. The scheme we build on is introduced in LatticeFold (Boneh and Chen, 2024). I might write a dedicated explainer on it in the future.


[^1]: To turn this into a non-interactive protocol, the random value is derived from previous prover commitments via a protocol called Fiat-Shamir. This is explained in the "Sum check and Fiat-Shamir" section.

[^2]: According to the Schwartz-Zippel lemma.