How to prove LLM inference
Introduction
Verifiable LLM inference is one of the tools in the toolbox for containing powerful AI, and potentially for enforcing international agreements around AI.
By "proving inference" I mean producing a zero-knowledge proof that a given output was produced by a specific model on a given input, without revealing the model's weights. A few AI safety and security use cases that performant proofs of inference would unlock:
- A reverse firewall for containment. A firewall around an air-gapped inference datacenter that only lets a data packet out if it comes with a ZK proof that the packet is the inference result of the expected model. This could detect weight exfiltration and models being maliciously modified after their safety evaluation, without the firewall needing access to the model weights. Zero-knowledge is the central piece: the firewall never needs the weights, and an attacker would have to compromise both the air-gapped datacenter and the firewall to get around it.
- Enforcing international agreements. Agreements to regulate large AI training runs can only be enforced if states can verify that the others are following them. Proofs of inference can be a building block: a datacenter could be required to prove that the data going in and out corresponds to inference with an agreed-upon model, not training. An even stronger version would install network taps inside the datacenter and require proofs at that scale, so all internal traffic can be checked.
- Trustless safety evaluations. External safety evaluations of closed-source models rely on the assumption that the model being served is the same one that was tested. A proof of inference removes that trust assumption.
I've been working on a project with 0xPARC to build a ZK prover for LLM inference since February 2026. The goal of this article is to transmit some of the intuition I've built since then, with the hope of inspiring more people to work on this problem.
Approach
We want to prove that the output of an LLM inference was produced by a specified model given a known input, without revealing the model's weights. The first step in our approach is to define the computation graph of every operation that happens in a single forward pass, prove each operation individually, and connect the proofs by showing the input to an operation matches the output of the previous operation, without sharing intermediate activations.
The operations we have to prove include linear operations like matrix multiplication, non-linear operations like activation functions, as well as structural operations like splitting or concatenating a matrix for attention heads. Each of these operations requires a different proof setup, but they all use a shared set of tools. To build intuition, we will look at the proof setup for Hadamard multiplication and matrix multiplication in some detail, and provide the main idea for the remaining operations below.
How to prove a Hadamard multiplication
We will start by explaining the approach for proving Hadamard multiplication to introduce the most fundamental concepts, and will expand it to proving matrix multiplication in the next section.
We want to prove without revealing , , or .
One of the most fundamental tools in our toolkit is a protocol called sum check. To keep the focus on the intuition for proving Hadamard, in this section we will treat sum check as a black box API, and will defer the more detailed explanation to a later section.
Sum check allows us to prove that an arbitrary multivariate polynomial sums to a claimed value over all binary inputs, without the verifier having to evaluate the polynomial at all binary inputs. More formally, it allows us to prove statements of the form
The result of sum check is a single evaluation , where is a non-binary point derived via the protocol.
To apply sum check to prove Hadamard multiplication, we need to reframe as a claim about a polynomial summing to a known value over all binary inputs.
The first step is to realize that we can turn any matrix with entries into a multilinear polynomial with variables, by constructing a polynomial that, for each binary input, returns the matrix value at that binary index.
Let's make this more concrete with an example. Take the matrix .
We can address each entry with a binary index.
Now we can construct a multilinear polynomial with two variables, and , where each corresponds to one bit of the binary index, starting from the most significant bit.
The polynomial returns the matrix value at all binary inputs, but it can be evaluated at any input. That's why this polynomial is called the Multilinear Extension (MLE) of .
A multilinear polynomial in variables has coefficients, so it is uniquely defined by datapoints. There is a unique MLE for each matrix, and there is no other multilinear polynomial that agrees on all binary inputs but disagrees on any other input.
The polynomial we used to select entries in the example generalizes to , an equality indicator that returns 1 when its two binary inputs match and 0 otherwise:
To be able to accept non-binary inputs for , we formally define as
This form assumes the second argument is binary. When is itself evaluated at non-binary points, as happens inside sum check, we use the symmetric form .
With as a selector, we can construct the MLE for any matrix as a sum over all binary indices.
At binary it returns the entry at index , at non-binary it returns a weighted sum of entries.
Using this method, we can turn our matrices , and into multilinear polynomials , and .
We want to show that for all binary inputs . This is close to a statement we can prove with sum check, but not quite: sum check proves a sum over binary inputs equals a value, but a sum being zero doesn't imply each term is zero, since non-zero terms can cancel.
A common trick is to reduce our statement to one about the zero polynomial (the polynomial that is zero for all inputs).
is not the zero polynomial: the product makes degree 2 in each variable rather than multilinear, and a non-multilinear polynomial can be zero at every binary input without being the zero polynomial.
But just like we can construct the MLE of any matrix, we can construct the MLE of from its values on all binary inputs:
In the honest prover case, is zero on every binary input, and since the MLE is uniquely determined by its values on binary inputs, is the zero polynomial.
But how can we prove is the zero polynomial? By the Schwartz-Zippel lemma, a non-zero polynomial of degree over a field evaluates to zero on at most fraction of inputs in , so if is zero at a random input, it is the zero polynomial with overwhelming probability.
The verifier picks[^1] a random value and we apply sum check to prove
which by construction is .
To summarize:
The prover commits to , , as , and , and runs sum check on . The output is the sum check transcript, a final challenge point , and the claimed evaluations , , .
The verifier checks the sum check transcript, which in the final step requires evaluating the summand at .
The proof is only sound if , , , but Ajtai commitments don't support point openings without revealing the full matrix, so the verifier can't check them directly. Instead, the verifier records each unverified claim as a tuple , meaning the MLE of the matrix behind this commitment evaluates to at . Later all accumulated unverified claims are folded into a few final claims via linear folding (covered in a later section). The verifier checks the final claims by opening the folded commitments, which reveals the folded matrices but nothing about the original matrices.
How to prove other operations
Matrix multiplication
The general approach for proving matrix multiplication is the same as for the Hadamard proof: construct a polynomial sum that captures the statement and apply sum check to prove it. The main difference is how to construct the polynomial sum.
Matrix multiplication, like Hadamard, is a pointwise claim:
We could follow the Hadamard approach exactly and turn the pointwise claim into an MLE over the full evaluation table via the polynomial. But there is a faster way: instead of proving , we pick random vectors and and prove . If , then with overwhelming probability. This idea is known as Freivalds' algorithm.
With MLEs, the "random vectors" are equivalent to evaluating at random points. For random :
- is entry of a random linear combination of the rows of
- is entry of a random linear combination of the columns of
- is the same projection applied to from both sides
So instead of proving pointwise, we can prove its Freivalds-projected form:
The rest (picking , running sum check, folding claims) is analogous to the Hadamard proof.
Non-linear operations: Softmax, SiLU, RMSNorm
Both Hadamard and matrix multiplication are linear operations, which can be proved directly, because they fit the linear nature of the proving system very well.
For proving non-linear operations, we have two options: we can either use a generic approach via lookup tables, or we can find specialized approaches to prove an approximate version that exploits certain attributes of the function. While lookup tables can be used to prove any operation, they are usually much less efficient than the specialized approaches. For this project we've started by implementing a lookup table proof to close the gaps, and then replaced it operation by operation with a specialized proof.
For softmax, the main complexity is proving . We can rewrite as (all logarithms here are base 2). For numerical stability, in practice we subtract from all entries of a vector before applying softmax, so the value for in will always be non-positive. Since large negative values don't contribute significantly to the softmax output, we can clamp at about , which keeps the resulting integer part below 16. We split the exponent into a non-negative integer part and a fractional remainder in :
Then we approximate via linear approximation in with . This is the secant through the two endpoints, and its error stays under about on the interval. Finally we prove the bitshift by by proving the inverse, that the result multiplied by equals the input, while proving via binary decomposition of the exponent :
where each set bit contributes a factor of (that is, 2, 4, 16, or 256). For SiLU we can reuse the approach for proving , and for both SiLU and RMSNorm we can reuse the idea of proving a division by proving the inverse multiplication instead. RMSNorm's nonlinearity is actually an inverse square root rather than a plain division, but the same trick applies: instead of proving we prove .
Reshape
To connect the individual proofs of each operation in the computation graph, the verifier checks that the input to each operation matches the output of the previous operation. But sometimes adjacent mathematical operations don't operate on the same data shape. For example, each attention head acts on a subset of the matrix, so the input to each attention head does not equal the whole output of the previous projection. With access to the matrix, it's trivial to check the subsegment is part of the matrix, but the verifier doesn't have access to the matrix, it can only see commitments and other proof artifacts.
To connect the commitments of these operations in the computation graph, we insert a reshape proof in between them. Conceptually, a reshape operation takes a matrix as input, and returns many smaller matrices as output, each corresponding to a contiguous subsegment of the larger matrix.
The construction for this proof is simpler than for the mathematical operations. As long as the subsegment is aligned with MLE variables, which it is for attention heads, we can select a subsegment by fixing the corresponding MLE variables to the binary values selecting the subsegment.
Then we derive a random evaluation point for each subsegment. We evaluate the subsegment's MLE at this point, and evaluate the original matrix' MLE at the same point, with the fixed variables spliced in. If the value at this random point matches, the MLEs match at all points with high probability[^2] and therefore the subsegment matches the original matrix. The proof doesn't include a sum check, because there is no mathematical relation to prove, only equality of a subsegment that can be defined by fixing MLE variables.
Addition and scalar multiplication
Ajtai commitments are linearly homomorphic, so we don't need any explicit proofs for addition of matrices and scalar multiplication of matrices.
Sum check and Fiat-Shamir
We've previously introduced sum check as a protocol to prove the sum of a function over all binary inputs equals a claimed value , without the verifier having to evaluate at all inputs.
Naively, verifying this claim would require evaluating at all binary input values. The sum check protocol reduces this to rounds of interaction.
In each round, the prover sums out all the other variables to get a univariate polynomial in the round's free variable.
The verifier checks that equals the claimed sum from the previous round, then picks a random challenge value for the free variable. The value of at that challenge becomes the expected sum of the next round. The prover fixes the current variable to the chosen value and proceeds to the next round.
To reconstruct , the verifier needs one more evaluation than its degree (a degree univariate polynomial is determined by points), and the degree of equals the degree of in that variable. For the Hadamard zero check, the summand is degree 3 in each variable, so the prover sends 4 evaluations per round. For matrix multiplication the summand is degree 2, so 3 evaluations suffice.
The result of sum check is a final evaluation point and its evaluation value . It is often used to reduce the proof of mathematical operations to single point evaluations, as described in the Hadamard and matrix multiplication sections.
Sum check is fundamentally an interactive round based protocol. The prover can't know the evaluation points for each round in advance, or otherwise they could forge a polynomial that matches only in the challenge points.
Fiat-Shamir is a simple approach for turning sum check into a non-interactive protocol. At its core is a challenger, a piece of data that is hashed in each round to create deterministic and reproducible randomness. The prover starts by hashing the committed polynomial and derives the first evaluation point from the hash. In each round, the prover hashes the previous challenger to derive the next random challenge.
The verifier can verify the sum check transcript by hashing the same committed polynomial to initialize the same randomness, and verify each round's random challenge by hashing the challenger again.
Ajtai commitments
Throughout the proofs above we treated commitments as a black box: the prover "commits" to a matrix and gets a short commitment it can be held to later. Here's the basic idea.
A commitment scheme is binding: once you commit to a matrix, you can't later open it to a different one. We use Ajtai commitments, which are lattice-based. To commit to a matrix, you flatten it into a vector with small entries and multiply it by a fixed random matrix over a finite field, which gives a result much smaller than the input. Binding comes from the hardness of the Short Integer Solution (SIS) problem: finding two different small-norm inputs with the same commitment would mean finding a short solution to a random lattice problem, which is assumed to be hard. On its own this commitment is not hiding, because it is a deterministic function of the input. Hiding is added by blinding the commitment with a random mask, and the proof's zero-knowledge property comes from the folding step described below.
Two properties of Ajtai commitments we care about:
- They are linearly homomorphic: and . This is why addition and scalar multiplication of matrices come for free, with no extra proof.
- They do not come with a free way to open a single evaluation point without revealing the whole matrix (dedicated evaluation arguments over lattice commitments exist, but they aren't free). This is why the operation proofs can't directly check the claimed MLE evaluations , and instead accumulate them to be discharged later via linear folding.
Linear folding
The operation proofs above leave behind a list of unverified evaluation claims , each meaning the matrix behind this commitment evaluates to at point . Linear folding is an algorithm that folds all accumulated claims into a small number of final claims, checked by opening the folded commitments. Folding reveals the folded matrices but nothing about the original ones, which is also where the proof gets its zero-knowledge property and most of its compression.
Folding is its own topic and not the focus of this article. The scheme we build on is introduced in LatticeFold (Boneh and Chen, 2024). I might write a dedicated explainer on it in the future.
[^1]: To turn this into a non-interactive protocol, the random value is derived from previous prover commitments via a protocol called Fiat-Shamir. This is explained in the "Sum check and Fiat-Shamir" section.
[^2]: According to the Schwartz-Zippel lemma.