Running an EVM Inside a Proof Assistant

There's a sentence in the project's own README that is no longer true, and the gap between what it says and what the code does is the whole story of this post. It describes an "ALU-only demonstrator" — a toy that folds nine arithmetic operations over a stack to make a point about soundness. What's actually sitting in the repository now is a machine that decodes and executes real post-Cancun EVM bytecode: roughly 146 opcodes, a full 256-bit word machine, memory and storage and transient storage, gas metering, sub-calls, and a conformance harness that runs the Ethereum test corpus. It runs your bytecode. And it is the same artifact that a machine-checked soundness theorem reasons about.

That last clause is the one that matters. Plenty of things execute EVM bytecode — geth, revm, evmone, dozens of hobby interpreters. What's rare to the point of being almost unheard-of is an EVM where the interpreter and its correctness proof are one object, checked by a kernel you cannot argue with. This is what it looks like to run an EVM inside a proof assistant, and why that's worth doing.

The EVM is a loop, which is the good news

Strip away the folklore and the EVM is a stack-based virtual machine running one small-step loop. A transaction hands it bytecode and calldata; the machine repeats five moves until the program halts:

Decode the opcode byte at the program counter pc. Most opcodes are one byte; the PUSH1…PUSH32 family carry 1–32 immediate bytes inline.
Dispatch and compute — pop operands off the 256-bit word stack, compute, push results.
Charge gas — every opcode costs something; if gas would go negative, halt out-of-gas and revert.
Advance pc — by one, or past the immediate for PUSH; JUMP/JUMPI write pc directly, but only to a byte tagged JUMPDEST.
Halt? — on STOP/RETURN/REVERT/INVALID/SELFDESTRUCT, stop; otherwise loop.

The state it threads through is a stack and memory (volatile, per-frame), persistent and transient storage, and a world of accounts with balances, nonces, code, and storage. CALL and CREATE push fresh frames with their own budgets.

The "single transition, iterated" shape is exactly why this is tractable to verify. If execution is iterate step, then a statement about the whole program can be reduced to a statement about one step plus an induction. Hold that thought — it's the hinge the entire design turns on.

Yes, it runs — here is the command

The fastest way to dissolve the "is this real" question is to run it:

$ lake exe kevm_pilot run-bytecode 600560030100
stack:      [8]
returnData: 0x
gas:        0

# 60 05  PUSH1 5
# 60 03  PUSH1 3
# 01     ADD
# 00     STOP        →  5 + 3 = 8

run-bytecode decodes the hex into a byte array, builds an initial machine configuration, runs it, and prints the terminal stack, return data, and gas. Add --calldata <hex> to feed it input, --gas <N> to run under the gas-metered machine instead of the gas-free reference, and --json to get the full KEVM-shaped outcome (status, halted, output). A sibling command, run-fixture, loads an ethereum/tests state test and diffs the resulting world against the expected post-state, account by account.

I built a visual, drive-it-yourself version of that loop — step through 60 05 60 03 01 00 one transition at a time and watch pc walk the code while the stack grows and collapses — alongside the full architecture, the soundness story, and the honest list of what it doesn't prove.

→ Open the interactive explainer

Four modules, mirroring the yellow paper

Execution is a pipeline of four Lean files, and the shape deliberately mirrors KEVM cell-for-cell — but every piece is a total function, so the kernel can reason about all of it.

Decode (Evm/Decode.lean) turns bytes into a typed opcode vocabulary. The stack/dup/swap/log families collapse into Fin-carrying constructors — push (n : Fin 33) (imm : Word), dup (n : Fin 16), and so on — so the whole ~146-opcode set is a finite, exhaustive inductive. Crucially, the decoder is upstream of the trusted base: its correctness is discharged by round-trip theorems (decode ∘ encode = id), all axiom-clean, rather than by trusting the implementation.

Config (Evm/Config.lean) is the machine state — stack, memory, pc, gas, the world of accounts, the block context. The big maps (memory, storage) use a spec representation: a total function with a default value, so there are no partiality side-conditions to drag through every proof.

Step (Evm/Step.lean) is the small-step dispatcher. It decodes the opcode at pc exactly once and dispatches to a per-bucket handler (Steps/B1..B11, plus Call). The result is a halting taxonomy:

inductive StepResult (σ) where
  | cont   (s : σ)                      -- keep stepping
  | halt   (s : σ) (out : ByteArray)    -- STOP / RETURN
  | revert (s : σ) (out : ByteArray)
  | stuck                               -- underflow / INVALID / fuel-out

run (Framework/Machine.lean) is the one evaluator — and it's where the magic lives:

structure EvmAlgebra (σ) where
  step : σ → StepResult σ

def EvmAlgebra.run (M : EvmAlgebra σ) : Nat → σ → StepResult σ
  | 0,        _ => .stuck
  | fuel + 1, s => match M.step s with
      | .cont s' => M.run fuel s'
      | r        => r

The fuel parameter is what keeps this a total function even though real EVM programs loop — run out of fuel and you get stuck, never a non-terminating definition the kernel would reject. The concrete EVM is simply aluEvm : EvmAlgebra Config, and you run it with aluEvm.run 10000 c0.

The trick: one `run`, two readings, proved once

Look again at EvmAlgebra.run. Nothing in it mentions Config. It's parametric in the state type σ. Instantiate σ := Config and you get the concrete interpreter. Instantiate σ at an abstract domain — intervals over the 256-bit word, or a sign lattice — and the exact same run becomes a static analyzer that computes an over-approximation of every reachable state.

Soundness is then a single theorem, proved once, by induction on fuel:

-- ★ THE FUNDAMENTAL LEMMA
theorem run_sound (h : SoundEvm MC MA γ) :
    ∀ (fuel : Nat) (sA) (sC), sC ∈ γ sA →
      StepResult.Refines γ (MC.run fuel sC) (MA.run fuel sA)

Read it in English: for every program, every input, and every step count, the concrete run is refined by the abstract run — provided each opcode's per-step transfer function is sound (the SoundEvm witness). The induction is the loop from the very first section: one-step soundness plus iterate gives whole-program soundness. Add a brand-new abstract domain and you owe only the per-opcode arms; the whole-program guarantee falls out for free, with zero edits to the framework.

This is why there's no runtime gate. An earlier version of this lineage sampled abstract-vs-concrete agreement at runtime and trusted the sample. That's gone. Now an unsound transfer function simply fails to produce the SoundEvm witness, and the build goes red. A half-built or dishonest analysis cannot certify itself, because certifying itself means compiling a proof, and the proof doesn't typecheck. The Lean kernel is the only thing you trust.

Two machines, on purpose

There are deliberately two concrete interpreters. aluEvm is gas-free — the soundness reference — so that theorems about what a program computes never get tangled with what it costs. aluEvmMetered adds the static opcode-cost frontier and out-of-gas halting, and is proved to refine the gas-free machine: metered execution does exactly what the reference does, or it halts out-of-gas earlier — never something different. Gas is an enforcement layer, not part of meaning, and the architecture keeps it that way.

The honest floor

A verification project earns its trust by being loud about what it doesn't prove. keccak is kept an uninterpreted function — precisely so the soundness proofs never depend on a hash implementation — which means there's no byte-exact state-root hashing, and fixture comparisons go account-by-account, field-by-field instead. ECDSA sender recovery isn't modelled; fixtures use their explicit sender. Contract-creation transactions are reported skipped. Every one of these is a ledgered deferral: it shows up as a visible SKIP or a documented bound, and the harness never prints PASS for something it didn't actually check. That discipline is exactly what makes the passes mean anything.

This is the fourth in a small series, and it's where the thread lands. One behavior, many views asks in what sense a pile of analysis tools are all talking about the same program; one substrate, many instruments turns that into a build recipe; prove more, run less follows the soundness proof until it becomes a certified optimizing compiler. This post is the substrate made concrete: an actual EVM you can hand a byte string, that runs it, and whose every claim about that run is a theorem rather than a test result. The flashcards below are the TL;DR.