veric-k-evm-pilot · executing bytecode

Running EVM bytecode inside a proof assistant

This repo doesn't just describe the EVM — it executes real bytecode through a Lean 4 interpreter whose soundness is a machine-checked theorem. The decoder, the state machine, the gas meter, and the conformance harness are all here, and the same evaluator that runs your bytecode is the one the proofs reason about.

✓ executes today Lean 4 + Mathlib ~146 post-Cancun opcodes run_sound theorem EEST fixture-conformant

01 What the EVM actually is

The Ethereum Virtual Machine is a stack-based virtual machine. A transaction hands it a contract's bytecode (a flat byte string) plus calldata, and execution is one small-step loop repeated until the program halts.

1 Decode
Read the opcode byte at the program counter pc. Most opcodes are one byte. The PUSH1…PUSH32 family carry 1–32 immediate bytes inline that get loaded onto the stack — so they're 2–33 bytes wide.
2 Dispatch & compute
Each opcode pops its operands off the 256-bit word stack (max depth 1024), computes, and pushes results. ADD pops two, pushes one; SLOAD pops a key, pushes the stored value; and so on.
3 Charge gas
Every opcode costs gas. Memory expansion and storage writes cost more. If gas would go negative, the frame halts out-of-gas and reverts. Gas is the EVM's halting guarantee and its DoS price.
4 Advance pc
Move pc forward by 1 (or 1 + immediate width for PUSH). JUMP/JUMPI write pc directly — but only to a byte tagged JUMPDEST, which is what keeps control flow honest.
5 Halt?
Loop until a halting opcode: STOP / RETURN (success, maybe return data), REVERT (undo state, return data), INVALID, or SELFDESTRUCT. Otherwise go back to step 1.

The state it touches

Volatile (per frame)

Stack — 256-bit words, depth ≤ 1024.

Memory — a byte array, zero-extended, grows on access.

Persistent

Storage — the SLOAD/SSTORE key→value map per account.

Transient — EIP-1153, cleared each tx.

World

Every account's balance, nonce, code, storage. CALL/CREATE push a new frame with its own stack, memory and gas budget.

Mental model: the EVM is a pure function from (world state, transaction) to (new world state, return data), computed by iterating a single small-step transition. That "single transition iterated" shape is exactly what makes it amenable to a one-evaluator-proved-once design.

02 Can we actually run bytecode? Yes.

The repo's CLAUDE.md still describes an "ALU-only" demonstrator — that's stale. What's actually here is a real, machine-checked EVM interpreter with a CLI to drive it.

kevm_pilot — run-bytecode
$ lake exe kevm_pilot run-bytecode 600560030100
stack: [8]
returnData: 0x
gas: 0

# PUSH1 5 ; PUSH1 3 ; ADD ; STOP → 5 + 3 = 8

$ lake exe kevm_pilot run-bytecode 600560030100 --gas 100
# same program, now under the gas-metered machine (OOG enforced)

$ lake exe kevm_pilot run-bytecode <hex> --calldata <hex> --json
{ "stack": [...], "output": "0x...", "halted": true, "status": "..." }

$ lake exe kevm_pilot run-fixture tests/state_test.json
PASS add_test :: account-by-account post-state matched

Two execution subcommands live in Main.lean:

Verified live in this session: feeding 6005600301… to the decoder produced the exact instruction stream push 1 0x05 · push 1 0x07 · add · … with a correct JUMPDEST table, and the execution machine (Evm.Step, Evm.Metered, Evm.Decode, Evm.Config) builds green. (The only red on the current WIP branch is an unrelated half-finished ABI proof in Evm/Abi.lean — not the executor.)

03 Step through a program

Here's the exact program above — 60 05 60 03 01 00 — executed one transition at a time. Watch pc walk the code and the stack grow and collapse. This is the loop from §1, and it's this loop that EvmAlgebra.run implements in Lean.

step 0 / 4
Stack (top = green)
empty
pc = 0gas used = 0running
Press Step to decode and execute the opcode under pc.

Note: gas figures here are illustrative (3 per PUSH/ADD) to show the meter moving — the real schedule lives in Evm/Schedule.lean and the metered machine Evm.Metered.aluEvmMetered.

04 How it works in this repo

Execution is a pipeline of four Lean modules. The shape mirrors the EVM yellow paper and KEVM cell-for-cell, but everything is a total Lean function so the kernel can reason about it.

Evm/Decode.lean — the full post-Cancun decoder

Decodes the entire ≈146-opcode set. Evm/Op.lean carries 72+ distinct opcodes plus the families collapsed into Fin-carrying constructors:

inductive Op
  | add | mul | sub | div | sdiv | mod | … | keccak256
  | address | balance | caller | calldataload | … | basefee | blobhash
  | mload | mstore | sload | sstore | tload | tstore | mcopy
  | jump | jumpi | pc | gas | jumpdest
  | push (n : Fin 33) (imm : Word)  -- PUSH0..PUSH32  0x5f–0x7f
  | dup  (n : Fin 16)              -- DUP1..DUP16
  | swap (n : Fin 16)              -- SWAP1..SWAP16
  | log  (n : Fin 5)               -- LOG0..LOG4
  | stop | return | revert | invalid | selfdestruct
  | create | call | callcode | delegatecall | create2 | staticcall

It handles PUSH-immediate consumption, truncated-PUSH-at-end-of-stream (zero-padding), and the JUMPDEST scan that marks valid jump targets.

Trust posture: the decoder is upstream of the soundness TCB. Its correctness is discharged by round-trip theorems (decode ∘ encode = id), all proved axiom-clean — not by trusting the code.

Evm/Config.lean — the machine state

One structure holding every cell step reads or writes, mapped to KEVM's <callState> / <block> / <network>:

structure Machine where
  stack  : List Word   -- head = top, soft bound 1024
  memory : MemSpec      -- total byte fn, default 0 outside touched region
  pc     : ℕ
  gas    : ℕ            -- OOG = would-go-negative

structure Account where
  balance, nonce, code
  storage          : WordWord   -- spec rep, default 0
  origStorage      : WordWord   -- pre-tx snapshot (refunds)
  transientStorage : WordWord   -- EIP-1153

Big maps (memory, storage, world) use the spec representation: a total function with a default, so there are no partiality side-conditions to carry through proofs. Word = BitVec 256.

Evm/Step.lean — the small-step dispatcher

One function decodes the opcode at pc once and dispatches to a per-bucket handler. The dispatch list is frozen; opcode buckets Steps/B1..B11 + Call own disjoint branches:

def stepConcrete : ConfigStepResult Config
-- decode env.code at machine.pc, dispatch decoded Op to Steps.Bn.step

inductive StepResult (σ)        -- the halting taxonomy
  | cont   (s : σ)                  -- keep stepping
  | halt   (s : σ) (out : ByteArray) -- STOP / RETURN
  | revert (s : σ) (out : ByteArray)
  | stuck                           -- underflow / INVALID / fuel-out

The decoder runs in exactly one place; buckets never re-decode, never edit the dispatch list, never add a constructor. That discipline is what keeps the proof obligations local.

Framework/Machine.lean — the one evaluator

The whole-program driver is a fuel-indexed iteration of step. Fuel is what makes it a total function even though real EVM programs loop:

structure EvmAlgebra (σ) where
  step : σ → StepResult σ

-- THE ONE EVALUATOR: iterate step until terminal or fuel runs out
def EvmAlgebra.run (M : EvmAlgebra σ) : Nat → σ → StepResult σ
  | 0,        s => .stuck
  | fuel + 1, s => match M.step s with
      | .cont s' => M.run fuel s'
      | r        => r

The concrete EVM is just aluEvm : EvmAlgebra Config. Run it with aluEvm.run 10000 c0. The very same run is instantiated by the abstract interpreters too — which is the whole trick (§06).

05 Two machines, one shape

There are deliberately two concrete interpreters — and keeping them apart is a design decision, not an accident.

aluEvm — gas-free reference

The soundness reference. No gas accounting at all, so the proofs about what a program computes never get tangled up with what it costs. This is what run-bytecode uses by default.

aluEvmMetered — gas-enforced

A separate machine that adds the static opcode-cost frontier and out-of-gas halting. Engaged by --gas N. It refines the reference: metered execution does exactly what the gas-free machine does, or it halts OOG earlier — never something different.

Why split them? Gas is an enforcement layer, not part of the meaning of a program. By proving the gas-free machine correct and then proving the metered machine refines it, you get a clean theorem about computation and a clean theorem about resource bounds, without either contaminating the other.

06 Why this is a big deal

Plenty of EVMs execute bytecode — geth, revm, evmone. What's rare is an EVM where the interpreter and its correctness proof are the same artifact, checked by a kernel you can't lie to.

One evaluator, two readings, proved once

The killer move: EvmAlgebra.run is parametric. Plug in Config and you get the concrete EVM. Plug in an abstract domain (Sign, Interval, …) and you get a static analyzer — the exact same run. Soundness is then a single theorem:

-- ★ THE FUNDAMENTAL LEMMA — proved ONCE, by induction on fuel
theorem run_sound (h : SoundEvm MC MA γ) :
    ∀ (fuel : Nat) (sA) (sC), sC ∈ γ sA →
      StepResult.Refines γ (MC.run fuel sC) (MA.run fuel sA)

Read it plainly: for every program, every input, and every step count, the concrete run is refined by the abstract run — provided each opcode's per-step transfer is sound (the SoundEvm witness). Add a new abstract domain and you only prove the per-opcode arms; whole-program soundness is free, with zero edits to the framework.

The kernel is the only trusted base. There's no runtime "congruence gate" that samples agreement and hopes. An unsound transfer function simply fails to produce the SoundEvm witness, and lake build goes red. A half-built or dishonest reading can't certify itself — it fails to compile a proof.

Conformance in three honest tiers

T1 — kernel theorems
The real trust base
Machine-checked Lean theorems (run_sound, per-domain soundness, decoder round-trips). Axiom-clean: only propext, Classical.choice, Quot.sound. No sorry.
T2 — spec vectors
KEVM & EIP discharges
Refinement against the KEVM reference and EIP test vectors — the machine agrees with the canonical spec on concrete cases.
T3 — corpus floor
EEST differential
The ethereum/tests state-test corpus, diffed account-by-account. A floor, additive to T1/T2 — never a sole soundness claim.

What you get that a normal EVM can't give you

  • Proven static analyses. A Sign or Interval analysis isn't "tested against the interpreter" — it's proved to over-approximate it for all programs and inputs.
  • No drift between analyzer and semantics. They literally share run. The concrete and abstract interpreters can't disagree about what an opcode means, because there's one definition.
  • A verified Yul→bytecode compiler (Evm/Yul.lean, compileStmt_correct & friends) — compiling then executing agrees with evaluating the source, control flow included, proved.
  • Self-auditing. kevm_pilot proof-status --check walks every soundness theorem and collects its axioms at runtime, asserting kernel-cleanliness. The proof of trustworthiness ships with the binary.

07 Honest limits — the ledgered deferrals

A verification project earns trust by being explicit about what it does not prove. These are tracked non-executions, never faked passes.

GapWhat it means
keccak256 is opaqueKept an uninterpreted function so soundness proofs never depend on a hash implementation. Consequence: no byte-exact state-root hashing — fixture checks compare accounts field-by-field instead.
ECDSA sender recoveryNot modelled; fixtures use their explicit sender field rather than recovering it from the signature.
CREATE-as-transactionA fixture whose to is empty (contract creation) is reported skipped, not executed.
d/g/v vector expansionThe EEST data/gas/value index matrix is reduced to the index-0 entry by the fixture decoder.

The discipline: every gap is a visible SKIP or a documented bound — the harness never prints PASS for something it didn't actually check. That honesty is what makes the PASSes worth anything.

08 Try it yourself

# first time: grab the prebuilt Mathlib cache
lake exe cache get

# build the framework, EVM frontend + domains, and the CLI
lake build Framework Evm kevm_pilot

# run some bytecode  (PUSH1 5; PUSH1 3; ADD; STOP)
lake exe kevm_pilot run-bytecode 600560030100

# run it metered, with calldata, as JSON
lake exe kevm_pilot run-bytecode <hex> --gas 21000 --calldata <hex> --json

# run an ethereum/tests state-test fixture
lake exe kevm_pilot run-fixture path/to/fixture.json

# audit the soundness theorems' axiom sets at runtime
lake exe kevm_pilot proof-status --check

Where the pieces live

ModuleHolds
Framework/Machine.leanEvmAlgebra, the one fuel-driven run, StepResult, and run_sound — the keystone.
Evm/Op.lean · Evm/Decode.leanThe ≈146-opcode vocabulary and the round-trip-proved bytecode decoder.
Evm/Config.leanThe machine state: stack, memory, gas, world, exec env, block context.
Evm/Step.lean · Evm/Steps/B1..B11The small-step dispatcher and per-bucket opcode handlers.
Evm/Metered.lean · Evm/Schedule.leanThe gas-enforcing machine and the hardfork gas schedule.
Evm/Fixture.lean · scripts/eest_consume.pyThe EEST state-test consumer and pass-rate harness.
Main.leanThe kevm_pilot CLI: run-bytecode, run-fixture, proof-status.