Proofs meet performance — verified & proof-driven EVM optimization, and where we fit

01 · The landscape

One picture: trust on one axis, speed on the other

Every approach to making the EVM faster makes a trade between how much you trust the result and how much speed you buy. Plotting them this way is clarifying — and it shows exactly the empty corner our work targets. Click a point.

← unverified machine-checked → slower ··· faster

Compiling by Proving

Verified block-opt (Coq)

SuperStack

revmc (JIT/AOT)

◆ us (refinement)

Zhao, Hildenbrandt, Conejero, Zhao · arXiv 2509.21793 · Sept 2025

Compiling by Proving

The proof is the compiler. Construct a correctness proof by symbolic execution, then compile the proof's graph into fast rewrite rules. Top-right: very fast and correct-by-construction — but only for code you've paid to prove. Everett Hildenbrandt is a KEVM original author — this is built on the same K/matching-logic foundation as our reference oracle.

Positions are qualitative. The diagonal from bottom-left (fast, untrusted) to top-right (proven, fast) is the field's frontier; the bottom-right corner — fast native execution with a machine-checked certificate — is largely empty, and is where a refinement-based checker would sit.

02 · The headline result

Compiling by Proving — when the proof is the optimization

The most striking recent paper, and the one closest to us. The traditional pipeline is "write a compiler, then (maybe) verify it." This inverts it: the act of proving a program correct produces the optimized code as a by-product, so optimizer and proof can never disagree — they're the same artifact.

The mechanism, intuitively

A K semantics is modular: one EVM opcode is many small rules. The paper builds an All-Path Reachability Proof (APRP) — a directed graph whose vertices are execution states and whose edges are steps (sequential), covers (subsumption), and branches (conditionals) — then compiles that graph back into code by step compression: adjacent edges v_A →ᴹ v_B →ᴺ v_C collapse into one v_A →ᴹ⁺ᴺ v_C, erasing all the intermediate pattern-matching.

Their worked example is ADD. Executed modularly it fires five rules; compiled, those five become one:

1. validate stack depth

2. charge gas (G_verylow)

3. pop a, b · push a+b mod 2²⁵⁶

4. stack-pointer bookkeeping

5. advance program counter

⟶

one compiled
ADD rule
all 5 effects, atomically, no intermediate matching

Why it's correct by construction: the compiled rule is extracted from a proof that the five-rule sequence has exactly that effect. Compiled rules are installed at higher priority than the original modular rules, which stay as a trusted fallback — so the formal semantics is never abandoned, only short-circuited where a proof exists. This is the "one definition, many tools" K philosophy with the interpreter↔model gap closed: the fast interpreter and the formal model are derived from the same proof.

What it actually buys (their numbers)

2.19×

geo-mean speedup, EVM concrete execution (2,756 tests; median 2.00×)

5.29×

top-decile EVM opcode speedup

1.44×

symbolic execution, 70 proof tasks

522×

whole-program Rust/MIR loop (2438s → 4.7s)

The scaling is the real message: opcode-granularity compilation gives consistent modest speedups; compiling whole programs by proof gives orders of magnitude — because the bigger the proven region, the more intermediate steps get compressed away.

Why this matters to us specifically

It lives on the same K / matching-logic substrate as KEVM — our reference oracle — and is co-authored by a KEVM originator. We already have a K→Lean importer (task K0). Their thesis ("proofs and fast code are the same object") is the performance-side mirror of ours ("proofs and sound analysis are the same object"). The shared foundation means their proof artifacts are, in principle, in our universe.

03 · Verified optimization

Translation validation — let an unverified tool optimize, then check it

A different, very pragmatic philosophy from the Albert & Rubio group (Complutense Madrid et al., CAV 2023; extended 2025). Don't verify the optimizer — verify each optimization it produces. This is exactly the "untrusted search, verified certificate" split, and it's the strand most directly reusable by us.

① Optimizer (untrusted)

GASOL / SuperStack / solc peephole rewrites a jump-free block

→

② Coq checker

symbolically execute both blocks → symbolic states → simplify → compare

→

③ Verdict

✓ equivalent (accept) · ✗ not equivalent (reject + bug report)

The Coq framework has three pieces: a symbolic execution engine that runs a block to a symbolic state; a set of simplification lemmas that normalize symbolic states; and a checker that compares the two normalized states. If the optimized block's symbolic state matches the original's, the optimization is certified semantically equivalent — over all inputs, not sampled.

The payoff that proves the point: applied as a checker against real optimizers — the standard Solidity compiler's peephole pass, and the superoptimizers GASOL and SuperStack — it found genuine bugs, which were reported and fixed. That's the whole argument for the pattern: the clever optimizer can be wrong, and a cheap verified checker catches it without anyone having to verify the optimizer's internals.

Limitation worth noting: it's confined to jump-free blocks — straight-line sequences, no control flow. Whole-program / cross-loop optimization is out of scope, which is exactly where an invariant-aware analysis (ours) could extend it.

04 · The search engines

SuperStack & friends — squeezing gas out of stack bytecode

These are the "untrusted search" half of the pattern above. They don't prove anything; they search for the cheapest equivalent instruction sequence and lean on a separate checker (or testing) for safety.

SuperStack (PLDI 2024)

Finds optimal replacements for stack-bytecode blocks (EVM and WebAssembly). Three ideas: a greedy pass to bound the optimal length, a MaxSAT encoding of the optimization as weighted soft clauses, and domain-specific dominance/redundancy constraints to prune the search. Evaluated on 500k+ sequences; ~4× faster than prior superoptimizers with larger gas savings.

Albert, García de la Banda, Hernández-Cerezo, Ignatiev, Rubio, Stuckey.

Max-SMT superoptimizer (GASOL line)

Earlier work in the same lineage, handling memory and storage (not just the stack) via a Max-SMT formulation — important because real gas costs are dominated by SLOAD/SSTORE/MLOAD/MSTORE, which pure stack reasoning ignores.

The optimizer whose output the Coq checker (§03) validates.

Key framing: these tools are where the cleverness lives — and precisely the part you do not want to formally verify, because the search heuristics churn constantly. The pattern says: let them be brilliant and untrusted; verify the artifact, not the engine.

05 · The fast, untrusted corner

revmc & native JIT — speed bought with a sandbox, not a proof

At the opposite extreme from "compiling by proving": compile EVM bytecode straight to native machine code for raw throughput, and get safety from isolation rather than correctness proof.

revmc (Paradigm, 2024)

JIT/AOT-compiles EVM bytecode to native code via the LLVM backend, eliminating interpreter dispatch overhead. ~6.9× on compute-heavy workloads; integrated into reth behind --experimental.compiler; validated by syncing a node and checking the state root.

The trust model

The native code is generated from untrusted bytecode, so safety comes from a sandbox: the compiled code can't touch host memory and can only call a narrow, validated ABI. There is no proof that the native code matches the bytecode's semantics — correctness rests on the compiler being right and the state-root check at the tip.

The gap this leaves: a JIT this fast has no machine-checked equivalence to the EVM semantics — a miscompile would be a consensus bug. That missing certificate is exactly the bottom-right corner of the landscape map, and exactly what a refinement-based checker could supply.

06 · Where we fit

Our refinement machinery is a translation validator

The connecting insight: we built bN_refines / kevm_refines to prove our concrete interpreter ≈ KEVM. That is structurally identical to proving optimized block ≈ original block. The same observable-equivalence relation that powers Pillar B is a ready-made verified optimizer checker — and we'd be checking against a KEVM-faithful semantics, not a hand-rolled one.

field: translation validation (Coq, §03)

ours: ObsRel / obsOf / bN_refines

Our per-bucket observable-equivalence relation already relates two step semantics over a shared Config. Repoint it from "ours vs KEVM" to "original vs optimized" and it certifies block rewrites.

have the machinery

field: untrusted search + verified check

ours: tier-C certificate pattern

The exact split our abstract-interpreter doc names as tier-C. SuperStack/GASOL = the untrusted search; a Lean refinement proof = the certificate. Lean kernel is the only trusted base.

same philosophy

field: jump-free blocks only (§03 limit)

ours: whole-program invariants (Pillar A)

Our analyzer computes reachable-state invariants. A proven invariant licenses optimizations a block-local checker can't justify — e.g. "operands < 2¹²⁸ here ⇒ drop the wrap-check." Invariant-aware verified optimization.

extends the field

field: revmc native JIT, no proof (§05)

ours: proof-carrying compilation

Emit a Lean-checkable certificate that JIT output matches bytecode semantics — filling revmc's empty corner. The compiled artifact ships with its own proof.

research lane

field: Compiling by Proving (K, §02)

ours: K→Lean importer (K0)

Same K/matching-logic substrate as KEVM and our oracle. Their proof-derived rules and our Lean refinements share a source of truth; the importer is the bridge.

shared foundation

field: gas as the optimization target

ours: gas/refund refinement (D11)

Verified gas optimization needs a verified gas model. Our D11 gas refinement is exactly the cost semantics a certified gas-optimizer checker would measure against.

on the roadmap

The honest scope note

Our charter is machine-checked soundness of analysis, off the product critical path — not shipping a faster client. So this literature is relevant as a consumer of our certificates and a validation of our architecture, not a pivot. The strongest concrete play is positioning the refinement machinery as a verified checker for unverified EVM optimizers — a natural extension of the B-track that turns "we proved our interpreter = KEVM" into "we can prove anyone's optimization preserves KEVM semantics."

07 · A concrete design sketch

Refinement-as-optimizer-checker, worked end to end

Here's how the B-track relation becomes a block-equivalence checker, with a real peephole rewrite and the negative test that makes it trustworthy. This reuses the exact shape of RefB7.lean — just block-to-block instead of step-to-KEVM.

Run both blocks from the same symbolic Config via our KEVM-faithful step, read off the observable (final stack / memory / storage / gas), and demand they agree. The relation is the one we already wrote for Pillar B.

-- the checker is just our observable-equivalence relation, block-to-block
def blockObs (ops : List Op) (c : Config) : Option (Obs Word) :=
  obsOf (EvmAlgebra.run aluConcrete (ops.length) (c.withCode ops))

def Optimizes (orig opt : List Op) : Prop :=
  ∀ c : Config, ObsRel (blockObs orig c) (blockObs opt c)   -- equal over ALL configs
                                                       -- (a gas-aware Obs adds: gas(opt) ≤ gas(orig))

Prove Optimizes orig opt axiom-clean and you have a machine-checked certificate that the rewrite preserves KEVM semantics — the same trust level as every bN_refines.

The classic peephole win: adding zero is a no-op. x PUSH1 0 ADD ≡ x. The optimizer drops two instructions and the G_verylow gas they'd cost.

original

...        ; stack top = x
PUSH1 0x00 ; stack: x, 0
ADD        ; stack: x+0 = x

optimized

...        ; stack top = x
(removed)
(removed)

theorem add_zero_optimizes : Optimizes [.push 0, .add] [] := by
  -- both leave the stack as `x :: rest`; ADD's denotation is `x + 0 = x` in Word
  intro c; simp [blockObs, EvmAlgebra.run, aluConcrete, Word.add_zero, ObsRel]
#print axioms add_zero_optimizes   -- ⊆ {propext, Classical.choice, Quot.sound} ⇒ certified

Accept. The proof goes through because Word.add_zero is a theorem. The optimizer's rewrite is now machine-checked equivalent over every input — exactly what the Coq checker does, but against our KEVM-faithful semantics.

Now a plausible-looking but wrong rewrite — the kind real superoptimizers have shipped. Someone "simplifies" SWAP1 SWAP1 (a true no-op) to nothing... but mistakenly applies the same rule to SWAP1 SWAP2, which is not the identity. The checker must reject it.

original — stack: a, b, c

SWAP1  ; b, a, c
SWAP2  ; c, a, b   (NOT a permutation-identity)

optimized (BUGGY)

(removed — wrongly treated as no-op)
(removed)          ; claims: a, b, c

-- the kill-#4-style negative test: the bad rewrite must FAIL to certify
theorem swap12_not_optimizable_to_nil :
    ¬ Optimizes [.swap 1, .swap 2] [] := by
  -- exhibit one config where observables differ: stack [a,b,c] with a≠c
  intro h
  have hbad := h (Config.ofStack [1, 2, 3])
  simp [blockObs, EvmAlgebra.run, aluConcrete, ObsRel] at hbad   ; [3,2,1] ≠ [1,2,3] ⇒ contradiction

Reject. The certificate fails to prove — lake build goes red — precisely because the rewrite changes the observable on [a,b,c]. This is the same red-on-unsound discipline as every bucket's negative test: a wrong optimization cannot certify itself.

What this would take to build: the relation and run already exist; the work is (1) a block-to-block Obs with a gas component (rides on the D11 gas refinement), (2) an ingest path from an optimizer's (orig, opt) pairs through our decoder, and (3) for non-trivial blocks, a symbolic Config + the simplification lemmas the Coq group built — the genuinely new proof engineering. The result: a verified checker that accepts SuperStack/GASOL/solc output (or rejects it with a counterexample), backed by the Lean kernel and KEVM-faithful semantics.