From analysis to a certified compiler — the more you prove, the less you run

01 · The thesis, made tangible

Optimization is runtime work you proved you don't have to do

Here is one EVM basic block and the runtime work an interpreter does to execute it. Each proven fact below is something an abstract interpreter (or a refinement proof) in our system could establish. Toggle the facts and watch runtime work get struck off — the gas/dispatch you no longer have to pay because a proof already settled the outcome.

Runtime work for one block (push 0; calldataload; push 0x20; add; iszero; jumpi → dead branch)

Each line is work the bare interpreter pays on every execution. Cost in arbitrary "interpreter units".

Proven facts (click to apply)

Each fact is a machine-checked theorem our framework can produce.

100of 100 units · 0% proved away

No facts proved — the interpreter does all the work, every time.

The principle: soundness lets the analysis run; precision is what deletes work. A fact only pays when it collapses a runtime choice into a static certainty — a singleton value, an unreachable branch, a no-overflow guarantee, a statically-known gas total. "Prove more" must mean "prove more of those."

02 · Why it's almost "compiling by proving"

Two different proofs, one enterprise

"Is abstract interpretation the same as compiling by proving?" Almost — they're the two halves. One produces facts; the other turns facts into faster code while preserving meaning.

produces facts

Abstract interpretation

Proves a property: "reachable states here ⊆ γ(a)". A one-sided over-approximation. It constrains traces but does not, by itself, change the program. This is run_sound / analyze_sound in our framework.

→

preserves meaning

Partial evaluation / compiling-by-proving

Proves an equivalence: "this fast residual = running the interpreter on the program". Specializes the interpreter to the program (Futamura's 1st projection). Needs no input facts at all.

The hinge: a precise-enough over-approximation becomes an equivalence

If the interval domain proves the top of stack is the singleton {0}, the abstraction has collapsed to a point — and now folding [push k, add] → [push k] isn't merely licensed, it's forced. The invariant got strong enough to specialize the code. That is where the two halves meet: AI hands you the fact, partial-evaluation-style rewriting cashes it in, and the rewrite itself must be proven semantics-preserving.

Futamura (1971) showed specialize(interp, program) = a compiled program; specialize(specialize, interp) = a compiler. GraalVM/Truffle is the production realization. Abstract interpretation supplies the static input knowledge that makes specialization residualize less code — so the two stack: compress the interpreter to the program, then specialize further using proven invariants.

03 · The whole optimization catalogue, re-read

Every optimization is "a proof → deleted runtime work"

This isn't a metaphor. Walk the textbook list and each entry is literally a fact an analysis proves plus the runtime work that fact lets you remove. Click one.

Select an optimization above.

Read the middle column as our domains: "index ∈ [0,len)" and "no overflow" are interval results; "guard always false" is a sign/interval result; "value is constant" is a singleton in any domain. The optimizations our analyses can license are exactly the facts our domains can prove — which is why adding a sharper domain is the same act as unlocking a new class of optimization.

04 · The field already lives here

Three disciplines for trusting an optimization

Decades of work attack the same problem — "how do I know the fast code still means the same thing?" — in three distinct ways. They differ in what you trust. Filter by discipline.

The keystone result for us: Lesbre & Lemerre, "Compiling with Abstract Interpretation" (PLDI 2024)

They prove the equivalence outright: an abstract interpreter, structured as a free algebra over the standard abstract-domain signature, is a compiler — functor domains become compiler passes, and soundness of the analysis translates directly to a forward-simulation proof of the compiler (completeness ↔ backward simulation). This is the precise, mechanizable form of the intuition driving this whole doc: the analysis and the compiler are the same artifact viewed twice.

05 · The architecture that wins

Untrusted search, trusted certificate

Across every strand — SuperStack's MaxSAT search, Alive2's SMT checker, Compiling-by-Proving's rule consolidation, CompCert's validated register allocator — the field converged on one shape. Let an aggressive, unverified optimizer propose; prove the result correct after the fact. It decouples optimization power from optimization trust.

① Analyze

Our abstract interpreter runs on the bytecode; run_sound yields certified facts per PC.

→

② Propose (untrusted)

Any optimizer — superoptimizer, revmc lowering, an LLM — emits a candidate fragment. Zero trust required.

→

③ Certify (Lean)

Prove Optimizes orig opt using the facts. This is our refinement relation, repointed.

→

④ Emit or reject

Proof checks → ship code + kernel-checkable certificate. Fails → fall back to the original.

The thing we already built: every optimization is a refinement — opt ⊑ orig observationally. That is exactly the shape of bN_refines / ObsRel, which today proves "concrete interpreter refines KEVM." Repoint the same relation at "optimized refines original" and you have an optimizing-compiler certifier with no new metatheory — the kernel that checks soundness also checks the optimization.

06 · The sharpest single instance

One block, four proofs, native-speed output

Take the block from §01. The bare interpreter dispatches each opcode, meters gas per op, does a full 256-bit modular reduction on the ADD, and tests the JUMPI branch at runtime. Four certified facts collapse all of that.

Interpreted — every execution

; per-op dispatch + gas meter + checks
PUSH1 0x00      ; dispatch, +3 gas
CALLDATALOAD    ; dispatch, +3 gas
PUSH1 0x20      ; dispatch, +3 gas
ADD             ; dispatch, +3 gas,
                ;   256-bit modular reduce
ISZERO          ; dispatch, +3 gas
PUSH2 .dest     ; dispatch, +3 gas
JUMPI           ; dispatch, +10 gas,
                ;   runtime branch test
.dest:          ; dead target — never taken
  ...revert...  ; compiled & kept anyway

Compiled — proofs cashed in

; one straight-line native block
debit 28 gas      ; D11: block gas is
                ;   statically known → 1 debit
mov   r0, [calldata+0]
add   r0, 0x20    ; interval: no overflow →
                ;   native add, no mod-reduce
; ISZERO+JUMPI: interval proves
;   guard always false → branch
;   & .dest block DELETED (D12 CFG)
fallthrough     ; no dispatch, no per-op meter

The four facts

No-overflow (Interval over Word): the add never wraps → lower to a native machine add, drop the modular reduction.
Guard always false (Sign/Interval): ISZERO result is statically 0 → JUMPI never taken → delete the branch test and the dead target block.
Static gas (D11 gas refinement): block gas is a constant → replace 8 per-op meter updates with one upfront debit.
Straight-line CFG (D12 CFG refinement): the block has a single successor → emit one native basic block, no per-op dispatch.

Why EVM is the ideal target

Proving is expensive and only pays when amortized. EVM bytecode is immutable (deployed once), executed billions of times, and every op is gas-metered — so you pay the proof cost once at deploy/compile and win on every execution forever. For an AOT contract compiler the amortization is overwhelming; this is exactly why the recent EVM-perf work clusters here.

07 · From here to a certified compiler

What we'd build, mapped to the backlog

We already have the concrete semantics, certified analyses, and a refinement relation. Three things turn that into a certifying optimizer — and two of them are already on the board.

have · certified analyses

need · run them on real bytecode

Today the abstract CFG fixpoint runs over toy WHILE/ℤ. The driver that runs it over real Config is the prerequisite for producing facts on actual contracts.

#112 D-uplift-2

have · ObsRel / bN_refines

need · Optimize/Optimizes.lean

Define Optimizes orig opt := ∀ c, ObsRel (run orig c) (run opt c) and a growing library of fact-conditioned rewrite lemmas (fold, dead-branch, no-wrap lowering, LICM).

new layer · no new metatheory

have · gas/CFG refinement designs

need · the licensing facts

D11 (static gas) and D12 (CFG) are exactly the facts that license the per-op-meter and dispatch deletions in §06. Proving them for the buckets feeds the optimizer directly.

#109 D11 · #110 D12

have · nothing yet

need · untrusted optimizer driver

Wrap an external optimizer (superoptimizer, revmc LLVM lowering) so its proposals are checked against Optimizes. Its internals stay outside the TCB.

research

have · kernel-checked proofs

need · proof-carrying output

Emit optimized bytecode/native plus the Lean certificate. A consumer re-checks the small Optimizes proof — never trusts our search.

research

have · KEVM refinement (B-track)

need · same relation, new target

The B-track already proves "concrete refines KEVM". The optimizer certifier is the identical relation aimed at "optimized refines concrete" — the metatheory transfers wholesale.

structural reuse

Sketch of the new layer (the relation already type-checks in spirit — it's ObsRel quantified over inputs):

def Optimizes (orig opt : List Op) : Prop :=
  ∀ c : Config, ObsRel (run orig c) (run opt c)

-- a good rewrite: provable, ships
theorem add_zero_optimizes : Optimizes [.push 0, .add] [] :=
  -- discharged by Word.add_zero — observably identical
  by intro c; simp [run, Word.add_zero, ObsRel]

-- a bad rewrite: the proof simply fails (kill #4, negative test)
theorem swap_not_nil : ¬ Optimizes [.swap 1] [] := -- swaps are observable

08 · The honest floor

Where "prove more, run less" stops

The principle is real but not unbounded. Four limits keep it honest.

L1Precision, not quantity, is the currency▸

You can always prove a sound but useless invariant (⊤ everywhere) — it licenses nothing. Optimization power is bounded by the precision of the analysis, which is the undecidable, expensive part. Soundness is the cheap half; the hard, open-ended research is in domains precise enough to yield singletons, unreachability, and no-wrap on real code.

L2There is an input-dependence floor▸

Work that genuinely depends on inputs — a CALLDATALOAD-driven branch, an external call's return — can be bounded by analysis but never eliminated. The residual after specialization is irreducible: it's the program's intrinsic uncertainty. "Less at runtime" asymptotes to the real information the program needs from its inputs.

L3Proof cost must be amortized▸

Proving can be slow or undecidable; it only pays when amortized over many executions. AOT contract compilation is the perfect case (prove once, run billions). A JIT is harder — the proof must be cheaper than the execution it saves. Our domain (immutable, gas-metered, hot bytecode) sits squarely in the favorable regime.

L4The trusted base is a real design choice▸

revmc lowers to LLVM and trusts LLVM + a sandbox — fast to build, large TCB. Our framing puts the kernel in the TCB instead: the optimization certificate is checked by the same minimal kernel that checks soundness, so no new trust is added. The cost is that every optimization must produce a kernel-checkable proof — which is exactly why the "untrusted search + verified certificate" split (§05) matters: it keeps the clever part outside the proof.

The one-sentence version: abstract interpretation is the fact-production half, compiling-by-proving is the fact-consumption half, and our ObsRel refinement is already the relation that certifies the consumption — so "the more we prove, the less we run" is, for us, less a slogan than a build plan: #112 to produce facts, an Optimize/ layer to consume them, and the existing kernel to keep it all honest.