Compiling Down to Physics: The Tower of Interpreters, All the Way Down
A while ago I argued that a compiler optimization is just runtime work you proved you don't have to do — that a precise-enough abstract interpretation hands you the proofs that license deleting it, and so "the more you prove, the less you run." That story stopped, politely, at native machine code. This is the piece where I refuse to stop there and ask the obvious question: why stop at the instruction set?
The answer turns out to reach all the way down to thermodynamics.
The interpretation tax, all the way down
A general-purpose CPU is the most expensive interpreter ever built — expensive because it is general. It interprets a stream of opcodes it has never seen before, so on every single instruction it must fetch, decode, check dependencies, rename registers, predict branches, and route operands through a datapath wide enough for any program. None of that work computes your answer. It is the cost of being ready to run any program — a tax you pay, billions of times, for flexibility you used once.
And it is not one tax but a stack of them. Your program is interpreted by the ISA; the ISA is interpreted by the microarchitecture; the microarchitecture is a pattern of gates; the gates are an abstraction over transistors switching; the transistors are an abstraction over charge moving through silicon. Every layer in that tower is one interpreter standing on another, and each adds its own overhead and its own act of forgetting what the layer above actually meant.
Once you see the tower, the thesis rewrites itself in one line: specialization is collapsing levels of the tower. Each level you collapse — by proving enough about the program to bake it into the layer below — is interpretation overhead deleted permanently. Compiling to native code collapses the top two rungs. Compiling to a circuit collapses most of them. Compiling to physics collapses the lot. Followed honestly, the endpoint of "prove more, run less" is not a faster instruction stream. It is no instruction stream at all — a physical artifact whose behaviour, by construction and by proof, is the program.
The descent, in numbers
This is not hand-waving; the orders of magnitude are measured, and they climb the further down and the more specialized you go.
- A GPU is the mildest move: a different general engine, trading scalar instruction dispatch for thousands of lanes running one kernel in lockstep — roughly 10× the perf-per-watt of a CPU on data-parallel work.
- A TPU collapses the microarchitecture rung for one kernel family. Jouppi et al.'s 2017 systolic-array paper measured 15–30× performance and 30–80× performance-per-watt over contemporary CPUs and GPUs — because a matrix-multiply done as pure spatial dataflow has no per-operation fetch, decode, or register file.
- An FPGA turns the computation into reconfigurable structure rather than an instruction stream. One study clocked a SHA-256 datapath beating an i7 / Xeon / Tesla V100 by 513× / 493× / 93× in latency.
- An ASIC removes everything general: just the exact gates the task needs. Bitcoin SHA-256 ASICs are over 100,000× more energy-efficient than the CPU/GPU hardware of Bitcoin's first year. The price is enormous, irreversible engineering cost, frozen at tape-out.
Every rung down is the same bargain in different proportions: you give up generality and pay a steeper one-time engineering cost, in exchange for efficiency and — past a point — a harder but more rewarding proof. Which is exactly why EVM bytecode is such a good target, again: the descent only pays when amortized over enormous execution multiplicity, and deployed contracts are immutable and run billions of times. The economics that make custom silicon absurd for a throwaway script make it overwhelming for a hot, frozen, gas-metered contract.
The proof keeps going down
The natural objection is that the machine-checked guarantee must evaporate the moment you leave the comfortable world of software semantics. It does not. The same discipline that proves "this optimized block refines the original" — a refinement/simulation relation checked by a kernel — has already been carried clean through high-level synthesis, RTL, and onto an FPGA. The tower has a verified spine.
The shape is invariant under the descent. At every rung the correctness statement is the same: lower(p) ⊑ p — the artifact one level down refines the meaning one level up, observationally. Stack the refinements and you get an end-to-end theorem from the source semantics to the gate-level netlist, each link independently kernel-checked. That is precisely the observational-refinement relation a project like mine already proves between its concrete interpreter and KEVM — repointed one rung lower, and then again, and again. CakeML running on the verified Silver processor, Lutsig (a verified Verilog compiler producing technology-mapped FPGA netlists), and Kami (modular Coq hardware proofs) are not thought experiments; they are shipping, machine-checked technology.
And the best leverage point survives the descent too: untrusted search, trusted certificate. Lakeroad lets an unverified program-synthesis engine propose an FPGA technology mapping, then formally verifies the proposal against the primitive semantics. ROVER lets aggressive e-graph rewriting propose a faster datapath, then validates each rewrite — and beats state-of-the-art EDA tools by ~30% in speed while doing it. You never trust the optimizer's cleverness, only a small re-checkable certificate. The clever part stays outside the trusted base, exactly as it did for software optimization.
Off the digital cliff
Everything above stayed digital — exact, discrete, bit-faithful. But the tower doesn't bottom out at gates. Underneath the digital abstraction is continuous physics, and you can compile directly into it, skipping the entire digital reconstruction.
Sara Achour's Arco (PLDI 2016) and Legno (ASPLOS 2020) take a dynamical system written as first-order ODEs and compile it onto a reconfigurable analog device — wiring up integrators, multipliers, and adders so that the device's physical evolution in real time is a solution to the equations. There is no clock, no instruction, no numerical integration loop. The capacitor charging is the integral. This is the purest form of the thesis: a digital ODE solver interprets the equation, representing state as bits and grinding an integration rule; the analog circuit is the equation. The program didn't run faster — it stopped being a program and became a law of motion.
Further still, physical neural networks (McMahon, Wright et al., Nature 2022) treat an arbitrary physical system — coupled optics, a vibrating plate, a nonlinear circuit — as a trainable computer, using physics-aware training to shape the substrate's own messy dynamics into doing the inference, automatically learning to exploit and tolerate its noise rather than fight it. The computation isn't mapped onto the physics; the physics is shaped until it is the computation.
The price of leaving the cliff is exactness. Above it, refinement is bit-for-bit equality. Below it, "correct" becomes "within a tolerance, under a noise model" — the proof obligation changes from an equation to an approximation bound, and the trusted base now includes a model of the device physics. A spectacular trade for a dynamical simulation; a category error for a ledger. Knowing which is which is the whole engineering judgement.
The physical floor
Is there any floor at all? Yes — and it is thermodynamic, not technological. Landauer's principle says erasing one bit of information must dissipate at least kT ln 2 of energy as heat — about 2.9 × 10⁻²¹ joules at room temperature. Crucially the floor applies only to irreversible operations: the ones that discard information. A logically reversible computation, in principle, need dissipate nothing. Modern processors run roughly a billion times above this limit.
That gives the thesis its deepest restatement. Dissipation is the cost of forgetting; every bit your computation throws away is a minimum quantum of heat you are forced to pay. So "prove more, run less" bottoms out as prove more about which information is genuinely discardable, and dissipate less — which is exactly what reversible and adiabatic computing chase. It closes the loop with the top of the tower beautifully: up there, a proof lets you delete a runtime operation; down here, the deepest reason an operation costs energy is that it erases information, and a proof that the information was redundant is, thermodynamically, a license not to pay for erasing it. Same idea — the units just changed from cycles to joules.
Futamura, generalized
The Futamura projections are the classical statement that an interpreter, a compiler, and a compiler-generator are not different kinds of thing — they are one operation, specialization, applied at different times. The whole arc of this piece is that the projection doesn't have to stop at software. A CPU is the residual of specializing physics to "be ready for any program." A circuit is the residual of specializing it to your program. An analog device is the residual of specializing it to your program's denotation. At each step you partially-evaluate one more layer of the tower against the fixed program, and the interpreter for that layer vanishes into structure. The logical endpoint of compiling-by-proving is:
Find a physical system whose natural evolution is the program's denotation — and let the soundness proof be the refinement that certifies the physics implements the spec. The program isn't run. It's built, and proven to be what was meant.
The honest accounting is that the costs are exactly the inverse of the gains. The fully-verified spine is mature down to RTL and FPGAs; verified-by-construction high-level synthesis is real but research-grade; analog, physical, and Ising-machine compilation genuinely work, but proving the physics matches the spec under a realistic noise model is the open frontier, not a checked theorem. The thesis is sound the whole way down. The proof technology is mature at the top and a research program at the bottom — and that gap is exactly the interesting work.
I built this out as a visual, interactive explainer — with a clickable tower of interpreters you can descend rung by rung to see the tax each one charges, a target selector that puts real measured numbers (TPU, FPGA, ASIC, analog) against what each collapse buys and costs, a slider that tracks the generality / efficiency / engineering-cost / proof-difficulty trade as you go down, and the full verified-synthesis lineage laid out as a spine.
→ Open the interactive explainer
It's the fourth in a small series. Where one behavior, many views asks in what sense all these tools talk about the same program, one substrate, many instruments turns that into a build recipe, and prove more, run less follows soundness into a certified optimizing compiler — this one follows the same thread one tower further, past the instruction set and off the digital cliff, to where "optimize the program" finally becomes "build the physics that is the program, and prove it." The flashcards below are the TL;DR.
Flashcards Available
This post includes 8 flashcards for spaced repetition. Download the deck to import into Anki.
How to import into Anki
- Download the .txt file above
- Open Anki and click Import
- Select the downloaded file
- Choose Basic card type
- Ensure fields map to Front and Back
- Click Import