Giving RISC-V 1024 registers for zkVMs

General purpose zkVMs like OpenVM, SP1, ZisK pick RISC-V as their main source ISA because of its simplicity. It is a modern, minimal, well-understood ISA with mature toolchains. But RISC-V was designed for hardware: 32 general-purpose registers is a reasonable number when spills go to cache and cost you a few cycles.

In a zkVM, memory is far more expensive, and the usual compiler toolchains are not aware of this vastly different cost model. Every load and store has to be proven, and the memory consistency argument makes up a significant share of proof cost for compute-heavy programs. That 32-register limit, harmless on hardware, becomes an issue.

At powdr we built crush, which compiles WebAssembly to a custom ISA with infinite registers and zero spills, leading to much faster proof times than RISC-V. I recently presented crush and powdr-wasm at EthProofs Beast Mode Day; my slides can be found here. We have also published a dedicated post about crush and powdr-wasm, so I am not going to focus on that here.

Instead of compiling from WebAssembly to a custom ISA, there are at least two other ways to try to remove spills, both of which we had previously ruled out:

During Beast Mode Day I had a few discussions about these not-chosen ideas, and I was especially curious about item 2. What happens if we keep RISC-V, but try to give it infinite (or at least many more) registers? LLVM IR already uses infinite virtual registers internally; the register allocator's job is to map them onto a finite physical register set. If we give it a large register file instead of 32, it should succeed without spills for most real-world functions. We chose 1024 registers because 1024 register indices fit in 10 bits, exactly double the 5 bits RISC-V reserves for each register field in the instruction encoding, and doubling the instruction width seemed like an easy hack to make room for them.

Doing this as a weekend experiment would not have been an option last year when we first thought about this, but fortunately we can now rely on AI for that. I asked Claude to implement this in LLVM and wire it up to OpenVM. The experiment was successful, and we got proofs for our custom RISC-V, which Claude named RISCV-X.

What spills look like on RISC-V

The benchmark uses tiny_sha3 (Saarinen’s FIPS 202 reference implementation in C, commit dcbb319, byte-identical to the upstream source) plus a freestanding driver that runs 1000 SHA3-256 iterations on a buffer starting as 32 zero bytes. We compare instructions, zkVM trace cells and proof time for the standard RISC-V version vs our RISCV-X extension.

Here’s the opening of sha3_keccakf (the keccak permutation) compiled to standard RISC-V:

sha3_keccakf:
    addi sp, sp, -272
    sw   ra,  268(sp)          # callee-save spill
    sw   s0,  264(sp)
    sw   s1,  260(sp)
    sw   s2,  256(sp)
    sw   s3,  252(sp)
    sw   s4,  248(sp)
    sw   s5,  244(sp)
    sw   s6,  240(sp)
    sw   s7,  236(sp)
    sw   s8,  232(sp)
    sw   s9,  228(sp)
    sw   s10, 224(sp)
    sw   s11, 220(sp)          # 13 callee-save spills
    li   t3, 0
    lw   t2, 0(a0)
    lw   t4, 4(a0)
    lui  a1, %hi(.L__const.sha3_keccakf.keccakf_rotc)
    addi a1, a1, %lo(.L__const.sha3_keccakf.keccakf_rotc)
    addi a2, a1, 4
    sw   a2, 12(sp)            # immediately spill a constant pointer
    addi a1, a1, 96
    sw   a1, 8(sp)             # spill another
    ...

The function opens with 13 callee-save spills, then starts loading the keccak state and constant pointers and immediately spilling them, because it’s already out of registers. The keccak state is 25 × 64-bit lanes (50 × 32-bit on RV32). The function has to juggle all of them plus temporaries and constant pointers into 32 architectural registers.

Counting across the whole sha3_keccakf:

Every one of those 222 memory ops is pure overhead in a zkVM: they exist only because we ran out of registers.

The same code on RISC-V with 1024 registers

Claude added a +xregs1024 subtarget feature to LLVM’s RISC-V backend: 1024 GPRs, a 64-bit instruction encoding (to fit 10-bit register fields), and a calling convention where everything except ra is caller-saved. We used the same tiny_sha3 C source with -mattr=+xregs1024 added to the llc invocation.

Here’s the same sha3_keccakf prologue:

sha3_keccakf:                  # no stack frame at all
    li   a1, 0
    lw   t1, 0(a0)
    lw   t2, 4(a0)
    lui  a6, %hi(.L__const.sha3_keccakf.keccakf_rotc)
    addi a6, a6, %lo(.L__const.sha3_keccakf.keccakf_rotc)
    lui  a2, %hi(.L__const.sha3_keccakf.keccakf_piln+4)
    addi a2, a2, %lo(.L__const.sha3_keccakf.keccakf_piln+4)
    li   a3, 64
    li   a4, 32
    addi a5, a6, 4
    addi a6, a6, 96
    lui  a7, %hi(.L__const.sha3_keccakf.keccakf_rndc)
    addi a7, a7, %lo(.L__const.sha3_keccakf.keccakf_rndc)
    li   t0, 24
    j    .LBB0_2
    ...

No prologue and no spills. All the constants live in registers, the loop counter lives in a register, and inside the round body the 50 state words sit in x32–x80.

The full numbers for this function:

So sha3_keccakf needs about 80 registers to run without spilling, and when you give the LLVM register allocator that many to work with, it actually uses them. This was somewhat expected but still nice to see the register allocation being that generic. The standard 32-register RISC-V forces it to shuffle values back and forth through memory for the entire function.

Running it in OpenVM

The interesting part is what happens when a zkVM tries to prove this. Since the wider registers don’t fit in the standard 32-bit RISC-V instruction format, the binary uses a 64-bit encoding for every instruction.

Claude had to modify OpenVM to accept the new binary format: a new transpiler extension that decodes the 64-bit instructions, widened register byte offsets internally (a u8 → u16 cleanup across the circuit code), and an expanded register file. The instruction set itself didn’t change. We use the same opcodes, same semantics, just wider register identifiers.

Then we fed both binaries through OpenVM (v1) and generated actual proofs for a benchmark that runs SHA3-256 1000 times in a row over a 32-byte buffer that starts as all zeros. The final hash is 52cf48e88ce4dea40f272b6aaf083675ade26504a0129f51ec30204a2fdb1c5b. Both the baseline and extended ELFs produce exactly this, matching Python’s hashlib.sha3_256.

For the OpenVM guest specifically, we use freestanding C and link directly with ld.lld. Wiring the full Rust + OpenVM toolchain through our modified backend would need a custom rustc sysroot, which is considerably more work, and the current results felt sufficient for the experiment.

The proof metrics

Aggregated across all AIRs, from the raw metrics JSON of the two runs (1000 SHA3-256 iterations, 4 proof segments, STARK app proofs without recursion):

    Metric                           Baseline (32 regs)   Extended (1024 regs)   Change
    Executed instructions            35,896,008           31,067,004             −13.5%
    total_cells (after padding)      4,049,774,760        3,853,796,520          −4.8%
    total_cells_used (filled trace)  3,275,912,933        2,829,414,701          −13.6%
    Total proof time                 130.8 s              127.7 s                −2.4%

We get ~14% fewer trace cells actually used in the proof. The allocated total_cells number drops less because traces are padded to powers of two, and proof time tracks the final trace cells only loosely, improving by a more modest 2.4% here.

The savings show up in fewer memory operations proved: every spill/reload in the baseline is a load and a store that have to be accounted for in the memory argument, and those are gone. It is a real improvement, but not that large.

A detailed analysis can be seen in our metrics viewer.

Where the spills are (and aren’t)

The assembly-level data showed 222 memory ops eliminated inside sha3_keccakf alone. Yet the end-to-end reduction is only ~14%, whereas crush can achieve up to 50% savings vs RV32. Why not more?

Because this experiment solves the easy half of the problem: spills caused by register pressure inside a single function, where there are plenty of live values and nowhere to put them. Give the register allocator a bigger register file and it stops juggling.

What it doesn’t solve is the other half: spills across function calls. Even with 1024 caller-saved registers, when one function calls another, the caller has to save any live value it wants to see after the call returns. The callee is free to clobber everything. So if there are values live across a call, they land on the stack. In particular, ra always does, because every call overwrites it.

This is where crush is structurally different. In crush, each function gets its own disjoint slice of the infinite register space via a frame pointer. Caller and callee never share registers (except, by design, in optimized cases for passing outputs), so there's nothing to save and nothing to restore. It's true frame separation, not a convention on top of a shared register file.

RISC-V, like every low-level ISA, has a single flat register file shared across every stack frame. The calling convention is a cooperative agreement about who saves what, but it’s always a convention on top of a shared resource. You can make all registers caller-saved (which we did) or all callee-saved, or split them, but you can’t escape the fact that a call boundary is a place where live values have to move somewhere if the two functions want to use the same registers. And the register allocator can’t coordinate across function boundaries without whole-program analysis.

Frame-based register partitioning, like crush does, needs to be designed into the ISA from the start. You need some equivalent of a frame pointer that shifts the register window, or literally infinite registers with SSA-style naming. Retrofitting it onto RISC-V through the existing calling convention mechanism doesn’t work, because the mechanism itself is what leaks spills across call sites.

Conclusions

Results

The experiment succeeded at what it set out to do: register pressure inside functions goes to zero. That's a meaningful improvement on real workloads (~14% fewer trace cells on keccak, and likely more on compute-heavier workloads). However, the calling-convention problem remains, which crush solves well, and solving it requires a different kind of ISA.

LLVM 🤝 Claude

The current diff to LLVM contains less than 500 LoC. I did not think this experiment was possible with so little code, so kudos to LLVM!

Claude was impressive at this task. It seemed to understand LLVM quite well, and got a prototype working with small C programs in less than 10 minutes (!!!). It feels insane to me that I can spawn an experiment like this in a day without writing a single line of code (I still read the code and checked that the computed hashes were correct). I’m truly excited about the improvements we will be able to make just from raw ideas and markdown files.

All code and reports were written by Claude.