Chapter 10 — Sparkle Architecture

The previous nine chapters used Sparkle from the outside — writing designs, synthesising, proving. This chapter looks at the insides: how a Lean source file becomes SystemVerilog, when to choose JIT simulation versus Verilator, where to extend the compiler, and the cheat-sheet of macros / commands you'll reach for in real work.

It's a tour, not a tutorial — no exercises, just orientation.

import Sparkle
import Display

namespace Notebooks.Ch10

10.1 The compilation pipeline

  .lean source (Signal DSL)
         │
         │  Sparkle/Compiler/Elab.lean
         │  (Lean metaprogramming → IR)
         ▼
   Signal IR  (Sparkle/IR/AST.lean)
         │
         ├──→ Sparkle/Backend/Verilog.lean → SystemVerilog (Yosys, Vivado, ...)
         ├──→ Sparkle/Backend/CppSim.lean  → C++ JIT (Verilator-class speed)
         └──→ Sparkle/Compiler/DRC.lean    → Design Rule Check (timing safety)

All three backends share the same IR. Adding a fourth backend (e.g. PTX, OpenCL, Vivado HLS) means a new file in Sparkle/Backend/, no upstream-IR changes.

What Signal dom α actually is

A Signal dom α is, semantically, a Nat → α — a stream of per-cycle values. Operationally it's a thin wrapper around a closure or a memo table; the compiler decides which. The type carries a dom : DomainConfig parameter so signals from different clock domains can't be accidentally mixed (CDC safety at the type level).

10.2 JIT vs Verilator vs synthesis — when to use each

Use casePickSpeedPre-reqs
Quick #eval on a 1k-cycle traceSignal.sampleµs/cyclenothing
Million-cycle simulationSignal.loopMemons/cyclenothing
Long simulation in C++ JITSparkle.Backend.CppSimC-classg++ on PATH
Cycle-accurate trace + waveformsVerilator pipelineC-classVerilator on PATH
Real siliconYosys + nextpnrofflinetoolchain (Ch 8/9)

Rule of thumb: author with `circuit do`, debug with `.sample`, scale with `loopMemo`, ship with Verilog. The JIT path is for when you need real-time-ish simulation (RV32 SoC running Linux runs at ~14M cycles/s on the JIT — fast enough that booting Linux is feasible inside CI).

10.3 Reset and clock conventions

Sparkle's generated SystemVerilog has a single clock (clk) and an active-high reset (rst). The reset is either synchronous or asynchronous depending on the register's owning DomainConfig.resetKind:

-- .synchronous (the standard domains: defaultDomain / domain50MHz / domain200MHz)
always_ff @(posedge clk) begin
  if (rst) <output> <= <init>;
  else     <output> <= <next>;
end

-- .asynchronous (write `{ ..., resetKind := .asynchronous }`)
always_ff @(posedge clk or posedge rst) begin
  if (rst) <output> <= <init>;
  else     <output> <= <next>;
end

The kind is wired through the IR: Stmt.register carries (reset : String × ResetKind), the elaborator reads Reset dom's dom.resetKind and stamps it onto the IR, and Sparkle/Backend/Verilog.lean switches the sensitivity list on the kind. No global flag — the choice is per-register, set by the domain.

Why the synchronous default? It matches what most ASIC flows expect (synchronous reset = fewer timing-arc surprises), is easy to retime, and DC- / Yosys-friendly. Async is available when the FPGA primitive (e.g. iCE40 SB_DFFR) wants it directly.

There is no multi-clock support in the generated Verilog itself — every module uses the single clk port. Sparkle's CDC module (§10.3b above; docs/architecture/CDC.md) handles cross-domain communication at simulation time via the lock-free SPSC runtime, not via multi-clock RTL.

10.3b CDC — lock-free multi-clock simulation

§10.3 said Sparkle's generated Verilog is single-clock. But real chips often have several clocks — a 100 MHz CPU talking to a 50 MHz peripheral, an audio domain, an FPGA-board PLL. Sparkle's simulator handles that today with a small lock-free runtime; this section explains how, why it works, and where the trade-offs live.

The user-facing entry point is JIT.runCDC:

def JIT.runCDC
    (handleA  handleB  : JITHandle)
    (cyclesA  cyclesB  : UInt64)
    (outPortA : UInt32) (inPortB : UInt32)
    : IO (UInt64 × UInt64 × UInt64)
    -- = (messagesSent, messagesReceived, rollbacks)

Each handle is a JIT-built simulator (Ch 8b §8b.3). runCDC spawns two threads — one per domain — connected by a single-producer single-consumer queue.

How it works in one diagram

        Thread A (e.g. 100 MHz)              Thread B (e.g. 50 MHz)
        ┌─────────────────────┐              ┌─────────────────────┐
        │  jit_eval_tick      │              │  pop CDCMessage     │
        │  read outPortA      │              │  if msg.ts < localT │
        │  push {ts,payload}  │   SPSC queue │     restore(snap)   │
        │     to queue        │ ───────────▶ │     localT = msg.ts │
        │  ts += 1            │              │  jit_set_input      │
        │                     │              │  jit_eval_tick      │
        └─────────────────────┘              └─────────────────────┘
              producer                              consumer
              (no locks)                            (no locks)

Two ideas make this fast and correct simultaneously:

1. The queue itself is lock-freec_src/cdc/spsc_queue.hpp, ARM64-tuned, ~210 M ops/sec. The producer holds write_idx, the consumer holds read_idx; each lives on its own cache line so the two cores don't ping-pong false-sharing traffic. 2. Only the *consumer's* simulation state ever rolls back — never the queue. When a message arrives with a timestamp earlier than the consumer's local clock, the consumer restores the most recent snapshot, replays from that point with the late message, and continues popping. The producer is unaffected — and crucially, the queue's read_idx / write_idx keep advancing monotonically.

Why "no locks" is safe — the SPSC invariant

A general MPMC (multi-producer, multi-consumer) queue needs mutexes or compare-and-swap loops. An SPSC queue (one producer thread, one consumer thread) doesn't, because:

release-store is enough to publish a new message.

release-store is enough to publish that a slot is free.

acquire-load — that pairs with the release on the other side and gives a happens-before edge through the C++20 memory model.

Sparkle/Verification/CDCProps.lean has 12 formal proofs covering exactly these properties — spsc_no_overflow, spsc_no_underflow, push_preserves_nonempty, rollback_guarantee, consume_preserves_write_idx, etc. No sorrys. The Lean proofs model the abstract state machine (writeIdx : Nat, readIdx : Nat, monotone advancement); the C++ implementation matches that model bit-for-bit, modulo the power-of-two ring-buffer wrap.

Why rollback is bounded

Time-warping looks dangerous — what if a thread keeps rolling back forever? Two structural reasons it doesn't:

message's timestamp.** After rollback, the consumer's clock matches what the producer thought was already published, so the next pop can only land at or after that point.

the consumer's simulation state rewinds, the inter-thread pointers don't, so the producer's view of "how full is the queue" stays monotone — the producer never re-sends old data.

Combined, the consumer's local_time rises monotonically over the long run; rollback events shorten individual cycles but can't stop forward progress. In the multi-clock test (lake exe cdc-multi-clock-test) a 100 MHz / 50 MHz pair runs ~75 K messages with 0 rollbacks because the producer's timestamps are emitted in order and the consumer is the slower side — there's nothing to invert. Rollbacks only trigger when domains run truly concurrently and a transient schedules order them out of sequence.

Where the trade-offs are

StrengthCost
~10⁸-class queue throughput; no kernel mutex hopsOne queue per direction — N×N mesh for N domains
Simulation state rolls back; queue indices don'tSnapshots cost memory (one full register/RAM image)
Each domain runs full-speed JIT independentlyProducer outpaces consumer → queue-full back-pressure
12 formal proofs (Sparkle.Verification.CDCProps)Proofs model the abstract queue, not the C++ atomics
Single-producer / single-consumer ⇒ atomics sufficeMulti-producer per direction needs a different queue

Two practical guidelines fall out of this:

That naturally keeps timestamps in order and rollbacks at zero (the multi-clock test demonstrates this).

scratchpad), expensive when state is huge (a 16 MiB cache).** For very large designs, place CDC boundaries before the cache, not after.

Where to read more

commented)

10.4 Where to extend Sparkle

that builds out of existing Signal.map / Signal.register / Signal.loop. No backend change required.

backend consumes the Signal IR (Sparkle/IR/AST.lean) and emits whatever you want. The Verilog and C++-JIT backends are short; copy one as a template.

following the BitNet / RV32 / YOLOv8 / Drone / Humanoid precedents in IP/.

10.5 Reference card — top macros and commands

FormPurpose
circuit do { ... }Imperative-style HDL with <~ register assignment
Signal.loop (fun s => ...)Dataflow-style HDL with explicit recurrence
Signal.register init nextA single flip-flop
Signal.mux cond a bA multiplexer (use this, never if-then-else)
declare_signal_state FooGenerate a named record I/O type with accessors
Foo.mk (field := sig)Construct the named record
Signal.lit dom vLift a constant when type inference can't reach it
Signal.sample nSample the first n cycles as a list
#synthesizeVerilog nameGenerate SystemVerilog for name
#synthesizeVerilogDesign nameSame, plus design-rule check + report
#verify_eq sigA sigB nCompare two signals over n cycles
#verify_eq_at sigA sigB tCompare at one specific cycle
#verify_eq_git sigA sigB ref nCompare against a committed reference trace
bv_decideTactic: SAT-style proof of bit-vector goals
decide / native_decideTactic: exhaustive enumeration of small types

Full details: docs/reference/SignalDSL_Syntax.md, docs/reference/Verification_Framework.md, docs/reference/Troubleshooting_Synthesis.md.

10.5b Notebook helper commands

Inside JupyterLab the xeus-lean kernel adds a handful of notebook-only commands. The Sparkle tutorial chapters use them sparingly so the same source builds under headless lake build — the Display shim provides parsing-only stubs so each command is syntactically valid even when its real behaviour is absent.

CommandKernel behaviourShim behaviour
#mermaid "flowchart LR; A --> B"renders Mermaid SVGemits a <div class="mermaid"> payload (HTML viewers render it)
#help_xlists every kernel-only commandprints a one-liner pointer
#findDecl "kw1" "kw2" 0 20substring-search the active environmentechoes the search args
#listNs Sparkle.Core.Signaltree of declarations under a namespaceprints (#listNs shim: ...)
#sig Sparkle.Signal.registerpretty-prints a declaration's typesuggests #check
#bash "ls -1 docs/tutorial"runs bash -c and inlines stdout/stderrhard no-op

The Display.* rendering helpers (Display.waveform, Display.boolWave, Display.verilog, Display.blockDiagram, Display.writeWdb, Display.waveformFromWdb) follow the same two-tier pattern: real implementations under xeus-lean, MIME- emitting shims under lake build.

#help_x
#listNs Sparkle.Core.Signal

You can use these commands freely while exploring the codebase in JupyterLab; the chapter Markdown sources stay buildable even when the kernel isn't available.

10.6 Where to go from here

The course covers the foundation. Real Sparkle work happens in the IP repos and the architecture docs:

with Linux boot. Read it as a worked example of every chapter applied at scale.

accelerator, with FPGA results.

development log (current status, benchmarks, completed phases).

integrates with Lean.

Welcome to Sparkle. Have fun.

end Notebooks.Ch10