Rosenblatt: project scaffold for RK3588 NPU on mainline

Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first hardware neural network. This project lights up the RK3588 NPU on mainline Linux so the OSS world finally owns the silicon-side of inference on that chip. Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON. Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF. Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/, fleet/boltzmann.yaml. Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00
commit 24adc74812
8 changed files with 578 additions and 0 deletions
@@ -0,0 +1,107 @@
+# Rosenblatt
+
+**Codename:** Frank Rosenblatt built the Mark I Perceptron in 1958 — the first
+hardware neural network (400 photocells, stepper-motor-tunable analog weights).
+This project lights up the RK3588 NPU on mainline Linux, so the OSS world
+finally owns the silicon-side of inference on that chip.
+
+**Scope (Phase 1):** small LLM running CPU + NPU mix on `boltzmann` (Rock 5
+ITX+, RK3588, 32 GB DDR4). Backend: `llama.cpp` with a new `rknpu` device that
+offloads the heavy GEMM (matmul in attention + FFN) to the NPU's INT8 path
+while leaving dequant / RoPE / softmax / sampling / embedding lookup on the
+A76 NEON cores.
+
+**Target model (Phase 1):**
+`qwen2.5-1.5B-instruct` Q4_K_M GGUF. Fits in NPU's accessible memory
+budget, has chat tuning, public license. Stretch: `qwen2.5-3B`,
+`gemma3-2B`.
+
+**Out of scope (Phase 1, capture separately if pursued):**
+- Vision helper (object detection / OCR / face-blur) — different op mix,
+  re-scope after Phase-1 numbers
+- RKNPU vendor SDK adoption — we want mainline-clean, not vendor-blob
+- Other Rockchip NPUs (RK3576 has the same NPU IP block — should port for
+  free once the RK3588 path lands, but defer until Phase-1 closes)
+
+**Non-goal: parity with the rknn-llm vendor stack on day 1.** The vendor has
+hand-tuned tensor layouts + quantization; we'll be slower at first. Goal
+is *credible* — defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the
+NPU actually doing the bulk of the GEMM work. The number itself isn't the
+point; the open path to it is.
+
+---
+
+## Phases (9 + 1 loop)
+
+| # | Phase | Deliverable |
+|---|---|---|
+| 1 | **Substrate** | Audit mainline NPU driver state (Tomeu Vizoso's rknpu / DRM-accel series); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. |
+| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. |
+| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's rknpu uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code — extract the spec. |
+| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. |
+| 5 | Plan | rknpu backend interface design — where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. |
+| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. |
+| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against a CPU INT8 reference; within a stated tolerance of the Q4_K dequant + fp32 matmul path. |
+| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. |
+| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. |
+| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). |
+
+Per `feedback_dev_process.md`: rewind to Phase 1 on blocker, Phase 4 on
+direction change, Phase 0 on scope change.
+
+---
+
+## Repo layout
+
+```
+rosenblatt/
+├── README.md              this file
+├── TODO.md                rolling punch-list
+├── docs/
+│   ├── npu-mainline-status.md    Phase-1 audit
+│   ├── architecture.md           CPU+NPU split, ggml backend shape
+│   └── phases.md                 per-phase log (analogous to ~/src/bin/phases/)
+├── kernel/                       mainline-bound patches (DT bindings, rknpu driver tweaks)
+├── userspace/
+│   ├── npu-probe/                smallest-possible "open device + run trivial matmul" sanity
+│   └── llm-runtime/              llama.cpp fork with rknpu backend
+├── fleet/
+│   └── boltzmann.yaml            host manifest (kernel + NPU driver pin, baseline measurement)
+└── benchmarks/                   reproducible bench scripts + recorded results (JSON + plots)
+```
+
+---
+
+## Host
+
+Primary: **boltzmann** (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs).
+- Already runs mainline ~v7.0 with most peripheral drivers working.
+- Has the Quark UEFI / Neutron kernel stack — NPU is the next missing peripheral.
+- Other RK3588 hosts (`ampere` = CoolPi GenBook) come later for port-validation.
+
+Why not `ampere`: laptop, intermittent power, in-use for other campaigns.
+Boltzmann is always-on with 32 GB headroom — right substrate for kernel
+hacking with serial-console fallback (when [Quark](https://git.reauktion.de/marfrit/quark) exposes one).
+
+---
+
+## Codename rationale
+
+Rosenblatt's Mark I was custom analog hardware doing fixed-function matmul-
+adjacent work (weighted-sum + threshold), with weights tunable per slot via
+mechanical control. The RK3588 NPU is fixed-function INT8 matmul/conv hardware
+with weights loaded per inference. Same shape, 68 years later, with the same
+"how do we drive this thing from a general-purpose computer?" problem. The
+1958 paper's answer was: build a control panel. The 2026 answer is: a DRM
+accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA.
+We're writing the second half.
+
+---
+
+## Status
+
+| Phase | State | Date |
+|---|---|---|
+| 0 — bootstrap | done | 2026-05-19 |
+| 1 — substrate audit | open | |
+| 2..10 | pending | |
@@ -0,0 +1,75 @@
+# TODO — Rosenblatt
+
+Rolling punch-list. Older items at bottom (move done → DONE.md when noisy).
+
+---
+
+## Phase 1 — substrate audit
+
+- [ ] On boltzmann: `uname -r` → record in `fleet/boltzmann.yaml:kernel.running_version`
+- [ ] `find / -path '*accel*' -name '*.ko*' 2>/dev/null` (is the accel framework built as a module? `*.ko*` also matches compressed `.ko.zst` / `.ko.xz`)
+- [ ] `ls /dev/accel/ /dev/dri/` — what's exposed?
+- [ ] `lsmod | grep -iE 'rknpu|accel'` — what's loaded?
+- [ ] `dmesg | grep -iE 'rknpu|npu|accel'` since boot — driver bringup log
+- [ ] Tomeu's rknpu series — find on lore.kernel.org/dri-devel, capture latest
+      patch-set version + state (merged / in-review / dropped) → fill table in
+      `docs/npu-mainline-status.md`
+- [ ] Check `drivers/accel/` in current torvalds tree — list in-tree
+      accelerators, confirm rknpu's mainline state
+- [ ] Check DT bindings: `Documentation/devicetree/bindings/npu/rockchip,*.yaml`
+- [ ] Inspect `arch/arm64/boot/dts/rockchip/rk3588.dtsi` for `npu` node
+- [ ] If a userspace shim exists (rkneural?), capture repo URL + try
+      hello-world against the running kernel
+- [ ] Spec-extract from BSP vendor `rockchip-npu` source — register map,
+      DMA descriptor format, irq handling. No code lift; spec only.
+
+Phase exit criteria: `docs/npu-mainline-status.md` table fully populated;
+clear answer to "do we drive via accel uAPI or write our own MMIO driver."
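The device-node check from the list above can also be scripted, so the result lands in the audit notes verbatim. A minimal sketch in Python, assuming nothing about which nodes exist on boltzmann: `list_device_nodes` is a hypothetical helper name, and a missing or unreadable directory is reported as an empty list rather than as an error.

```python
import os
from typing import Dict, Iterable, List


def list_device_nodes(
    dirs: Iterable[str] = ("/dev/accel", "/dev/dri"),
) -> Dict[str, List[str]]:
    """Return the sorted entries of each directory; missing directories map to []."""
    found: Dict[str, List[str]] = {}
    for d in dirs:
        try:
            found[d] = sorted(os.listdir(d))
        except (FileNotFoundError, NotADirectoryError, PermissionError):
            # Directory absent or unreadable: record "nothing exposed" and keep going.
            found[d] = []
    return found


if __name__ == "__main__":
    for directory, entries in list_device_nodes().items():
        print(f"{directory}: {', '.join(entries) if entries else '(none)'}")
```

An empty `/dev/accel` entry here only says that no accel node is exposed; it does not distinguish "driver not built" from "driver built but not probed", which is what the `lsmod` and `dmesg` items are for.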
+
+---
+
+## Phase 2 — formulate
+
+- [ ] List llama.cpp ops by wallclock %, profiling qwen-1.5B Q4_K_M on CPU
+      (use llama.cpp's built-in perf-timer or perf record)
+- [ ] Pick the exact INT8 matmul tile size the NPU prefers (read from BSP source)
+- [ ] Spec out the smallest backend interface: which ops we MUST handle,
+      which the framework falls back to CPU
+- [ ] Write `docs/op-coverage.md`
+
+---
+
+## Phase 3 — analyze
+
+- [ ] RKNPU2 SDK: trace through `librknnrt.so` user-API → kernel ioctl shapes
+      (objdump + strings, no actual reverse-engineering of vendor blob — just
+      the syscall surface)
+- [ ] Tomeu's accel uAPI: read driver source, understand:
+  - submit-job ioctl shape
+  - dmabuf import path
+  - fence-wait mechanism
+  - error reporting
+- [ ] BSP vendor `rockchip-npu` source: register layout, DMA descriptor
+      struct, irq handling sequence
+
+---
+
+## Phase 4 — baseline
+
+- [ ] Build vanilla llama.cpp on boltzmann (upstream `master` branch)
+- [ ] Pull qwen2.5-1.5b-instruct Q4_K_M GGUF
+- [ ] `llama-bench -m qwen2.5-1.5b-instruct-q4_k_m.gguf -p 512 -n 128` × 3 runs
+- [ ] Capture JSON to `benchmarks/$(date +%F)_boltzmann_qwen1.5b_cpu_baseline.json`
+- [ ] Record into `fleet/boltzmann.yaml:baseline_measurement`
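The "three runs, median" bookkeeping above could look roughly like this. This is a sketch only: the record layout and the helper names (`baseline_filename`, `summarize_runs`, `render_record`) are assumptions for illustration, and the tok/s values are expected to be copied in by hand from the three `llama-bench` runs rather than parsed from its output.

```python
import json
import statistics
from datetime import date
from typing import Dict, List


def baseline_filename(day: date) -> str:
    """Mirror the benchmarks/$(date +%F)_... naming used in the checklist."""
    return f"benchmarks/{day.isoformat()}_boltzmann_qwen1.5b_cpu_baseline.json"


def summarize_runs(tok_per_s: List[float]) -> Dict[str, object]:
    """Keep the raw runs next to the median so the record stays auditable."""
    if len(tok_per_s) != 3:
        raise ValueError("expected exactly three runs")
    return {
        "runs_tok_per_s": list(tok_per_s),
        "median_tok_per_s": statistics.median(tok_per_s),
    }


def render_record(day: date, tok_per_s: List[float]) -> str:
    """Serialize one baseline record; the field names are hypothetical."""
    record = {"host": "boltzmann", "file": baseline_filename(day)}
    record.update(summarize_runs(tok_per_s))
    return json.dumps(record, indent=2, sort_keys=True)
```

Storing all three runs rather than only the median makes it possible to spot a thermally throttled outlier later without re-running the bench.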
+
+---
+
+## Cross-phase / standing items
+
+- [ ] Mirror Tomeu's WIP branch into a local clone for kernel hacking
+- [ ] Set up serial console on boltzmann for kernel-panic recovery (Quark
+      umbrella; check current state)
+- [ ] Add `project_rosenblatt.md` to claude-memory once Phase 1 closes (so
+      future sessions don't re-discover the campaign)
+- [ ] Decide repo home: marfrit/rosenblatt on git.reauktion.de (probably yes,
+      after Phase-1 substrate is captured and the README isn't embarrassing)
@@ -0,0 +1,127 @@
+# Architecture — CPU + NPU mix for llama.cpp on RK3588
+
+## The split
+
+llama.cpp's compute graph is built around ggml ops. We don't replace
+llama.cpp's whole engine — we register a new **device backend** (in
+ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload
+the ops that are worth the round-trip cost.
+
+### Goes to NPU (heavy, dense, INT8-friendly)
+
+- `MUL_MAT` (matrix-matrix multiply) — the workhorse, dominates wall
+  time. Attention `Q @ K^T` and `attn @ V`, plus FFN `up_proj`,
+  `down_proj`, `gate_proj`, are all this shape.
+- `MUL_MAT_ID` (MoE-style mixture matmul) — when we eventually try a
+  mixture-of-experts model. Beyond Phase-1 scope.
+
+### Stays on CPU (small, op-specific, or per-token)
+
+- Embedding lookup (`GET_ROWS`) — random-access gather, NPU has no
+  fast path
+- `RMS_NORM` / `LAYER_NORM` — per-token reduction + element-wise
+- `ROPE` — small, per-head, lots of trig
+- `SOFT_MAX` — small, per-head
+- Activations (`SILU`, `GELU`) — element-wise, cheap on NEON
+- `SCALE`, `ADD`, `MUL` (element-wise) — cheap on NEON
+- Sampling, KV cache update, tokenization — entirely host
+
+### KV cache: open question
+
+Two options:
+1. **CPU-resident:** lives in normal Linux memory; NPU pulls
+   activations from CPU and pushes results back per layer.
+2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers
+   without round-trips. Cheaper per-layer but constrains model size
+   to NPU-accessible memory.
+
+Phase 5 (Plan) picks based on Phase-3 (Analyze) findings on the DMA
+cost.
+
+---
+
+## Memory mapping
+
+ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement:
+- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes
+- `free_buffer(buffer)` → release dmabuf
+- `set_tensor` / `get_tensor` → CPU ↔ NPU memcpy (`set_tensor` uploads, `get_tensor` reads back)
+- `cpy_tensor` → device-internal copy
+
+The dmabuf approach lets us share buffers between CPU producer (e.g.
+embedding lookup output) and NPU consumer (matmul) without an extra
+copy — `mmap` on CPU side, DMA-import on NPU side.
+
+If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it
+doesn't, we go through `/dev/dri/renderD*` with a thin shim.
+
+---
+
+## Quantization strategy
+
+We target Q4_K_M, a common GGUF quantization for ~2B models. It is a
+4-bit weight quantization with per-group scale + min, no per-channel
+scale. The NPU expects INT8 (or INT16) tensors with per-channel scale
+factors.
+
+Two paths:
+1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before
+   the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
+2. **Dequantize once at load time:** unpack the entire weight tensor
+   to INT8 + per-channel scales at model-load. Permanent ~1.6x memory
+   cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight),
+   but no per-layer CPU work.
+
+Phase-1 choice: (2) — straightforward, makes the NPU path the only
+thing happening at inference time, easier to profile. The memory cost
+on 1.5B is ~1.5 GB INT8 weights vs ~900 MB Q4_K_M — boltzmann has
+32 GB, this isn't the constraint.
+
+Phase-2+: revisit (1) if we go for larger models or if INT8 turns out
+to be quality-loss-meaningful on the small ones.
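To make "INT8 + per-channel scales" concrete, here is a minimal Python sketch of symmetric per-channel quantization, one scale per output row. It assumes a symmetric scheme with a [-127, 127] range; the scheme the NPU actually expects is a Phase-3 finding and may differ (zero points, different rounding), so treat the function names and the range as illustrative.

```python
from typing import List, Tuple


def quantize_per_channel(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row (output channel) to INT8 with its own scale."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max((abs(w) for w in row), default=0.0)
        # An all-zero row gets scale 1.0 so dequantization stays well-defined.
        scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights: w is roughly q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

With this scheme the round-trip error per weight is at most half a quantization step (`scale / 2`), which gives a first handle on the "quality-loss-meaningful" question before any perplexity run.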
+
+---
+
+## Backend interface — concrete
+
+Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add:
+
+```
+ggml-rknpu.h       — public API: ggml_backend_rknpu_init() etc.
+ggml-rknpu.c       — backend implementation: device registration, op
+                     dispatch table, memory management
+ggml-rknpu-ops.c   — per-op kernels: matmul tiled to NPU's preferred
+                     shape, INT8 quant pre-pass
+```
+
+These live in `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree initially; if
+upstream-acceptable, send for review after Phase 8.
+
+---
+
+## Failure handling
+
+llama.cpp's backend abstraction already supports falling back to CPU
+on a per-op basis via `ggml_backend_dev_supports_op()`. We declare
+`MUL_MAT` supported for INT8 / FP16 inputs with the right shape
+constraints, and let the framework route everything else to CPU.
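The routing decision described above can be modelled in a few lines. This is a Python model of the predicate's logic, not the ggml C API: the op and dtype names are plain strings, and the `tile` divisibility rule is a hypothetical stand-in for whatever shape constraints Phase 2 settles on.

```python
# Input types the rknpu backend would claim, per the paragraph above.
NPU_DTYPES = {"INT8", "FP16"}


def supports_op(op: str, dtype: str, m: int, n: int, k: int, tile: int = 1) -> bool:
    """Return True only for MUL_MAT with a claimed dtype and an acceptable shape."""
    if op != "MUL_MAT":
        return False
    if dtype not in NPU_DTYPES:
        return False
    # Hypothetical shape constraint: every dimension is a positive multiple of `tile`.
    return all(d > 0 and d % tile == 0 for d in (m, n, k))


def route(op: str, dtype: str, m: int, n: int, k: int, tile: int = 1) -> str:
    """Name the backend that would run the op: 'rknpu' if claimed, else 'cpu'."""
    return "rknpu" if supports_op(op, dtype, m, n, k, tile) else "cpu"
```

Keeping the predicate strict is the safer default: an op that is wrongly rejected only costs speed, while an op that is wrongly claimed fails at submit time.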
+
+If the NPU driver returns an error mid-inference (timeout, DMA
+fence wait fail, etc.), the strategy is **abort the inference, log,
+return error to caller**. We don't try to silently fall back to CPU
+mid-stream because the state would be corrupted (NPU may have
+partially written to the dmabuf).
+
+---
+
+## Phase-1 milestone
+
+An `npu-probe` userspace binary that:
+1. Opens the NPU device (whatever the mainline path is — likely
+   `/dev/accel/accelN` or `/dev/dri/renderD*`)
+2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
+3. Submits a matmul via the uAPI
+4. Waits, reads back, compares to a CPU reference
+
+This proves the substrate is alive before we touch llama.cpp. If it
+doesn't work, we're back in kernel-driver land, not llama.cpp land.
@@ -0,0 +1,125 @@
+# RK3588 NPU on mainline — status audit (Phase 1)
+
+> **Phase 1 deliverable.** This file is the canonical snapshot of where
+> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
+> the audit date. Re-audit on each kernel-version bump or when a
+> blocker hits.
+
+**Audit date:** _(pending — first Phase-1 task)_
+**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_
+
+---
+
+## Hardware quick-spec
+
+- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS
+  INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
+  reduced throughput). Per-core local memory (~2 MB SRAM each).
+- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
+  domain (`pd_npu`).
+- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local
+  copy referenced in `reference_rk3588_trm` memory.
+- **Vendor docs source-of-truth:**
+  https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
+  quantization scheme).
+
+---
+
+## Upstream work in flight
+
+To audit in Phase 1:
+
+1. **Tomeu Vizoso's rknpu / DRM-accel series**
+   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
+     `linux-rockchip` / `dri-devel`. Check the `linux-rockchip` mailing-list cutoff
+     date per `reference_linux_rockchip_ml` memory.
+   - State to capture: which patches merged, which are review-pending,
+     which dropped on the floor.
+2. **`drivers/accel/` mainline state**
+   - Check `drivers/accel/` in current torvalds tree — list of in-tree
+     accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
+     there yet.
+3. **DT bindings**
+   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`
+     — does it exist in mainline?
+4. **Userspace bridge**
+   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
+     Capture repo URL + current state.
+
+Fill the table below during Phase-1 audit:
+
+| Component | Mainline state | URL | Notes |
+|---|---|---|---|
+| `drivers/accel/rknpu/` | _TBD_ | | |
+| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
+| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
+| `npu` node in `rk3588.dtsi` | _TBD_ | | |
+| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
+| userspace uAPI consumer | _TBD_ | | |
+| Tomeu's WIP branch | _TBD_ | | |
+
+---
+
+## Vendor stack (for spec extraction only — don't lift code)
+
+- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
+  - `librknnrt.so` runtime (closed, vendor binary)
+  - `rknn-convert` ONNX → `.rknn` transcoder
+  - `rknn-llm`, the companion stack for LLM-specific quant + scheduling
+- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
+  (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match` —
+  read the source, don't link the binary.
+
+Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
+no vendor-kernel-only module. If we need to copy a register-layout table
+from BSP source to get started, that's fine; copying a `.so` is not.
+
+---
+
+## Hardware capability bounds
+
+What the NPU CAN do (Phase-1-relevant):
+- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
+- INT8 tensors with per-channel quantization scales
+- 3D conv (mostly unused in LLM workloads)
+- LayerNorm + activation fusion in some configs
+
+What the NPU CANNOT do that an LLM needs (so it stays on CPU):
+- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
+  with throughput penalty)
+- Sampling (multinomial, top-k, top-p) — pure CPU
+- Sparse attention — no sparse-tile support in the NPU op set
+- Embedding lookup (random access pattern; gather not in NPU op set)
+- RoPE, softmax, RMSNorm — these run on CPU because the NPU's
+  fixed-function pipeline doesn't have these as first-class ops
+  without going through a generic matmul shoehorn
+
+Phase-2 question: which of those CAN move to NPU later, with op-fusion or
+custom kernels.
+
+---
+
+## Throughput sanity check (theoretical only — not measured)
+
+qwen2.5-1.5B Q4_K_M:
+- ~1.5e9 weights, attention + FFN dominated by matmul
+- Per-token: roughly 1.5e9 MAC = 2 × 1.5e9 = 3 GOP (forward pass, weight×activation; 1 MAC counts as 2 ops)
+- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
+- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC
+
+Realistic upper bound on first cut: probably 5-15 tok/s. Memory
+bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
+the actual CPU-only number; Phase-8 measures how close we get with NPU
+in the loop. **The goal is "credible CPU+NPU mix that's faster than
+pure CPU," not "saturate the 6 TOPS rating."**
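The compute-bound ceiling above, plus the bandwidth ceiling that is expected to dominate in practice, can be written out as two one-line formulas. A sketch under stated assumptions: one MAC counts as two ops, every weight is touched once per token, and weights are streamed from DRAM once per token. The bandwidth value is left as a parameter because no measured figure exists yet.

```python
def compute_bound_tok_s(n_weights: float, peak_ops_per_s: float) -> float:
    """Upper bound on tokens/s if only arithmetic throughput mattered."""
    ops_per_token = 2.0 * n_weights  # one multiply-accumulate = 2 ops
    return peak_ops_per_s / ops_per_token


def bandwidth_bound_tok_s(weight_bytes: float, bytes_per_s: float) -> float:
    """Upper bound on tokens/s if every weight byte is read once per token."""
    return bytes_per_s / weight_bytes


if __name__ == "__main__":
    tok_s = compute_bound_tok_s(1.5e9, 6e12)
    # 1.5e9 weights at 6 TOPS: 2000 tok/s, i.e. 0.5 ms/token, as in the list above.
    print(f"compute-bound: {tok_s:.0f} tok/s ({1e3 / tok_s:.1f} ms/token)")
```

Calling `bandwidth_bound_tok_s(1.5e9, measured_bytes_per_s)` with a DRAM bandwidth measured on boltzmann (a number this document does not have yet) gives the more relevant ceiling for INT8 weights at one byte each.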
+
+---
+
+## What this audit unblocks
+
+After Phase 1 completes (the table above is filled), Phase 2 can pick:
+- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough
+  along) or write directly against the MMIO regs (if not)
+- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
+- which userspace shim we use (Tomeu's, write our own, or fork
+  llama.cpp's CUDA-style backend pattern for ggml)
@@ -0,0 +1,55 @@
+# rosenblatt fleet manifest — boltzmann (Rock 5 ITX+, RK3588)
+#
+# Phase-1 audit host. Always-on, 32 GB DDR4, NVMe rootfs. NPU silicon
+# present + accessible via Rockchip-BSP vendor module today; mainline
+# path TBD (see docs/npu-mainline-status.md).
+
+host: boltzmann
+arch: arm64
+soc: rockchip/rk3588
+board: rock-5-itx-plus
+distro: archlinuxarm  # ALARM aarch64; boltzmann is the umbrella RK3588 host
+role: primary-development  # not yet primary-target (laptop targets land later)
+
+hardware:
+  cpu: 4×Cortex-A76 (2.4 GHz) + 4×Cortex-A55 (1.8 GHz)
+  ram: 32 GB DDR4-2666
+  storage: NVMe (rootfs) + microSD (recovery)
+  npu:
+    cores: 3
+    tops_int8_per_core: 2  # ~2 TOPS INT8 per core, 6 TOPS aggregate (theoretical peak)
+    local_sram_per_core_mib: 2
+    power_domain: pd_npu
+
+# Phase-1 audit fills these (pending boltzmann inspection)
+kernel:
+  running_version: TBD  # uname -r snapshot at audit time
+  source: TBD          # mainline torvalds / mmind-rockchip / custom
+  npu_driver: TBD      # vendor rockchip-npu / mainline rknpu / none
+
+userspace:
+  rknn_vendor_runtime_installed: false  # commitment: stay mainline-clean
+  llama_cpp_installed: TBD              # via marfrit-packages or built-from-source
+
+baseline_measurement:
+  pending: true
+  target: |
+    llama.cpp pure-CPU tok/s on qwen2.5-1.5b-instruct-q4_k_m.gguf,
+    3 runs, median wallclock. Use llama-bench from llama.cpp/build/bin.
+  ground_truth_file: benchmarks/2026-XX-XX_boltzmann_qwen1.5b_cpu_baseline.json
+
+bringup_sequence:
+  1: substrate audit (docs/npu-mainline-status.md table filled)
+  2: npu-probe runs successfully (open device → 64×64 INT8 matmul → bit-match CPU ref)
+  3: llama.cpp pure-CPU baseline captured
+  4: rknpu ggml backend skeleton compiles
+  5: first llama.cpp matmul offload working on a single layer
+  6: full forward pass via NPU for one decode step
+  7: tok/s vs baseline measured
+
+backup_host: ampere  # CoolPi GenBook — port-validation target. Phase-2+ scope.
+
+reverse_dependencies:
+  - Quark (boltzmann UEFI) — must stay bootable across kernel-rev experiments
+  - Neutron (boltzmann kernel build) — provides the kernel we tweak for rknpu
+  - Volta (boltzmann umbrella) — Rosenblatt is the third Volta-child after Quark + Neutron
@@ -0,0 +1,24 @@
+# kernel/
+
+Mainline-bound kernel patches: DT bindings, rknpu driver tweaks,
+power-domain wiring for boltzmann's `rk3588-rock-5-itx-plus.dts`.
+
+Empty until Phase 1 audit identifies what's actually missing in mainline.
+
+If Tomeu Vizoso's rknpu series is far enough along to use as-is, this
+directory may stay nearly empty — we'd just carry a small DT-overlay
+patch for boltzmann's board file.
+
+If we end up writing a driver from scratch (worst case), the structure
+will mirror an upstream submission layout:
+```
+0001-dt-bindings-npu-add-rockchip-rk3588-npu.patch
+0002-arm64-dts-rockchip-rk3588-add-npu-node.patch
+0003-arm64-dts-rockchip-rock-5-itx-plus-enable-npu.patch
+0004-accel-rknpu-add-rockchip-rk3588-driver.patch
+...
+```
+
+Tracking-wise these go through `marfrit/kernel-agent` as scope
+`patches/driver/accel/rknpu/`, with `fleet/boltzmann-rosenblatt.yaml`
+as the consuming manifest.
@@ -0,0 +1,32 @@
+# llm-runtime
+
+llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend.
+
+Code lands here starting at Phase 5 (Plan) — too early in Phase 1.
+
+This directory will eventually hold:
+- design notes (`docs/architecture.md` from project root is authoritative)
+- the eventual `ggml-rknpu/` backend source
+- patch series for upstream submission if quality reaches that bar
+
+## Approach
+
+Two paths to consider in Phase 5:
+
+1. **Fork llama.cpp, add backend in tree.** Easier to keep in sync;
+   harder to upstream because llama.cpp may not want a Rockchip-specific
+   backend that depends on a still-WIP mainline driver.
+2. **Out-of-tree backend, load via llama.cpp's plugin API
+   (`-DGGML_BACKEND_DL=ON`).** Cleaner separation; tracks llama.cpp
+   upstream without our diff being in the way. Recommended unless we
+   need to patch core llama.cpp logic.
+
+Decision deferred to Phase 5.
+
+## Model
+
+Phase-1 target: `qwen2.5-1.5b-instruct-q4_k_m.gguf`. Source:
+hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF or built locally with
+llama-quantize.
+
+Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B.
@@ -0,0 +1,33 @@
+# npu-probe
+
+Smallest-possible userspace binary that:
+1. Opens the NPU device (path TBD per Phase-1 audit)
+2. Allocates two INT8 input tensors (64×64) + one output (64×64)
+3. Submits a matmul via the uAPI in use (Tomeu's accel ioctl OR our own
+   shim around vendor MMIO if accel-mainline isn't ready)
+4. Waits for completion (DMA fence or polled completion register)
+5. Reads back the output
+6. Compares to a CPU INT8 matmul reference; reports pass/fail
+
+**Phase-1 deliverable.** Until this works, nothing else in this repo
+can be exercised against real silicon.
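Step 6 needs a CPU reference that is unambiguous about layout and accumulator width. The Python model below is a sketch of that reference only (the probe itself stays in C): it assumes row-major operands, INT8 inputs, and wide integer accumulation, and the function names are illustrative. Whether the NPU returns raw INT32 accumulators or requantized INT8 is a Phase-1 finding, so the comparison may need a requantization step on top.

```python
from typing import List, Optional


def matmul_int8_ref(a: List[int], b: List[int], m: int, n: int, k: int) -> List[int]:
    """Compute C = A @ B for row-major A (m x k) and B (k x n) with INT8 entries."""
    if len(a) != m * k or len(b) != k * n:
        raise ValueError("operand size does not match the given shape")
    if any(v < -128 or v > 127 for v in a) or any(v < -128 or v > 127 for v in b):
        raise ValueError("operand value outside the INT8 range")
    out = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0  # Python ints do not overflow; INT32 suffices for k = 64
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            out[i * n + j] = acc
    return out


def first_mismatch(expected: List[int], actual: List[int]) -> Optional[int]:
    """Return the flat index of the first differing element, or None if identical."""
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    for idx, (e, got) in enumerate(zip(expected, actual)):
        if e != got:
            return idx
    return None
```

Reporting the first mismatching index rather than a bare pass/fail helps separate a layout bug (mismatches from index 0 or 1) from a tail or tiling bug (mismatches only near row or tile boundaries).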
+
+## Build
+
+_(filled when Phase-1 audit picks the uAPI shape — `meson` or `cmake`,
+no autotools)_
+
+## Run
+
+```
+./npu-probe                      # default 64×64 INT8 matmul
+./npu-probe --shape 128,128,128  # M,N,K override
+./npu-probe --device /dev/accel/accel0    # override device path
+./npu-probe --golden golden_64x64.bin     # provide expected output for diff
+```
+
+## Why C, not Python
+
+Direct ioctl + dmabuf + mmap. A Python wrapper layer would obscure the
+exact syscall sequence we need to understand. Once npu-probe works,
+a Python binding for benchmark scripts is fine.