Rosenblatt: project scaffold for RK3588 NPU on mainline

Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first hardware neural network. This project lights up the RK3588 NPU on mainline Linux so the OSS world finally owns the silicon-side of inference on that chip. Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON. Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF. Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/, fleet/boltzmann.yaml. Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00
commit 24adc74812
8 changed files with 578 additions and 0 deletions
@@ -0,0 +1,125 @@
+# RK3588 NPU on mainline — status audit (Phase 1)
+
+> **Phase 1 deliverable.** This file is the canonical snapshot of where
+> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
+> the audit date. Re-audit on each kernel-version bump or when a
+> blocker hits.
+
+**Audit date:** _(pending — first Phase-1 task)_
+**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_
+
+---
+
+## Hardware quick-spec
+
+- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS
+  INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
+  reduced throughput). Per-core local memory (~2 MB SRAM each).
+- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
+  domain (`pd_npu`).
+- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local
+  copy referenced in `reference_rk3588_trm` memory.
+- **Vendor docs source-of-truth:**
+  https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
+  quantization scheme).
+
+---
+
+## Upstream work in flight
+
+To audit on Phase 1:
+
+1. **Tomeu Vizoso's rknpu / DRM-accel series**
+   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
+     `linux-rockchip` / `dri-devel`; also search for "rocket", the
+     driver name the series was posted under. Check the
+     `linux-rockchip` mailing-list cutoff date per
+     `reference_linux_rockchip_ml` memory.
+   - State to capture: which patches are merged, which are pending
+     review, and which were dropped.
+2. **`drivers/accel/` mainline state**
+   - Check `drivers/accel/` in current torvalds tree — list of in-tree
+     accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
+     there yet.
+3. **DT bindings**
+   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`
+     — does it exist in mainline?
+4. **Userspace bridge**
+   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
+     Capture repo URL + current state.
+
+Fill the table below during Phase-1 audit:
+
+| Component | Mainline state | URL | Notes |
+|---|---|---|---|
+| `drivers/accel/rknpu/` | _TBD_ | | |
+| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
+| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
+| `npu` node in `rk3588.dtsi` | _TBD_ | | |
+| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
+| userspace uAPI consumer | _TBD_ | | |
+| Tomeu's WIP branch | _TBD_ | | |
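A minimal probe sketch for collecting the first few table entries on boltzmann itself. It assumes the usual Linux conventions: DRM-accel devices show up as `/dev/accel/accel<N>`, and the live device tree is readable under `/proc/device-tree`. The search substrings (`npu`, `rknn`) and the function names are illustrative guesses, not confirmed compatible strings or an existing tool.

```python
import os
import platform
from pathlib import Path

# Assumptions: standard DRM-accel device directory and live device-tree path.
ACCEL_DEV_DIR = Path("/dev/accel")
DT_ROOT = Path("/proc/device-tree")


def find_npu_dt_nodes(dt_root=DT_ROOT, needles=(b"npu", b"rknn")):
    """Return device-tree node dirs whose `compatible` mentions any needle.

    The needles are guesses; adjust them once the real binding is known.
    """
    dt_root = Path(dt_root)
    hits = []
    if not dt_root.is_dir():
        return hits
    for dirpath, _dirnames, filenames in os.walk(dt_root):
        if "compatible" not in filenames:
            continue
        try:
            data = Path(dirpath, "compatible").read_bytes()
        except OSError:
            continue  # unreadable property: skip rather than abort the audit
        if any(needle in data for needle in needles):
            hits.append(dirpath)
    return sorted(hits)


def probe(accel_dir=ACCEL_DEV_DIR, dt_root=DT_ROOT):
    """Collect kernel release, accel device nodes, and NPU-looking DT nodes."""
    accel_dir = Path(accel_dir)
    accel_nodes = []
    if accel_dir.is_dir():
        accel_nodes = sorted(p.name for p in accel_dir.glob("accel*"))
    return {
        "kernel": platform.release(),
        "accel_nodes": accel_nodes,
        "npu_dt_nodes": find_npu_dt_nodes(dt_root),
    }


if __name__ == "__main__":
    for key, value in probe().items():
        print(f"{key}: {value}")
```

An empty `accel_nodes` list only says that no accel device is registered on the running kernel; it does not distinguish "driver not merged" from "driver not enabled", so the lore and tree checks above still have to be done by hand.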
+
+---
+
+## Vendor stack (for spec extraction only — don't lift code)
+
+- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
+  - `librknnrt.so` runtime (closed, vendor binary)
+  - `rknn-convert` ONNX → `.rknn` transcoder
+  - `rknn-llm` (separate `airockchip/rknn-llm` repo) for LLM-specific quant + scheduling
+- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
+  (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match` —
+  read the source, don't link the binary.
+
+Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
+no vendor-kernel-only module. If we need to copy a register-layout table
+from BSP source to get started, that's fine; copying a `.so` is not.
+
+---
+
+## Hardware capability bounds
+
+What the NPU CAN do (Phase-1-relevant):
+- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
+- INT8 tensors with per-channel quantization scales
+- 3D conv (mostly unused in LLM workloads)
+- LayerNorm + activation fusion in some configs
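To make "INT8 tensors with per-channel quantization scales" concrete, here is a small pure-Python sketch of a symmetric per-channel INT8 matrix-vector product. It is a generic illustration only: it does not reproduce the NPU's actual tensor layout or the block-wise Q4_K_M format, and every function name is hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def quantize_per_tensor(values):
    """Symmetric INT8 quantization of an activation vector with a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def int8_matvec(q_weights, w_scales, q_x, x_scale):
    """Integer multiply-accumulate per row, then rescale to float.

    The integer accumulation is the part an INT8 MAC array would run;
    the final per-channel rescale is cheap and could stay on the CPU.
    """
    out = []
    for q_row, w_scale in zip(q_weights, w_scales):
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))  # integer accumulator
        out.append(acc * w_scale * x_scale)
    return out


if __name__ == "__main__":
    W = [[1.0, -2.0], [0.5, 0.25]]
    x = [1.0, 2.0]
    qW, sW = quantize_per_channel(W)
    qx, sx = quantize_per_tensor(x)
    print(int8_matvec(qW, sW, qx, sx))  # close to the float result [-3.0, 1.0]
```

Giving each output row its own scale keeps a row of small weights from being crushed by a large-magnitude row elsewhere in the matrix, which is the usual reason to prefer per-channel over per-tensor scales.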
+
+What the NPU CANNOT do that the LLM needs (so it stays on CPU):
+- FP32 / FP16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
+  with throughput penalty)
+- Sampling (multinomial, top-k, top-p) — pure CPU
+- Sparse attention — no sparse-tile support in the NPU op set
+- Embedding lookup (rare access pattern; gather not in NPU op set)
+- RoPE, softmax, RMSNorm: these run on CPU because the NPU's
+  fixed-function pipeline has no first-class ops for them; the only
+  route would be shoehorning them through a generic matmul
+
+Phase-2 question: which of those CAN move to the NPU later, with
+op-fusion or custom kernels?
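The CPU/NPU split described above can be sketched as a placement rule plus a count of device handoffs, since every CPU-to-NPU transition implies buffer and DMA traffic. The op names, dtype labels, and the simplified layer are hypothetical; this is not ggml's real op enumeration or scheduler.

```python
from enum import Enum


class Device(Enum):
    CPU = "cpu"
    NPU = "npu"


def place(op_kind, dtype):
    """Phase-1 rule: only INT8 matmuls are offloaded; everything else stays on CPU."""
    if op_kind == "matmul" and dtype == "int8":
        return Device.NPU
    return Device.CPU


def count_handoffs(ops):
    """Count CPU<->NPU transitions in an ordered list of (op_kind, dtype) pairs."""
    devices = [place(kind, dtype) for kind, dtype in ops]
    return sum(1 for prev, cur in zip(devices, devices[1:]) if prev != cur)


# Hypothetical, heavily simplified slice of one transformer layer.
layer = [
    ("rmsnorm", "f32"),
    ("matmul", "int8"),   # QKV projection
    ("rope", "f32"),
    ("softmax", "f32"),
    ("matmul", "int8"),   # attention output projection
    ("rmsnorm", "f32"),
    ("matmul", "int8"),   # FFN
]

if __name__ == "__main__":
    print(count_handoffs(layer))  # 5 transitions for this toy layer
```

Counting handoffs this way gives a rough feel for how much an op moved onto the NPU in Phase 2 would save: fusing a CPU op between two NPU matmuls removes two transitions at once.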
+
+---
+
+## Throughput sanity check (theoretical only — not measured)
+
+qwen2.5-1.5B Q4_K_M:
+- ~1.5e9 weights, attention + FFN dominated by matmul
+- Per-token: roughly 1.5e9 MACs × 2 ops/MAC = 3 GOP (forward pass, weight×activation)
+- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
+- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC
+
+Realistic upper bound on first cut: probably 5-15 tok/s. Memory
+bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
+the actual CPU-only number; Phase-8 measures how close we get with NPU
+in the loop. **The goal is "credible CPU+NPU mix that's faster than
+pure CPU," not "saturate the 6 TOPS rating."**
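The arithmetic in this section can be reproduced with a few lines of Python. The compute-bound function uses only the figures stated above (1.5e9 weights, 6 TOPS peak, 2 ops per MAC). The bandwidth-bound function is an added assumption: it models decode as streaming every weight byte once per token, and its inputs must come from measurement, so none are supplied here.

```python
def compute_bound_tok_per_s(n_params, peak_ops_per_s, ops_per_mac=2):
    """Tokens/s if the only limit were peak arithmetic throughput.

    One MAC per weight per token, counted as `ops_per_mac` operations,
    matching how TOPS ratings are normally quoted.
    """
    ops_per_token = n_params * ops_per_mac
    return peak_ops_per_s / ops_per_token


def bandwidth_bound_tok_per_s(weight_bytes, mem_bytes_per_s):
    """Tokens/s if every weight byte must be read from DRAM once per token.

    Assumption: single-stream decode with no weight reuse across tokens.
    """
    return mem_bytes_per_s / weight_bytes


if __name__ == "__main__":
    tok_s = compute_bound_tok_per_s(n_params=1.5e9, peak_ops_per_s=6e12)
    print(f"compute-bound: {tok_s:.0f} tok/s ({1000 / tok_s:.1f} ms/token)")
```

The achievable rate is bounded by the smaller of the two figures, which is why the realistic estimate sits orders of magnitude below the compute-bound number.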
+
+---
+
+## What this audit unblocks
+
+After Phase 1 completes (the table above is filled), Phase 2 can pick:
+- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough
+  along) or write directly against the MMIO regs (if not)
+- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
+- which userspace shim we use (Tomeu's, our own, or a fork of
+  llama.cpp's CUDA-style backend pattern for ggml)