Rosenblatt: project scaffold for RK3588 NPU on mainline

Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first hardware neural network. This project lights up the RK3588 NPU on mainline Linux so the OSS world finally owns the silicon-side of inference on that chip. Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON. Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF. Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/, fleet/boltzmann.yaml. Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00
commit 24adc74812
8 changed files with 578 additions and 0 deletions
@@ -0,0 +1,127 @@
+# Architecture — CPU + NPU mix for llama.cpp on RK3588
+
+## The split
+
+llama.cpp's compute graph is built around ggml ops. We don't replace
+llama.cpp's whole engine — we register a new **device backend** (in
+ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload
+the ops that are worth the round-trip cost.
+
+### Goes to NPU (heavy, dense, INT8-friendly)
+
+- `MUL_MAT` (matrix-matrix multiply): the workhorse, dominates wall
+  time. The attention products `Q @ K^T` and `attn @ V` and the FFN
+  `up_proj`, `down_proj`, and `gate_proj` matmuls all have this shape.
+- `MUL_MAT_ID` (MoE-style mixture matmul): for when we eventually try
+  a mixture-of-experts model. Later-phase scope, not Phase-1.
+
+### Stays on CPU (small, op-specific, or per-token)
+
+- Embedding lookup (`GET_ROWS`) — random-access gather, NPU has no
+  fast path
+- `RMS_NORM` / `LAYER_NORM` — per-token reduction + element-wise
+- `ROPE` — small, per-head, lots of trig
+- `SOFT_MAX` — small, per-head
+- Activations (`SILU`, `GELU`) — element-wise, cheap on NEON
+- `SCALE`, `ADD`, `MUL` (element-wise) — cheap on NEON
+- Sampling, KV cache update, tokenization — entirely host
+
+### KV cache: open question
+
+Two options:
+1. **CPU-resident:** lives in normal Linux memory; NPU pulls
+   activations from CPU and pushes results back per layer.
+2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers
+   without round-trips. Cheaper per-layer but constrains model size
+   to NPU-accessible memory.
+
+Phase-5 (Plan) picks between them based on the DMA cost measured in
+Phase-3 (Analyze).
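+
+To put a rough number on option 2, here is a back-of-envelope sizing
+sketch. `kv_cache_bytes` is a hypothetical helper, not part of
+llama.cpp, and the dimensions used in the note below (28 layers, 2 KV
+heads, head dimension 128) are assumed values for qwen2.5-1.5B that
+should be checked against the GGUF metadata.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper: bytes needed to hold K and V for every layer.
+ * Dimensions are parameters because the values we plug in are
+ * assumptions, not read from the model file. */
+static uint64_t kv_cache_bytes(uint32_t n_layers, uint32_t n_kv_heads,
+                               uint32_t head_dim, uint32_t n_ctx,
+                               uint32_t bytes_per_elem)
+{
+    /* 2 = one K tensor plus one V tensor per layer */
+    return 2ull * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem;
+}
+```
+
+Under those assumed dimensions an F16 cache costs 28 KiB per token,
+i.e. 112 MiB at a 4096-token context, which is the amount that would
+have to fit in NPU-accessible memory.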
+
+---
+
+## Memory mapping
+
+ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement:
+- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes
+- `free_buffer(buffer)` → release the dmabuf
+- `set_tensor` → CPU → NPU copy; `get_tensor` → NPU → CPU copy
+- `cpy_tensor` → device-internal copy
+
+The dmabuf approach lets us share buffers between CPU producer (e.g.
+embedding lookup output) and NPU consumer (matmul) without an extra
+copy — `mmap` on CPU side, DMA-import on NPU side.
+
+If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it
+doesn't, we go through `/dev/dri/renderD*` with a thin shim.
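+
+A minimal sketch of the alloc / free / set / get contract described
+above, so the copy directions and bounds checks are explicit. All
+`rknpu_buf_*` names are hypothetical, and plain `calloc` stands in for
+the dmabuf allocation so the sketch runs without NPU hardware; the
+real version would go through whatever the accel uAPI provides.
+
+```c
+#include <stdlib.h>
+#include <string.h>
+
+/* Hypothetical stand-in for a dmabuf-backed buffer. */
+typedef struct {
+    unsigned char *base;
+    size_t         size;
+} rknpu_buf;
+
+static rknpu_buf *rknpu_buf_alloc(size_t size)
+{
+    rknpu_buf *b = malloc(sizeof *b);
+    if (!b) return NULL;
+    b->base = calloc(1, size);          /* assumption: host memory, not a dmabuf */
+    if (!b->base) { free(b); return NULL; }
+    b->size = size;
+    return b;
+}
+
+static void rknpu_buf_free(rknpu_buf *b)
+{
+    if (b) { free(b->base); free(b); }
+}
+
+/* set: host -> device. Returns 0 on success, -1 if the range is out of bounds. */
+static int rknpu_buf_set(rknpu_buf *b, size_t off, const void *src, size_t n)
+{
+    if (off > b->size || n > b->size - off) return -1;
+    memcpy(b->base + off, src, n);
+    return 0;
+}
+
+/* get: device -> host. Same bounds rule as set. */
+static int rknpu_buf_get(const rknpu_buf *b, size_t off, void *dst, size_t n)
+{
+    if (off > b->size || n > b->size - off) return -1;
+    memcpy(dst, b->base + off, n);
+    return 0;
+}
+```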
+
+---
+
+## Quantization strategy
+
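+Q4_K_M is the usual GGUF quantization for ~2B models and the one our
+target model uses. It is a mostly 4-bit weight quantization (a few
+tensors are kept at higher precision) with per-block scale + min
+(32-weight blocks inside 256-weight super-blocks), no per-channel
+scale. The NPU expects INT8 (or INT16) tensors with per-channel scale
+factors.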
+
+Two paths:
+1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before
+   the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
+2. **Dequantize once at load time:** unpack the entire weight tensor
+   to INT8 + per-channel scales at model-load. Permanent ~1.6x memory
+   cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight),
+   but no per-layer CPU work.
+
+Phase-1 choice: (2). It is straightforward, makes the NPU path the
+only thing happening at inference time, and is easier to profile. The
+memory cost on 1.5B is ~1.5 GB of INT8 weights vs ~900 MB of Q4_K_M;
+boltzmann has 32 GB, so memory isn't the constraint.
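+
+The INT8 + per-channel-scale target format can be illustrated with
+symmetric quantization of one weight row. This is a sketch under
+assumptions: the input is a row already dequantized to float (the
+Q4_K unpacking itself is not shown), `quantize_row_int8` is a
+hypothetical name, and whether the NPU wants symmetric scales with no
+zero-point still has to be confirmed.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Quantize one output channel (one weight row) to INT8 with a single
+ * symmetric scale. Returns the scale, so that w[i] ~= q[i] * scale. */
+static float quantize_row_int8(const float *w, int8_t *q, size_t n)
+{
+    float amax = 0.0f;
+    for (size_t i = 0; i < n; i++) {
+        float a = w[i] < 0.0f ? -w[i] : w[i];
+        if (a > amax) amax = a;
+    }
+
+    float scale = amax / 127.0f;
+    for (size_t i = 0; i < n; i++) {
+        if (scale == 0.0f) {            /* all-zero row: nothing to divide */
+            q[i] = 0;
+            continue;
+        }
+        float v = w[i] / scale;         /* lies in [-127, 127] by construction */
+        q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);  /* round half away from zero */
+    }
+    return scale;
+}
+```
+
+Per-row storage is then `n` bytes of INT8 plus one float scale, which
+is where the 8 bits/weight figure above comes from.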
+
+Phase-2+: revisit (1) if we go for larger models or if INT8 turns out
+to cost meaningful quality on the small ones.
+
+---
+
+## Backend interface — concrete
+
+Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add:
+
+```
+ggml-rknpu.h       — public API: ggml_backend_rknpu_init() etc.
+ggml-rknpu.c       — backend implementation: device registration, op
+                     dispatch table, memory management
+ggml-rknpu-ops.c   — per-op kernels: matmul tiled to NPU's preferred
+                     shape, INT8 quant pre-pass
+```
+
+These files live in `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree
+initially; if upstream-acceptable, send for review after Phase 8.
+
+---
+
+## Failure handling
+
+llama.cpp's backend abstraction already supports falling back to CPU
+on a per-op basis via `ggml_backend_dev_supports_op()`. We declare
+`MUL_MAT` supported for INT8 / FP16 inputs with the right shape
+constraints, and let the framework route everything else to CPU.
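+
+The routing decision could look roughly like the predicate below. It
+is a sketch only: the `sk_*` enums and struct are hypothetical
+stand-ins so the example compiles without the ggml headers, and the
+tile size of 32 is a placeholder, not a documented hardware
+constraint.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Hypothetical stand-ins for ggml's op and type enums. */
+enum sk_op   { SK_OP_MUL_MAT, SK_OP_ROPE, SK_OP_SOFT_MAX, SK_OP_RMS_NORM };
+enum sk_type { SK_TYPE_I8, SK_TYPE_F16, SK_TYPE_F32, SK_TYPE_Q4_K };
+
+struct sk_op_desc {
+    enum sk_op   op;
+    enum sk_type src_type;   /* element type of both matmul inputs */
+    int64_t      m, n, k;    /* (m x k) @ (k x n) */
+};
+
+/* Assumed tile size: placeholder until the real constraints are known. */
+#define SK_TILE 32
+
+/* True only for matmuls the NPU path would accept; everything else is
+ * left for the framework to route to CPU. */
+static bool sk_rknpu_supports_op(const struct sk_op_desc *d)
+{
+    if (d->op != SK_OP_MUL_MAT) return false;
+    if (d->src_type != SK_TYPE_I8 && d->src_type != SK_TYPE_F16) return false;
+    return d->k % SK_TILE == 0 && d->n % SK_TILE == 0;
+}
+```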
+
+If the NPU driver returns an error mid-inference (timeout, failed DMA
+fence wait, etc.), the strategy is **abort the inference, log,
+return error to caller**. We don't try to silently fall back to CPU
+mid-stream because the state would be corrupted (NPU may have
+partially written to the dmabuf).
+
+---
+
+## Phase-1 milestone
+
+An `npu-probe` userspace binary that:
+1. Opens the NPU device (whatever the mainline path is — likely
+   `/dev/accel/accelN` or `/dev/dri/renderD*`)
+2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
+3. Submits a matmul via the uAPI
+4. Waits, reads back, compares to a CPU reference
+
+This proves the substrate is alive before we touch llama.cpp. If it
+doesn't work, we're back in kernel-driver land, not llama.cpp land.
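+
+The CPU reference for step 4 can be as small as the sketch below. It
+assumes row-major INT8 inputs and that the NPU hands back INT32
+accumulators; if the hardware returns a rescaled or narrower output,
+the comparison would need a tolerance instead of exact equality.
+Function names are hypothetical.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* C[m x n] = A[m x k] @ B[k x n], INT8 inputs, INT32 accumulation. */
+static void matmul_i8_ref(const int8_t *a, const int8_t *b, int32_t *c,
+                          size_t m, size_t k, size_t n)
+{
+    for (size_t i = 0; i < m; i++) {
+        for (size_t j = 0; j < n; j++) {
+            int32_t acc = 0;
+            for (size_t p = 0; p < k; p++)
+                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
+            c[i * n + j] = acc;
+        }
+    }
+}
+
+/* Number of elements where the NPU result differs from the reference;
+ * 0 means the probe passed. */
+static size_t count_mismatches(const int32_t *got, const int32_t *want,
+                               size_t len)
+{
+    size_t bad = 0;
+    for (size_t i = 0; i < len; i++)
+        if (got[i] != want[i]) bad++;
+    return bad;
+}
+```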