Rosenblatt: project scaffold for RK3588 NPU on mainline
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network. This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.
Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.
Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.
Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.
Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
@@ -0,0 +1,127 @@
# Architecture: CPU + NPU mix for llama.cpp on RK3588

## The split

llama.cpp's compute graph is built around ggml ops. We don't replace
llama.cpp's whole engine; we register a new **device backend** (in
ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload
the ops that are worth the round-trip cost.

### Goes to NPU (heavy, dense, INT8-friendly)

- `MUL_MAT` (matrix-matrix multiply): the workhorse, dominates wall
  time. The attention products `Q @ K^T` and `attn @ V` and the FFN
  projections `up_proj`, `down_proj`, `gate_proj` are all this shape.
- `MUL_MAT_ID` (MoE-style mixture matmul): for when we eventually try a
  mixture-of-experts model. Phase-1+ scope.

### Stays on CPU (small, op-specific, or per-token)

- Embedding lookup (`GET_ROWS`): random-access gather, NPU has no
  fast path
- `RMS_NORM` / `LAYER_NORM`: per-token reduction + element-wise
- `ROPE`: small, per-head, lots of trig
- `SOFT_MAX`: small, per-head
- Activations (`SILU`, `GELU`): element-wise, cheap on NEON
- `SCALE`, `ADD`, `MUL` (element-wise): cheap on NEON
- Sampling, KV cache update, tokenization: entirely host

### KV cache: open question

Two options:
1. **CPU-resident:** lives in normal Linux memory; NPU pulls
   activations from CPU and pushes results back per layer.
2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers
   without round-trips. Cheaper per-layer but constrains model size
   to NPU-accessible memory.

Phase 5 (Plan) picks based on Phase-3 (Analyze) findings on the DMA
cost.
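
To put a rough size on the cache being placed, here is a back-of-envelope sketch. The layer count, KV-head count, head dimension and FP16 element size are assumptions (believed typical for qwen2.5-1.5B, not stated in this document) and should be checked against the GGUF metadata before anyone relies on them.

```python
# Back-of-envelope KV cache size. All model dimensions below are
# assumptions; verify them against the GGUF metadata.
N_LAYERS = 28        # assumed transformer block count
N_KV_HEADS = 2       # assumed grouped-query KV heads
HEAD_DIM = 128       # assumed per-head dimension
BYTES_PER_ELEM = 2   # assumed FP16 cache entries


def kv_bytes_per_token(n_layers=N_LAYERS, n_kv_heads=N_KV_HEADS,
                       head_dim=HEAD_DIM, elem_bytes=BYTES_PER_ELEM):
    """Bytes of K plus V appended to the cache for one token."""
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes


def kv_bytes_total(context_len, **dims):
    """Total cache size for a full context window."""
    return context_len * kv_bytes_per_token(**dims)


if __name__ == "__main__":
    for ctx in (2048, 4096, 32768):
        mib = kv_bytes_total(ctx) / (1024 * 1024)
        print(f"context {ctx:>6}: {mib:8.1f} MiB")
```

If those dimensions hold, the cache is small next to the weights, so the question Phase 3 has to answer is less about capacity and more about the per-layer transfer cost.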

---

## Memory mapping

ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement:
- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes
- `free_buffer(buffer)` → release dmabuf
- `set_tensor` / `get_tensor` → CPU ↔ NPU memcpy (`set_tensor` copies
  host to device, `get_tensor` copies device to host)
- `cpy_tensor` → device-internal copy
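
The copy directions of these hooks can be modelled in a few lines. This is a toy sketch in Python, not the ggml C interface: the class name `BufferSketch` and its method signatures are hypothetical, and an anonymous `mmap` stands in for the dmabuf so the example runs on any machine.

```python
import mmap


class BufferSketch:
    """Toy model of the buffer hooks listed above (hypothetical names)."""

    def __init__(self, size):
        # alloc_buffer(size): reserve `size` bytes of stand-in device memory.
        self.mem = mmap.mmap(-1, size)
        self.size = size

    def set_tensor(self, offset, data):
        # Host -> device: copy caller bytes into the buffer.
        self._check(offset, len(data))
        self.mem[offset:offset + len(data)] = data

    def get_tensor(self, offset, nbytes):
        # Device -> host: copy bytes back out to the caller.
        self._check(offset, nbytes)
        return bytes(self.mem[offset:offset + nbytes])

    def cpy_tensor(self, src, dst, nbytes):
        # Device-internal copy: the data never leaves the buffer.
        self._check(src, nbytes)
        self._check(dst, nbytes)
        self.mem.move(dst, src, nbytes)

    def free(self):
        # free_buffer(buffer): release the mapping.
        self.mem.close()

    def _check(self, offset, nbytes):
        if offset < 0 or nbytes < 0 or offset + nbytes > self.size:
            raise ValueError("access outside buffer")
```

A real backend would replace the `mmap` with a dmabuf handle from whichever uAPI the audit settles on; the bounds check is the part worth keeping regardless.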

The dmabuf approach lets us share buffers between CPU producer (e.g.
embedding lookup output) and NPU consumer (matmul) without an extra
copy: `mmap` on CPU side, DMA-import on NPU side.

If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it
doesn't, we go through `/dev/dri/renderD*` with a thin shim.

---

## Quantization strategy

Q4_K_M is the common default quantization for ~2B GGUF models in
llama.cpp. It is a 4-bit weight quantization with per-group scale + min,
no per-channel scale. The NPU expects INT8 (or INT16) tensors with
per-channel scale factors.

Two paths:
1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before
   the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
2. **Dequantize once at load time:** unpack the entire weight tensor
   to INT8 + per-channel scales at model-load. Permanent ~1.6x memory
   cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight),
   but no per-layer CPU work.
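
Both paths end in the same step: turning float weights into INT8 plus one scale per output channel. The sketch below shows one common way to do that, symmetric quantization with the scale taken from each row's largest magnitude. The symmetric scheme is an assumption here (the NPU's actual requirements come from the vendor docs), and unpacking Q4_K_M blocks to floats is left out.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel (row).

    `weights` is a list of rows of floats, assumed already unpacked from
    the GGUF format. Returns (int8_rows, scales) with w ~= q * scale.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max((abs(w) for w in row), default=0.0)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT8 values back to floats, for error checks."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Round-tripping a tensor through `quantize_per_channel` and `dequantize` gives a per-element error of at most half a scale step, which is a cheap first check on whether INT8 costs noticeable quality.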

Phase-1 choice: (2). It is straightforward, makes the NPU path the only
thing happening at inference time, and is easier to profile. The memory
cost on 1.5B is ~1.5 GB INT8 weights vs ~900 MB Q4_K_M; boltzmann has
32 GB, so this isn't the constraint.
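
The ~1.5 GB and ~900 MB figures follow directly from the bits-per-weight numbers quoted earlier. A minimal sketch of that arithmetic, using the document's round figures (1.5e9 weights, ~5 effective bits for Q4_K_M) rather than measured file sizes:

```python
def weight_bytes(n_weights, bits_per_weight):
    """Storage for the weights alone, ignoring scales and metadata."""
    return n_weights * bits_per_weight / 8


N_WEIGHTS = 1.5e9                          # round figure from the text
int8_bytes = weight_bytes(N_WEIGHTS, 8)    # INT8: 8 bits per weight
q4km_bytes = weight_bytes(N_WEIGHTS, 5)    # Q4_K_M: ~5 bits effective
ratio = int8_bytes / q4km_bytes

print(f"INT8   : {int8_bytes / 1e9:.2f} GB")
print(f"Q4_K_M : {q4km_bytes / 1e9:.2f} GB")
print(f"ratio  : {ratio:.1f}x")
```

Per-channel scales add a few bytes per output row on top of the INT8 total, which is negligible at this size.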

Phase-2+: revisit (1) if we go for larger models or if INT8 turns out
to cause meaningful quality loss on the small ones.

---

## Backend interface: concrete

Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add:

```
ggml-rknpu.h      public API: ggml_backend_rknpu_init() etc.
ggml-rknpu.c      backend implementation: device registration, op
                  dispatch table, memory management
ggml-rknpu-ops.c  per-op kernels: matmul tiled to NPU's preferred
                  shape, INT8 quant pre-pass
```

These live in `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree initially; if
upstream-acceptable, send for review after Phase 8.

---

## Failure handling

ggml's backend abstraction (as used by llama.cpp) already supports
falling back to CPU on a per-op basis via `ggml_backend_dev_supports_op()`.
We declare `MUL_MAT` supported for INT8 / FP16 inputs with the right
shape constraints, and let the framework route everything else to CPU.
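
The routing rule is small enough to model directly. The sketch below is a Python stand-in for the decision a `supports_op` callback would make, not ggml code: the `Op` type, the `route` helper and the 32-element alignment requirement are all hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass

NPU_DTYPES = {"int8", "fp16"}   # input types the text proposes to accept
TILE = 32                       # hypothetical alignment requirement


@dataclass(frozen=True)
class Op:
    name: str     # ggml op name, e.g. "MUL_MAT"
    dtype: str    # input element type
    shape: tuple  # (m, n, k) for a matmul, unused otherwise


def npu_supports_op(op):
    """Stand-in for the backend's supports_op decision."""
    if op.name != "MUL_MAT" or op.dtype not in NPU_DTYPES:
        return False
    # Every matmul dimension must be a positive multiple of the tile size.
    return all(dim > 0 and dim % TILE == 0 for dim in op.shape)


def route(graph):
    """Assign each op to 'npu' or 'cpu'; unsupported ops stay on CPU."""
    return [(op.name, "npu" if npu_supports_op(op) else "cpu") for op in graph]
```

Keeping the predicate strict means a surprising shape degrades to a slower CPU matmul instead of a driver error, which fits the abort-on-error policy for anything that does reach the NPU.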

If the NPU driver returns an error mid-inference (timeout, DMA
fence wait fail, etc.), the strategy is **abort the inference, log,
return error to caller**. We don't try to silently fall back to CPU
mid-stream because the state would be corrupted (NPU may have
partially written to the dmabuf).

---

## Phase-1 milestone

An `npu-probe` userspace binary that:
1. Opens the NPU device (whatever the mainline path is, likely
   `/dev/accel/accelN` or `/dev/dri/renderD*`)
2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
3. Submits a matmul via the uAPI
4. Waits, reads back, compares to a CPU reference

This proves the substrate is alive before we touch llama.cpp. If it
doesn't work, we're back in kernel-driver land, not llama.cpp land.
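
The CPU-reference half of step 4 does not depend on the uAPI, so it can be written down now. A minimal sketch in Python (the probe itself would more likely be C); the helper names are hypothetical and the NPU readback is a placeholder.

```python
import random


def matmul_int8_reference(a, b):
    """C = A @ B with INT8 inputs and exact integer accumulation."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def random_int8_matrix(rows, cols, rng):
    """Matrix of uniformly random values in the INT8 range."""
    return [[rng.randint(-128, 127) for _ in range(cols)] for _ in range(rows)]


def first_mismatch(expected, actual):
    """Return (row, col) of the first differing element, or None.

    Assumes both matrices have the same shape.
    """
    for i, (erow, arow) in enumerate(zip(expected, actual)):
        for j, (e, v) in enumerate(zip(erow, arow)):
            if e != v:
                return (i, j)
    return None


if __name__ == "__main__":
    rng = random.Random(0)            # fixed seed so failures reproduce
    a = random_int8_matrix(64, 64, rng)
    b = random_int8_matrix(64, 64, rng)
    expected = matmul_int8_reference(a, b)
    npu_result = expected             # placeholder: replace with NPU readback
    print("mismatch at:", first_mismatch(expected, npu_result))
```

At 64x64 the largest possible accumulator magnitude is 64 × 128 × 128 = 1,048,576, well inside INT32, so an exact comparison is appropriate; whether the NPU returns raw INT32 accumulators or rescaled INT8 is something the probe has to establish.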
@@ -0,0 +1,125 @@
# RK3588 NPU on mainline: status audit (Phase 1)

> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** _(pending, first Phase-1 task)_
**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_

---

## Hardware quick-spec

- **NPU:** Rockchip "RKNPU2", three independent NPU cores, each ~2 TOPS
  INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
  reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
  domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0; local
  copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:**
  https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
  quantization scheme).

---

## Upstream work in flight

To audit on Phase 1:

1. **Tomeu Vizoso's rknpu / DRM-accel series**
   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
     `linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff
     date per `reference_linux_rockchip_ml` memory.
   - State to capture: which patches merged, which are review-pending,
     which dropped on the floor.
2. **`drivers/accel/` mainline state**
   - Check `drivers/accel/` in current torvalds tree: list of in-tree
     accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
     there yet.
3. **DT bindings**
   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`:
     does it exist in mainline?
4. **Userspace bridge**
   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
     Capture repo URL + current state.

Fill the table below during Phase-1 audit:

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | _TBD_ | | |
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
| userspace uAPI consumer | _TBD_ | | |
| Tomeu's WIP branch | _TBD_ | | |
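
A few of these facts can be read straight off the running system. The snippet below is a small sketch for recording them next to the table; the function name is hypothetical, and it only looks at the two device-node locations this project already expects, so empty lists mean "no node found", not "no driver exists".

```python
import glob
import platform


def capture_audit_facts():
    """Collect facts about the running system for the audit table."""
    return {
        # Same string `uname -r` prints on Linux.
        "kernel_release": platform.release(),
        # Compute-accelerator nodes, present if an accel driver is bound.
        "accel_nodes": sorted(glob.glob("/dev/accel/accel*")),
        # DRM render nodes (GPU, and possibly an NPU shim).
        "render_nodes": sorted(glob.glob("/dev/dri/renderD*")),
    }


if __name__ == "__main__":
    for key, value in capture_audit_facts().items():
        print(f"{key}: {value}")
```

Running it on boltzmann before and after each kernel bump gives a quick diff of what changed on the device-node side.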

---

## Vendor stack (for spec extraction only, don't lift code)

- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - `rknn-llm` for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
  (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match`:
  read the source, don't link the binary.

Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
no vendor-kernel-only module. If we need to copy a register-layout table
from BSP source to get started, that's fine; copying a `.so` is not.

---

## Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):
- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs

What the NPU CANNOT do that an LLM needs (so stays on CPU):
- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
  with throughput penalty)
- Sampling (multinomial, top-k, top-p): pure CPU
- Sparse attention: no sparse-tile support in the NPU op set
- Embedding lookup (random access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm: these run on CPU because the NPU's
  fixed-function pipeline doesn't have these as first-class ops
  without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or
custom kernels.

---

## Throughput sanity check (theoretical only, not measured)

qwen2.5-1.5B Q4_K_M:
- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 1.5e9 MACs (one per weight), i.e. 2 × 1.5e9 = 3 GOP
  for the forward pass (weight×activation)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC
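
The same arithmetic as a tiny script, so the bound can be recomputed when the model or the TOPS figure changes. It counts one multiply-accumulate per weight per token and treats a MAC as two operations, which is how TOPS ratings are conventionally quoted; the function name is hypothetical.

```python
def compute_bound_tokens_per_second(n_weights, peak_ops_per_second):
    """Ceiling on tok/s if the NPU ran at its peak rating with no overhead."""
    ops_per_token = 2 * n_weights          # one MAC (= 2 ops) per weight
    seconds_per_token = ops_per_token / peak_ops_per_second
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    tps = compute_bound_tokens_per_second(n_weights=1.5e9,
                                          peak_ops_per_second=6e12)
    print(f"compute-bound ceiling: {tps:.0f} tok/s")
```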

Realistic upper bound on first cut: probably 5-15 tok/s. Memory
bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
the actual CPU-only number; Phase-8 measures how close we get with NPU
in the loop. **The goal is "credible CPU+NPU mix that's faster than
pure CPU," not "saturate the 6 TOPS rating."**
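
One way to see why memory bandwidth dominates: during single-token decode every weight is read roughly once per token, so tok/s cannot exceed sustained bandwidth divided by the weight footprint. The sketch below applies that to the ~1.5 GB INT8 footprint from the architecture notes. The bandwidth values are hypothetical placeholders, not measurements of boltzmann.

```python
def bandwidth_bound_tokens_per_second(weight_bytes, bytes_per_second):
    """Ceiling on tok/s when all weights are streamed once per token."""
    return bytes_per_second / weight_bytes


INT8_WEIGHT_BYTES = 1.5e9   # ~1.5 GB INT8 weights

if __name__ == "__main__":
    for gb_per_s in (10, 20, 30):   # hypothetical sustained bandwidths
        tps = bandwidth_bound_tokens_per_second(INT8_WEIGHT_BYTES,
                                                gb_per_s * 1e9)
        print(f"{gb_per_s:>3} GB/s sustained -> at most {tps:5.1f} tok/s")
```

Bandwidths in that range give ceilings in the high single digits to around 20 tok/s, the same order as the 5-15 tok/s guess, which is why a measured bandwidth figure is worth capturing alongside the Phase-4 baseline.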

---

## What this audit unblocks

After Phase 1 completes (the table above is filled), Phase 2 can pick:
- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough
  along) or write directly against the MMIO regs (if not)
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork
  llama.cpp's CUDA-style backend pattern for ggml)