rosenblatt/docs/npu-mainline-status.md
marfrit 24adc74812 Rosenblatt: project scaffold for RK3588 NPU on mainline
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network.  This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.

Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+).  Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.

Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.

Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.

Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00

# RK3588 NPU on mainline — status audit (Phase 1)
> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** _(pending: first Phase-1 task)_
**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_

---
## Hardware quick-spec
- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS
INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local
copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:**
https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
quantization scheme).
---
## Upstream work in flight
To audit in Phase 1:

1. **Tomeu Vizoso's rknpu / DRM-accel series**
   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
     `linux-rockchip` / `dri-devel`. Check the `linux-rockchip` mailing-list
     cutoff date per `reference_linux_rockchip_ml` memory.
   - State to capture: which patches are merged, which are pending review,
     and which were dropped.
2. **`drivers/accel/` mainline state**
   - Check `drivers/accel/` in the current torvalds tree and list the in-tree
     accelerators (habanalabs, ivpu, qaic, etc.). Note whether a Rockchip
     NPU driver is there yet; it may not be named `rknpu`.
3. **DT bindings**
   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`:
     does it exist in mainline?
4. **Userspace bridge**
   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
     Capture repo URL + current state.

Fill the table below during the Phase-1 audit:

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | _TBD_ | | |
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
| userspace uAPI consumer | _TBD_ | | |
| Tomeu's WIP branch | _TBD_ | | |
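
The source-tree half of the checklist above can be scripted against a local kernel checkout. The following is a minimal sketch, not a finished audit tool: `audit_tree`, `list_accel_drivers`, `to_rows`, and the `CHECKS` labels are hypothetical names, and the paths are the ones this document guesses at (plus the standard arm64 Rockchip DTS directory). A "not found" result may mean the upstream name differs, not that the component is absent.

```python
from pathlib import Path

# Paths as guessed in this document; upstream may use different names.
CHECKS = {
    "accel driver": "drivers/accel/rknpu",
    "DT binding": "Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml",
    "SoC dtsi": "arch/arm64/boot/dts/rockchip/rk3588.dtsi",
}


def audit_tree(root):
    """Return {label: bool} telling whether each guessed path exists under root."""
    root = Path(root)
    return {label: (root / rel).exists() for label, rel in CHECKS.items()}


def list_accel_drivers(root):
    """Names of the subdirectories of drivers/accel/ (empty list if missing)."""
    accel = Path(root) / "drivers" / "accel"
    if not accel.is_dir():
        return []
    return sorted(p.name for p in accel.iterdir() if p.is_dir())


def to_rows(results):
    """Format results as rows for the status table (URL and Notes left blank)."""
    return [
        f"| {label} | {'present' if ok else 'not found'} | | |"
        for label, ok in results.items()
    ]
```

Usage would be along the lines of `print("\n".join(to_rows(audit_tree("/path/to/linux"))))`, with the checkout path supplied by whoever runs the audit.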
---
## Vendor stack (for spec extraction only — don't lift code)
- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - `rknn-llm` fork for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
  (uses Rockchip's iommu/dma shim). Spec-extract per
  `feedback_megabitchip_semantic_match`: read the source, don't link the binary.
Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
no vendor-kernel-only module. If we need to copy a register-layout table
from BSP source to get started, that's fine; copying a `.so` is not.
---
## Hardware capability bounds
What the NPU CAN do (Phase-1-relevant):
- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs
What the NPU CANNOT do that an LLM needs (so it stays on the CPU):
- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
with throughput penalty)
- Sampling (multinomial, top-k, top-p) — pure CPU
- Sparse attention — no sparse-tile support in the NPU op set
- Embedding lookup (irregular access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm — these run on CPU because the NPU's
fixed-function pipeline doesn't have these as first-class ops
without going through a generic matmul shoehorn
Phase-2 question: which of those can move to the NPU later, with op-fusion or
custom kernels?
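
The CAN / CANNOT split above amounts to a placement rule: INT8 matmuls go to the NPU and everything else stays on the CPU. A small sketch of that rule, with hypothetical op names and a hypothetical `partition` helper, shows how a sequence of ops breaks into contiguous device segments. Each segment boundary is a CPU↔NPU handoff, and handoffs are where DMA setup cost shows up.

```python
from itertools import groupby

# Only INT8 matmul is offloaded in Phase 1; op names here are illustrative.
NPU_OPS = frozenset({"matmul_int8"})


def device_for(op):
    """Anything not explicitly listed as an NPU op defaults to the CPU."""
    return "npu" if op in NPU_OPS else "cpu"


def partition(ops):
    """Group consecutive ops that run on the same device."""
    return [(dev, list(group)) for dev, group in groupby(ops, key=device_for)]


if __name__ == "__main__":
    ops = ["embedding", "rmsnorm", "matmul_int8", "rope", "softmax",
           "matmul_int8", "matmul_int8", "sampling"]
    for dev, segment in partition(ops):
        print(dev, segment)
```

Defaulting unknown ops to the CPU is the conservative choice: a new op never lands on the NPU by accident.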
---
## Throughput sanity check (theoretical only — not measured)
qwen2.5-1.5B Q4_K_M:
- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 1.5e9 MACs = 2 × 1.5e9 = 3 GOP (forward pass,
  weight×activation; one MAC counts as two ops, matching how TOPS is rated)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit and overhead were zero, ABSOLUTELY UNREALISTIC
Realistic upper bound on first cut: probably 5-15 tok/s. Memory
bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
the actual CPU-only number; Phase-8 measures how close we get with NPU
in the loop. **The goal is "credible CPU+NPU mix that's faster than
pure CPU," not "saturate the 6 TOPS rating."**
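
The compute-bound figure can be reproduced in a few lines. This sketch only restates the arithmetic, counting one multiply and one add per weight per token against the 6 TOPS peak rating; `compute_bound` is a hypothetical helper name, and nothing here is a measurement.

```python
def compute_bound(params, peak_ops_per_s):
    """Return (ms_per_token, tokens_per_s) assuming peak throughput and no overhead."""
    ops_per_token = 2 * params  # one multiply + one add per weight
    seconds = ops_per_token / peak_ops_per_s
    return seconds * 1e3, 1.0 / seconds


if __name__ == "__main__":
    ms, tok_s = compute_bound(params=1.5e9, peak_ops_per_s=6e12)
    print(f"{ms:.2f} ms/token, {tok_s:.0f} tok/s")
```

Swapping in a measured sustained rate for `peak_ops_per_s` during Phase 8 gives a like-for-like comparison against this ceiling.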
---
## What this audit unblocks
After Phase 1 completes (the table above is filled), Phase 2 can pick:
- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough
along) or write directly against the MMIO regs (if not)
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork
llama.cpp's CUDA-style backend pattern for ggml)
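
For the first decision, a quick runtime probe on boltzmann can show whether any accel driver is bound at all. The sketch below assumes the kernel's compute-accelerator subsystem exposes devices as `/dev/accel/accelN`; `accel_nodes` and `suggest_path` are hypothetical helpers. A node proves only that some accel driver is present, so confirming it is the NPU remains a manual step.

```python
from pathlib import Path


def accel_nodes(dev_root="/dev"):
    """List accelN device nodes under <dev_root>/accel (empty list if none)."""
    accel_dir = Path(dev_root) / "accel"
    if not accel_dir.is_dir():
        return []
    return sorted(p.name for p in accel_dir.glob("accel*"))


def suggest_path(nodes):
    """Very coarse hint for the Phase-2 choice; not a substitute for the audit."""
    return "try accel uAPI" if nodes else "fall back to MMIO"


if __name__ == "__main__":
    nodes = accel_nodes()
    print(nodes, "->", suggest_path(nodes))
```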