marfrit 24adc74812 Rosenblatt: project scaffold for RK3588 NPU on mainline

Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network.  This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.

Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+).  Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.

Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.

Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.

Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.

2026-05-19 11:57:48 +00:00

4.9 KiB

RK3588 NPU on mainline — status audit (Phase 1)

Phase 1 deliverable. This file is the canonical snapshot of where mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of the audit date. Re-audit on each kernel-version bump or when a blocker hits.

Audit date: (pending; first Phase-1 task)
Kernel under audit: (boltzmann's running kernel, capture with uname -r)
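A minimal sketch for stamping those two header fields when the audit is run on boltzmann itself. The function name `audit_header` is illustrative only; `platform.release()` returns the same string as `uname -r` on Linux.

```python
import datetime
import platform


def audit_header(today=None, kernel=None):
    """Return the two header lines for this audit file.

    Both arguments are optional overrides; by default the current date
    and the running kernel release (same string as `uname -r`) are used.
    """
    today = today or datetime.date.today()
    kernel = kernel or platform.release()
    return [
        f"Audit date: {today.isoformat()}",
        f"Kernel under audit: {kernel}",
    ]


if __name__ == "__main__":
    print("\n".join(audit_header()))
```

Run it on the board being audited, not on the workstation holding the checkout, so that the kernel string matches the kernel the NPU will actually run under.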

Hardware quick-spec

NPU: Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with reduced throughput). Per-core local memory (~2 MB SRAM each).
Bus: AXI interconnect, dedicated DMA engine, integrated power domain (pd_npu).
TRM section: Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local copy referenced in reference_rk3588_trm memory.
Vendor docs source-of-truth: https://github.com/airockchip/rknn-toolkit2 (SDK + ops list + quantization scheme).

Upstream work in flight

To audit in Phase 1:

Tomeu Vizoso's rknpu / DRM-accel series
- Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in linux-rockchip / dri-devel. Check linux-rockchip ml cutoff date per reference_linux_rockchip_ml memory.
- State to capture: which patches merged, which are review-pending, which dropped on the floor.
drivers/accel/ mainline state
- Check drivers/accel/ in current torvalds tree — list of in-tree accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is there yet.
DT bindings
- Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml — does it exist in mainline?
Userspace bridge
- Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI. Capture repo URL + current state.
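One quick runtime check that complements the list above is whether the running kernel exposes any compute-accelerator device node at all. The sketch below assumes the usual DRM-accel layout (`/dev/accel/accelN` nodes and a matching `/sys/class/accel/accelN` entry); the sysfs driver lookup is best-effort and should be treated as an assumption until confirmed on boltzmann. The helper names are hypothetical.

```python
import glob
import os


def find_accel_nodes(dev_root="/dev"):
    """List accelerator device nodes (accel0, accel1, ...) under dev_root/accel."""
    return sorted(glob.glob(os.path.join(dev_root, "accel", "accel*")))


def driver_name(node, sys_root="/sys"):
    """Best-effort name of the driver bound to an accel node, or None.

    Assumes /sys/class/accel/<node>/device/driver is a symlink to the
    driver directory; returns None when that link is absent.
    """
    link = os.path.join(
        sys_root, "class", "accel", os.path.basename(node), "device", "driver"
    )
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    nodes = find_accel_nodes()
    if not nodes:
        print("no accel nodes found")
    for node in nodes:
        print(node, driver_name(node))
```

An empty result only says that no accel driver is bound on this kernel; it does not say whether one exists upstream, which is what the lore and tree checks are for.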

Fill the table below during Phase-1 audit:

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | TBD | | |
| DT bindings `rockchip,rk3588-npu.yaml` | TBD | | |
| `pd_npu` power-domain in `rk3588.dtsi` | TBD | | |
| `npu` node in `rk3588.dtsi` | TBD | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | TBD | | |
| userspace uAPI consumer | TBD | | |
| Tomeu's WIP branch | TBD | | |
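The first rows of the table can be pre-filled mechanically from a kernel source checkout. This is a rough sketch, not a substitute for reading the tree: the paths are the ones named in this file and are exactly what the audit has to verify, the `arch/arm64/boot/dts/rockchip/` location of `rk3588.dtsi` is an assumption, and a plain substring match on "npu" is only a hint that a node or power domain exists.

```python
import os
import sys

# Paths as named in this audit. They are hypotheses to check, not confirmed facts.
CHECKS = {
    "drivers/accel/rknpu/": "drivers/accel/rknpu",
    "DT bindings rockchip,rk3588-npu.yaml":
        "Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml",
}
# Assumed location of the SoC dtsi inside the tree.
DTSI = "arch/arm64/boot/dts/rockchip/rk3588.dtsi"


def audit_tree(root):
    """Return {component: state} for a kernel source tree rooted at `root`."""
    rows = {}
    for name, rel in CHECKS.items():
        present = os.path.exists(os.path.join(root, rel))
        rows[name] = "present" if present else "absent"

    dtsi = os.path.join(root, DTSI)
    if os.path.isfile(dtsi):
        with open(dtsi, encoding="utf-8", errors="replace") as f:
            text = f.read()
        rows["npu mentioned in rk3588.dtsi"] = "yes" if "npu" in text else "no"
    else:
        rows["npu mentioned in rk3588.dtsi"] = "file not found"
    return rows


if __name__ == "__main__":
    tree = sys.argv[1] if len(sys.argv) > 1 else "."
    for component, state in audit_tree(tree).items():
        print(f"{component}: {state}")
```

An "absent" result for `drivers/accel/rknpu/` may simply mean the driver landed under a different directory name, so follow it with a manual listing of `drivers/accel/` before recording the row.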

Vendor stack (for spec extraction only — don't lift code)

RKNPU2 SDK: airockchip/rknn-toolkit2. Provides:
- librknnrt.so runtime (closed, vendor binary)
- rknn-convert ONNX → .rknn transcoder
- rknn-llm (separate airockchip/rknn-llm repo) for LLM-specific quant + scheduling
Kernel module: vendor rockchip-npu (BSP). Not mainline-compatible (uses Rockchip's iommu/dma shim). Spec-extract per feedback_megabitchip_semantic_match — read the source, don't link the binary.

Operational rule: mainline-clean from day 1. No vendor blob runtime, no vendor-kernel-only module. If we need to copy a register-layout table from BSP source to get started, that's fine; copying a .so is not.

Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):

INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
INT8 tensors with per-channel quantization scales
3D conv (mostly unused in LLM workloads)
LayerNorm + activation fusion in some configs

What the NPU CANNOT do that an LLM needs (so it stays on CPU):

fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but with throughput penalty)
Sampling (multinomial, top-k, top-p) — pure CPU
Sparse attention — no sparse-tile support in the NPU op set
Embedding lookup (infrequent access pattern; gather is not in the NPU op set)
RoPE, softmax, RMSNorm — these run on CPU because the NPU's fixed-function pipeline doesn't have these as first-class ops without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or custom kernels.
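The CPU/NPU split described above can be written down as a placement map, which also makes the number of CPU-to-NPU handoffs per layer visible. This is a sketch under assumptions: the op labels are hypothetical strings chosen for this example, not ggml op names, and anything not explicitly listed defaults to the CPU.

```python
from itertools import groupby

# Phase-1 placement as described in this file. Labels are illustrative.
NPU_OPS = {"matmul_int8"}
CPU_OPS = {"dequant", "rope", "softmax", "rmsnorm", "sampling", "embedding"}


def device_for(op):
    """Return "npu" for ops offloaded in Phase 1, "cpu" for everything else."""
    return "npu" if op in NPU_OPS else "cpu"


def partition(ops):
    """Group a linear op sequence into consecutive same-device segments."""
    return [(dev, list(group)) for dev, group in groupby(ops, key=device_for)]


def handoffs(ops):
    """Number of device switches while walking the op sequence."""
    return max(len(partition(ops)) - 1, 0)


if __name__ == "__main__":
    layer = ["rmsnorm", "matmul_int8", "rope", "softmax", "matmul_int8"]
    for dev, segment in partition(layer):
        print(dev, segment)
    print("handoffs:", handoffs(layer))
```

Every handoff implies a buffer sync and a job submission, so moving an op onto the NPU in Phase 2 pays off mainly when it merges two NPU segments into one.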

Throughput sanity check (theoretical only — not measured)

qwen2.5-1.5B Q4_K_M:

~1.5e9 weights, attention + FFN dominated by matmul
Per-token: roughly 1.5e9 MACs = 2 × 1.5e9 = 3 GOP (forward pass, weight×activation; one MAC counts as two ops)
INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
→ ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC

Realistic upper bound on first cut: probably 5-15 tok/s. Memory bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls the actual CPU-only number; Phase-8 measures how close we get with NPU in the loop. The goal is "credible CPU+NPU mix that's faster than pure CPU," not "saturate the 6 TOPS rating."
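The arithmetic above, plus the memory-bandwidth ceiling that the paragraph expects to dominate, can be kept in a small calculator so the numbers are easy to redo once real measurements exist. The compute-bound figure uses only values from this file. The memory-bound function takes bits per weight and sustained bandwidth as inputs; the values passed in the example are placeholders chosen for illustration, not measurements of boltzmann or properties of the Q4_K_M format.

```python
def compute_bound_tok_s(n_weights, peak_ops_per_s):
    """Tokens/s if the forward pass were limited only by peak MAC throughput.

    Each weight contributes one multiply and one add per token, hence 2 ops.
    """
    ops_per_token = 2 * n_weights
    return peak_ops_per_s / ops_per_token


def memory_bound_tok_s(n_weights, bits_per_weight, bytes_per_s):
    """Tokens/s if every weight must be streamed from DRAM once per token."""
    bytes_per_token = n_weights * bits_per_weight / 8
    return bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # Values from this file: 1.5e9 weights, 6 TOPS INT8 peak.
    print(compute_bound_tok_s(1.5e9, 6e12))  # 2000.0

    # PLACEHOLDER inputs: replace with the real average bits/weight of the
    # GGUF file and a measured sustained DRAM bandwidth.
    print(memory_bound_tok_s(1.5e9, bits_per_weight=4.0, bytes_per_s=7.5e9))
```

The realistic expectation is the smaller of the two results, further reduced by DMA setup and CPU-side work, which is why the theoretical 2000 tok/s should not be quoted as a target.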

What this audit unblocks

After Phase 1 completes (the table above is filled), Phase 2 can pick:

whether to drive the NPU via Tomeu's accel uAPI (if it's far enough along) or write directly against the MMIO regs (if not)
which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
which userspace shim we use (Tomeu's, our own, or a ggml backend modeled on llama.cpp's CUDA backend)
