# RK3588 NPU on mainline — status audit (Phase 1)

> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** _(pending — first Phase-1 task)_

**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_

---

## Hardware quick-spec

- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:** https://github.com/airockchip/rknn-toolkit2 (SDK + ops list + quantization scheme).

---

## Upstream work in flight

To audit on Phase 1:

1. **Tomeu Vizoso's rknpu / DRM-accel series**
   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in `linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff date per `reference_linux_rockchip_ml` memory.
   - State to capture: which patches merged, which are review-pending, which dropped on the floor.
2. **`drivers/accel/` mainline state**
   - Check `drivers/accel/` in current torvalds tree — list of in-tree accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is there yet.
3. **DT bindings**
   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml` — does it exist in mainline?
4. **Userspace bridge**
   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI. Capture repo URL + current state.
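The in-tree checks from items 2 and 3 can be scripted against a local kernel checkout. The following is a minimal sketch, not a finished audit tool: `audit_tree` and `mentions_npu` are hypothetical helper names, the DTS directory is assumed to be the standard arm64 Rockchip location, and the "mentions npu" test is a crude substring match that only flags files worth opening by hand. The driver may also land under a directory name other than `rknpu`, so read the full `accel_drivers` list instead of trusting the `rknpu_in_accel` flag alone.

```python
from pathlib import Path

# Paths named in this audit; whether they exist upstream is what is being checked.
BINDING = "Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml"
DTS_DIR = "arch/arm64/boot/dts/rockchip"  # assumption: standard arm64 Rockchip DTS location


def mentions_npu(path: Path) -> bool:
    """Crude check: the file exists and contains 'npu' (case-insensitive)."""
    return path.is_file() and "npu" in path.read_text(errors="replace").lower()


def audit_tree(tree: Path) -> dict:
    """Collect the yes/no facts the Phase-1 audit needs from a kernel checkout."""
    accel = tree / "drivers" / "accel"
    drivers = sorted(p.name for p in accel.iterdir() if p.is_dir()) if accel.is_dir() else []
    return {
        "accel_drivers": drivers,
        "rknpu_in_accel": "rknpu" in drivers,
        "binding_exists": (tree / BINDING).is_file(),
        "rk3588_dtsi_mentions_npu": mentions_npu(tree / DTS_DIR / "rk3588.dtsi"),
        "board_dts_mentions_npu": mentions_npu(tree / DTS_DIR / "rk3588-rock-5-itx-plus.dts"),
    }


if __name__ == "__main__":
    import sys

    # Usage: python audit_tree.py /path/to/linux  (defaults to the current directory)
    tree = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for key, value in audit_tree(tree).items():
        print(f"{key}: {value}")
```

Run it once against the torvalds tree and once against any WIP branch under consideration; the difference between the two outputs is a starting point for the state columns, which still need a manual read of the matching files.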
Fill the table below during Phase-1 audit:

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | _TBD_ | | |
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
| userspace uAPI consumer | _TBD_ | | |
| Tomeu's WIP branch | _TBD_ | | |

---

## Vendor stack (for spec extraction only — don't lift code)

- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - `rknn-llm` fork for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match` — read the source, don't link the binary.

Operational rule: **mainline-clean from day 1.** No vendor blob runtime, no vendor-kernel-only module. Copying a register-layout table from BSP source to get started is fine; copying a `.so` is not.
---

## Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):

- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs

What the NPU CANNOT do that an LLM needs (so it stays on CPU):

- FP32 / FP16 mixed-precision (NPU is INT8/INT16-first; some FP16 but with throughput penalty)
- Sampling (multinomial, top-k, top-p) — pure CPU
- Sparse attention — no sparse-tile support in the NPU op set
- Embedding lookup (rare access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm — these run on CPU because the NPU's fixed-function pipeline doesn't have these as first-class ops without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or custom kernels.

---

## Throughput sanity check (theoretical only — not measured)

qwen2.5-1.5B Q4_K_M:

- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 2 × 1.5e9 = 3 GOP (1.5 GMAC at 2 ops per MAC; forward pass, weight×activation)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit + zero overhead, which is ABSOLUTELY UNREALISTIC

Realistic upper bound on first cut: probably 5-15 tok/s. Memory bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls the actual CPU-only number; Phase-8 measures how close we get with NPU in the loop.

**The goal is "credible CPU+NPU mix that's faster than pure CPU," not "saturate the 6 TOPS rating."**

---

## What this audit unblocks

After Phase 1 completes (the table above is filled), Phase 2 can pick:

- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough along) or write directly against the MMIO regs (if not)
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork llama.cpp's CUDA-style backend pattern for ggml)
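
The compute-bound figure in the throughput sanity check can be reproduced in a few lines. This is a minimal sketch that assumes the spec-sheet numbers quoted in this file (about 2 TOPS INT8 per core, three cores) and the usual convention that one multiply-accumulate counts as two operations. The function name is hypothetical and nothing here is measured.

```python
def compute_bound_latency_s(n_weights: float, peak_ops_per_s: float, ops_per_mac: int = 2) -> float:
    """Seconds per token if the forward pass were limited only by peak compute.

    Assumes every weight takes part in exactly one multiply-accumulate per token
    and ignores memory bandwidth, DMA setup, and all CPU-side work.
    """
    ops_per_token = ops_per_mac * n_weights
    return ops_per_token / peak_ops_per_s


N_WEIGHTS = 1.5e9      # qwen2.5-1.5B, approximate weight count from this file
PEAK_OPS = 3 * 2e12    # 3 cores x ~2 TOPS INT8 each (spec-sheet peak, not measured)

latency = compute_bound_latency_s(N_WEIGHTS, PEAK_OPS)
tok_per_s = 1.0 / latency
print(f"{latency * 1e3:.2f} ms/token, {tok_per_s:.0f} tok/s (theoretical ceiling only)")
```

Treat the result as a ceiling that flags impossible claims, not as a target: the 5-15 tok/s estimate comes from the costs this calculation leaves out.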