# RK3588 NPU on mainline: status audit (Phase 1)

> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** 2026-05-19

**Kernel under audit:** boltzmann, `7.0.0-rc3-ARCH+ #2 SMP PREEMPT_DYNAMIC Wed Apr 29 11:16:17 CEST 2026 aarch64` (image `linux-rk3588-marfrit-A1`)

---

## Local-host findings (boltzmann)

The custom `linux-rk3588-marfrit-A1` build is mainline-rc3-based and **already ships Tomeu Vizoso's upstream in-tree NPU accel driver** as a loadable module:

| Probe | Result |
|---|---|
| `CONFIG_DRM_ACCEL` | `y` (DRM-accel core built-in; major 261 = `accel` in `/proc/devices`) |
| `CONFIG_DRM_ACCEL_ROCKET` | `m` (Tomeu's `rocket` driver, intree, author "Tomeu Vizoso") |
| Module file | `/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko` |
| Compat aliases | `rockchip,rk3588-rknn-core` (per-core NPU node) |
| Module deps | `gpu-sched`, `drm_shmem_helper` |
| DT nodes | `npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000`, all `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** |
| `/dev/accel/` | does not exist (no device bound: rocket isn't loaded and nodes are disabled) |
| `dmesg` rknpu/npu/accel lines | zero since boot |

**Net:** the driver-side path is already there. The blocker is the device-tree: `pd_npu` is wired but the three `rknn-core` nodes are `status = "disabled"`. Enabling them (DT overlay or rebuilt board DTB) should let `rocket` probe and instantiate `/dev/accel/accel0..2`. This decisively answers the Phase-1 exit question: **drive via the DRM-accel uAPI through `rocket`, not a from-scratch MMIO driver.**

Display/GPU side, for completeness: `card0` + `renderD128` are bound to `panthor` (mainline Mali G610), `card1` to `rockchip-drm`, so there is no contention with the NPU stack.
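The DT and `/dev/accel` probes from the table can be re-run on each re-audit with a small script. What follows is a minimal sketch, not the tooling used for this audit: the helper names (`read_dt_string`, `node_status`, `accel_devices`) are hypothetical, and it assumes the three node names listed above and the standard sysfs device-tree layout, where string properties are NUL-terminated and a node without a `status` property counts as enabled.

```python
from pathlib import Path

# Node names taken from the probe table; paths are the standard sysfs DT export.
DT_BASE = Path("/sys/firmware/devicetree/base")
NPU_NODES = ("npu@fdab0000", "npu@fdac0000", "npu@fdad0000")


def read_dt_string(path: Path) -> str:
    """Read a device-tree string property and strip its NUL terminator."""
    return path.read_bytes().rstrip(b"\x00").decode("ascii", errors="replace")


def node_status(base: Path, node: str) -> str:
    """Return 'absent' if the node is missing, else its status.

    A node with no 'status' property is treated as 'okay'.
    """
    node_dir = base / node
    if not node_dir.is_dir():
        return "absent"
    status_file = node_dir / "status"
    return read_dt_string(status_file) if status_file.exists() else "okay"


def accel_devices(dev_dir: Path = Path("/dev/accel")) -> list[str]:
    """List accelN device nodes, or an empty list if the directory is missing."""
    if not dev_dir.is_dir():
        return []
    return sorted(p.name for p in dev_dir.glob("accel*"))


if __name__ == "__main__":
    for node in NPU_NODES:
        print(f"{node}: {node_status(DT_BASE, node)}")
    print("accel devices:", accel_devices() or "none")
```

On boltzmann as audited, this would be expected to report all three nodes as `disabled` and no accel devices; after the DT change, the same script doubles as a quick bringup check.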
The "Tomeu's rknpu" framing in the original Phase-1 plan is outdated: the work landed under the name **`rocket`** in `drivers/accel/`. References below use both names; read `rknpu` as `rocket` going forward.

---

## Hardware quick-spec

- **NPU:** Rockchip "RKNPU2": three independent NPU cores, each ~2 TOPS INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0; local copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:** https://github.com/airockchip/rknn-toolkit2 (SDK + ops list + quantization scheme).

---

## Upstream state (audited 2026-05-19)

Tomeu Vizoso's RK3588 NPU work landed in **Linux 6.18 (Nov 2025)** under the codename **`rocket`** (not `rknpu`). The matching userspace shipped in **Mesa 25.3** (Rocket Gallium driver + Teflon TFLite delegate). Boltzmann at 7.0.0-rc3 is one major release past the merge.

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rocket/` | **in-tree since 6.18** | [kernel docs](https://docs.kernel.org/accel/rocket/index.html) | Author Tomeu Vizoso. Files: `rocket_core.c`, `rocket_device.c`, `rocket_drv.c`, `rocket_gem.c`, `rocket_job.c`, `rocket_registers.h`. Kconfig `DRM_ACCEL_ROCKET`. |
| DT binding `rockchip,rk3588-rknn-core.yaml` | in-tree (6.18 pull) | [v8 06/10 patch](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554406.html) | Path `Documentation/devicetree/bindings/npu/`. Requires 3 reg blocks (pc/cna/core), 4 clocks, IOMMU, power-domain, resets. |
| `pd_npu` power-domain in `rk3588-base.dtsi` | in-tree (6.18 pull) | [v8 07/10](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554409.html) | Label added so board files can wire a regulator instead of leaving the domain permanently on. |
| `npu@fdab/c/d_0000` nodes in `rk3588.dtsi` | in-tree (boltzmann's DT has all 3) | observed at `/sys/firmware/devicetree/base/npu@fdab0000` etc. | All `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** in the DT boltzmann is booting. |
| Per-board enable on Rock 5 ITX+ (boltzmann) | **not done** | n/a | Three `rknn-core` nodes ship disabled; need a board overlay or DTS edit to set `status = "okay"` before `rocket` can probe. |
| Userspace consumer (Mesa Rocket + Teflon) | merged in **Mesa 25.3** | [Tomeu blog "we are in mainline"](https://blog.tomeuvizoso.net/2025/07/rockchip-npu-update-6-we-are-in-mainline.html) · [Mesa Teflon docs](https://docs.mesa3d.org/teflon.html) | TFLite delegate; the authoritative reference for how jobs are encoded and submitted over the rocket uAPI. |
| v8 series cover letter (final revision) | **merged via DRM-misc-next → 6.18** | [v8 cover](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554401.html) · [Patchew](https://patchew.org/linux/20250713-6-10-rocket-v8-0-64fa3115e910@tomeuvizoso.net/) · [LWN coverage](https://lwn.net/Articles/1029800/) | 10 patches. |
| BSP `rockchip-linux/rknpu2` (vendor) | out-of-tree, deprecated, superseded by `airockchip/rknn-toolkit2` | | BSP `rknpu` driver under `drivers/rknpu/` in 5.10/6.1 BSP branches. Register-map partly readable from source; TRM fills gaps. Use as spec-extraction reference per `feedback_megabitchip_semantic_match`. |
| `airockchip/rknn-toolkit2` (vendor userspace) | latest **v2.3.2** (Apr 9 2025) | | `librknnrt.so` is a closed pre-compiled aarch64 blob (~7 MB), restrictive Rockchip license. **Not usable.** `.rknn` model-format spec is the only interesting piece. |
| `airockchip/rknn-llm` (vendor LLM stack) | last release **v1.2.3** (Nov 2024) | | No release in >1 yr; looks stalled. Monitor only. |

---

## Decision: drive via the Rocket DRM-accel uAPI

The Phase-1 exit question ("accel uAPI vs own MMIO driver") answers itself with the upstream state we just inventoried:

- `rocket.ko` is already on disk on boltzmann (`/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko`, `intree: Y`) and aliased to the exact `rockchip,rk3588-rknn-core` compatible string the DT uses.
- The uAPI (`rocket_accel.h`) is stable enough that Mesa is shipping production consumers against it.
- Writing our own MMIO driver would mean bypassing IOMMU integration, reimplementing the (already-reviewed) `pd_npu` power-domain sequencing, and owning the forward-compatibility burden with zero upstream path.

**Unblock path:** patch the boltzmann board DTS (or apply an overlay) to flip the three `npu@fdab/c/d_0000` nodes to `status = "okay"`, rebuild the DTB, reboot, and `modprobe rocket`; then expect `/dev/accel/accel0..2` and the corresponding `dmesg` probe lines to appear. Two of the three NPU IOMMU nodes are also `disabled` on boltzmann (see the IOMMU section below), so they likely need the same flip. That's a Phase-2 task.

---

## Op-coverage gotcha: direct hit on Phase-2 scope

The agent flagged a structural risk worth surfacing before Phase 2:

> Op coverage is limited today. Convolutions + tensor add + fused ReLU
> covers MobileNet-class models but not a transformer attention block.

Rosenblatt's premise is offloading the heavy **GEMM** of attention + FFN to the NPU. If `rocket`'s userspace path today is conv-centric (Teflon demos MobileNetV1/V2 + MobileDet) and the kernel-side op set doesn't yet expose a clean matmul primitive, we will need to do one of the following:

1. **Shoehorn matmul through conv-1×1** with appropriate tensor reshape (legit on this silicon, since the RKNPU2 vendor stack does exactly this, but it depends on what shapes the rocket uAPI lets us submit), or
2. **Land NPU matmul / additional op coverage upstream** ourselves (a real contribution, but adds scope), or
3. **Drop to a thinner shim** that submits raw NPU command buffers while still going through the rocket GEM + job-submit lifecycle (so we keep IOMMU + power-domain correctness for free).

This is exactly the kind of question Phase 2 was scoped to answer ("smallest-possible op-set llama.cpp can offload"). The decision now becomes "what does rocket's submit-job ioctl actually accept?", which we answer by reading `drivers/accel/rocket/rocket_job.c` and the Mesa Rocket Gallium driver.

---

## IOMMU v1.0 hazard: blocks naive Phase-2 unblock

Surfaced from a `linux-rockchip` thread of 2026-04-03 ("Re: [PATCH] iommu/rockchip: fix page table allocation flags for v2 IOMMU", Midgy BALON → Simon Xue / Jonas Karlman).

The RK356x SoCs integrate **two distinct IOMMU IP versions, v1.0 and v2.0**. The NPU and ISP use v1.0; VOP2 uses v2.0. Both map 40-bit physical pages, but **v1.0 does not support placing the first-level page table (DTE) above 4 GB**.

The current DT binding `rockchip,rk3568-iommu` does **not** distinguish the two: both IP versions share the same compat. On `boltzmann` all three NPU IOMMUs carry this compat as the fallback (after the more specific `rockchip,rk3588-iommu`), so they will be driven by the v2.0 code path:

```
/sys/firmware/devicetree/base/iommu@fdab9000
  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu
  status:     disabled
/sys/firmware/devicetree/base/iommu@fdaca000
  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu
  status:     disabled
/sys/firmware/devicetree/base/iommu@fdada000
  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu
  status:     (none → okay default)
```

With `MemTotal = 32 GB` on boltzmann, two failure modes are reachable the moment we enable the NPU:

1. **DTE allocated above `0x100000000`** → v1.0 hardware silently truncates or errors. Page-walk fails, every NPU job faults.
2. **SWIOTLB bounce-buffer PTE poisoning**. With `DMA_BIT_MASK(32)` on the NPU device, bounce buffers land below 4 GB, `phys_to_virt(bounce_addr)` is used as the PTE write target, and the following `dma_sync_single_for_device(DMA_TO_DEVICE)` overwrites the live PTEs with zeros from the original buffer.

**Planned upstream fix** (Simon's patch plus a two-patch series from Midgy, in flight per the thread):

1. Simon Xue, per-device-ops patch: ``
2. Midgy BALON, "Add `rockchip,rk3568-iommu-v1` compatible for IOMMU v1.0 blocks (NPU, ISP/VICAP) on RK3568": adds ops with `.gfp_flags = GFP_DMA32` and `.dma_bit_mask = DMA_BIT_MASK(40)`.
3. DT update: NPU + VICAP MMU nodes flipped to the new compat.

**Mainline state of the fix (verified via cgit, 2026-05-19):**

| Patch | Posted | Merged to torvalds master? |
|---|---|---|
| Simon Xue per-device-ops `[20260310105303.128859-1]` | 2026-03-10 | **no** (most recent rockchip-iommu commit is the 2025-06-27 dead-loop fix) |
| Midgy `-v1` discriminator (`[1/2]`) | not yet posted as of 2026-04-03 | **no** |
| DT update for `rknpu_mmu` / `vicap_mmu` (`[2/2]`) | not yet posted | **no** |
| Treewide context: `kmalloc_obj` / `alloc_obj GFP_KERNEL` defaults | 2026-02-21 | merged; possible aggravator of the original bug, may force Midgy's series to be rebased |

**Subtlety:** Midgy's earlier patch (the standalone one referenced as `In-Reply-To: <20260331075010.1463-1-midgy971@gmail.com>`) modified `iommu_data_ops_v2` directly and was withdrawn after Simon's analysis, because it would have over-constrained VOP2 and other v2 users. The discriminator-compat approach is correct but not yet upstream.

**Implication for Rosenblatt Phase 2:** the README's Phase-2 unblock plan ("enable DT nodes → `modprobe rocket` → `/dev/accel/accel0..2`") will **almost certainly trigger one of the two failure modes** on 32 GB boltzmann. Three viable workarounds:

- **A. Boot with `mem=4G`** (degraded but valid Phase-2 path; lets us validate the rocket bringup end-to-end without addressing the IOMMU bug). Caveat: `mem=` caps the amount of RAM rather than the highest physical address, so check `/proc/iomem` after boot and lower the value if any System RAM still sits above `0x100000000`.
- **B. Carry Midgy's discriminator + DT changes as local patches** on the marfrit kernel branch. Track upstream landing; rebase or drop.
- **C. Wait for upstream.** Given that the last visible activity was 2026-04-03 and Midgy's discriminator series isn't yet posted, this could stall Rosenblatt arbitrarily. Not recommended.

Recommended: **start with (A) for Phase-2 bringup**, switch to (B) when we want to use the full 32 GB and need to benchmark realistic working sets. Track upstream as a standing item.

References (Anubis-gated; fetch with a JS-capable browser or via `lore.kernel.org/lkml/` mirrors):

- Original Midgy patch: ``
- Simon Xue per-device-ops: ``

---

## Vendor stack: spec source only, not a runtime

- `librknnrt.so`: proprietary aarch64 binary, restrictive Rockchip license (no reverse-engineering, no redistribution). Do not link.
- `.rknn` model format: the one interface we may need to understand if we want to consume pre-quantized vendor model weights. Otherwise we quantize from GGUF ourselves.
- BSP `drivers/rknpu/` in `rockchip-linux/kernel`: usable as a register-layout reference. Read; don't lift.

---

## Vendor stack (for spec extraction only; don't lift code)

- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - `rknn-llm` fork for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match`: read the source, don't link the binary.

Operational rule: **mainline-clean from day 1.** No vendor blob runtime, no vendor-kernel-only module. If we need to copy a register-layout table from BSP source to get started, that's fine; copying a `.so` is not.
---

## Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):

- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs

What the NPU CANNOT do that an LLM needs (so it stays on CPU):

- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but with throughput penalty)
- Sampling (multinomial, top-k, top-p): pure CPU
- Sparse attention: no sparse-tile support in the NPU op set
- Embedding lookup (rare access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm: these run on CPU because the NPU's fixed-function pipeline doesn't have these as first-class ops without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or custom kernels.

---

## Throughput sanity check (theoretical only, not measured)

qwen2.5-1.5B Q4_K_M:

- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 1.5e9 MACs, i.e. 2 × 1.5e9 = 3 GOP (forward pass, weight×activation, counting each MAC as two ops)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC

Realistic upper bound on first cut: probably 5-15 tok/s. Memory bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls the actual CPU-only number; Phase-8 measures how close we get with NPU in the loop.

**The goal is "credible CPU+NPU mix that's faster than pure CPU," not "saturate the 6 TOPS rating."**

---

## What this audit unblocks

After Phase 1 completes (the table above is filled), Phase 2 can pick:

- whether to drive the NPU via Tomeu's accel uAPI or write directly against the MMIO regs (settled by the Decision section above: the `rocket` accel uAPI)
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork llama.cpp's CUDA-style backend pattern for ggml)
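The compute-bound figure from the throughput sanity check can be reproduced with a few lines of arithmetic. This is a sketch of that estimate only, under the same assumptions the section states (one MAC per weight per token, each MAC counted as two ops, the full 6 TOPS peak rating); the function name `compute_bound_ms_per_token` is hypothetical, and nothing here models memory bandwidth, DMA setup, or CPU-side work.

```python
def compute_bound_ms_per_token(n_weights: float, peak_tops: float) -> float:
    """Milliseconds per token if peak-rate INT8 arithmetic were the only cost."""
    macs = n_weights                  # assumption: one MAC per weight per token
    ops = 2 * macs                    # TOPS convention: 1 MAC = multiply + add
    seconds = ops / (peak_tops * 1e12)
    return seconds * 1e3


if __name__ == "__main__":
    ms = compute_bound_ms_per_token(1.5e9, 6.0)
    print(f"{ms:.2f} ms/token, {1000 / ms:.0f} tok/s (compute-bound ceiling)")
```

With the inputs above this prints `0.50 ms/token, 2000 tok/s (compute-bound ceiling)`, matching the deliberately unrealistic ceiling; swapping in a single core's ~2 TOPS gives the per-core equivalent.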