marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory

Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.

2026-05-19 12:41:31 +00:00

RK3588 NPU on mainline — status audit (Phase 1)

Phase 1 deliverable. This file is the canonical snapshot of where mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of the audit date. Re-audit on each kernel-version bump or when a blocker hits.

Audit date: 2026-05-19
Kernel under audit: boltzmann, 7.0.0-rc3-ARCH+ #2 SMP PREEMPT_DYNAMIC Wed Apr 29 11:16:17 CEST 2026 aarch64 (image linux-rk3588-marfrit-A1)

Local-host findings (boltzmann)

The custom linux-rk3588-marfrit-A1 build is mainline-rc3-based and already ships Tomeu Vizoso's upstream in-tree NPU accel driver as a loadable module:

| Probe | Result |
|---|---|
| `CONFIG_DRM_ACCEL` | `y` (DRM-accel core built-in; major 261 = `accel` in `/proc/devices`) |
| `CONFIG_DRM_ACCEL_ROCKET` | `m` (Tomeu's `rocket` driver, intree, author "Tomeu Vizoso") |
| Module file | `/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko` |
| Compat aliases | `rockchip,rk3588-rknn-core` (per-core NPU node) |
| Module deps | `gpu-sched`, `drm_shmem_helper` |
| DT nodes | `npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000`; all `compatible = "rockchip,rk3588-rknn-core"`, `status = "disabled"` |
| `/dev/accel/` | does not exist (no device bound: rocket isn't loaded and nodes are disabled) |
| `dmesg` rknpu/npu/accel lines | zero since boot |

Net: the driver-side path is already there. The blocker is the device-tree: pd_npu is wired but the three rknn-core nodes are status = "disabled". Enabling them (DT overlay or rebuilt board DTB) should let rocket probe and instantiate /dev/accel/accel0..2.
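The DT-node probe above can be repeated with a short script. This is a minimal sketch, assuming the live device tree is exposed under `/sys/firmware/devicetree/base` (as observed on boltzmann); the function name `npu_node_status` is made up for illustration.

```python
from pathlib import Path


def npu_node_status(dt_base="/sys/firmware/devicetree/base"):
    """Return {node_name: status} for every npu@* node under dt_base."""
    result = {}
    for node in sorted(Path(dt_base).glob("npu@*")):
        status_file = node / "status"
        if status_file.is_file():
            # Device-tree string properties are NUL-terminated in sysfs.
            status = status_file.read_bytes().rstrip(b"\0").decode()
        else:
            # A node without a status property counts as enabled ("okay").
            status = "okay"
        result[node.name] = status
    return result


if __name__ == "__main__":
    for name, status in npu_node_status().items():
        print(f"{name}: {status}")
```

On the audited kernel this should report `disabled` for all three `npu@` nodes; after the overlay is applied the same script is a quick way to confirm the flip took effect.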

This decisively answers the Phase-1 exit question: drive via the DRM-accel uAPI through rocket, not a from-scratch MMIO driver.

Display/GPU side, for completeness: card0 + renderD128 are bound to panthor (mainline Mali G610), card1 to rockchip-drm — no contention with the NPU stack.

The "Tomeu's rknpu" framing in the original Phase-1 plan is outdated: the work landed under the name rocket in drivers/accel/. Both names still appear below; going forward, read rknpu as rocket wherever it refers to the mainline driver.

Hardware quick-spec

NPU: Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with reduced throughput). Per-core local memory (~2 MB SRAM each).
Bus: AXI interconnect, dedicated DMA engine, integrated power domain (pd_npu).
TRM section: Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local copy referenced in reference_rk3588_trm memory.
Vendor docs source-of-truth: https://github.com/airockchip/rknn-toolkit2 (SDK + ops list + quantization scheme).

Upstream state (audited 2026-05-19)

Tomeu Vizoso's RK3588 NPU work landed in Linux 6.18 (Nov 2025) under the codename rocket (not rknpu). The matching userspace shipped in Mesa 25.3 (Rocket Gallium driver + Teflon TFLite delegate). Boltzmann's 7.0.0-rc3 kernel postdates that merge, so the driver is already in its tree.

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rocket/` | in-tree since 6.18 | https://github.com/torvalds/linux/tree/master/drivers/accel/rocket · kernel docs | Author Tomeu Vizoso. Files: `rocket_core.c`, `_device.c`, `_drv.c`, `_gem.c`, `_job.c`, `_registers.h`. Kconfig `DRM_ACCEL_ROCKET`. |
| DT binding `rockchip,rk3588-rknn-core.yaml` | in-tree (6.18 pull) | v8 06/10 patch | Path `Documentation/devicetree/bindings/npu/`. Requires 3 reg blocks (pc/cna/core), 4 clocks, IOMMU, power-domain, resets. |
| `pd_npu` power-domain in `rk3588-base.dtsi` | in-tree (6.18 pull) | v8 07/10 | Label added so board files can wire a regulator instead of leaving the domain permanently on. |
| `npu@fdab/c/d_0000` nodes in `rk3588.dtsi` | in-tree (boltzmann's DT has all 3) | observed at `/sys/firmware/devicetree/base/npu@fdab0000` etc. | All `compatible = "rockchip,rk3588-rknn-core"`, `status = "disabled"` in the DT boltzmann is booting. |
| Per-board enable on Rock 5 ITX+ (boltzmann) | not done | n/a | Three `rknn-core` nodes ship disabled; need a board overlay or DTS edit to set `status = "okay"` before `rocket` can probe. |
| Userspace consumer (Mesa Rocket + Teflon) | merged in Mesa 25.3 | Tomeu blog "we are in mainline" · Mesa Teflon docs | TFLite delegate; the authoritative reference for how jobs are encoded and submitted over the rocket uAPI. |
| v8 series cover letter (final revision) | merged via DRM-misc-next → 6.18 | v8 cover · Patchew · LWN coverage | 10 patches. |
| BSP `rockchip-linux/rknpu2` (vendor) | out-of-tree, deprecated, superseded by `airockchip/rknn-toolkit2` | https://github.com/rockchip-linux/rknpu2 · https://github.com/rockchip-linux/kernel | BSP `rknpu` driver under `drivers/rknpu/` in 5.10/6.1 BSP branches. Register-map partly readable from source; TRM fills gaps. Use as spec-extraction reference per `feedback_megabitchip_semantic_match`. |
| `airockchip/rknn-toolkit2` (vendor userspace) | latest v2.3.2 (Apr 9 2025) | https://github.com/airockchip/rknn-toolkit2/releases | `librknnrt.so` is a closed pre-compiled aarch64 blob (~7 MB), restrictive Rockchip license. Not usable. `.rknn` model-format spec is the only interesting piece. |
| `airockchip/rknn-llm` (vendor LLM stack) | last release v1.2.3 (Nov 2024) | https://github.com/airockchip/rknn-llm/releases | No release in >1 yr; looks stalled. Monitor only. |

Decision: drive via the Rocket DRM-accel uAPI

The Phase-1 exit question — "accel uAPI vs own MMIO driver" — answers itself with the upstream state we just inventoried:

- rocket.ko is already on disk on boltzmann (/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko, intree: Y) and aliased to the exact rockchip,rk3588-rknn-core compatible string the DT uses.
- The uAPI (rocket_accel.h) is stable enough that Mesa is shipping production consumers against it.
- Writing our own MMIO driver would mean bypassing IOMMU integration, reimplementing the (already-reviewed) pd_npu power-domain sequencing, and owning forward-compatibility burden with zero upstream path.

Unblock path: patch the boltzmann board DTS (or apply an overlay) to flip the three npu@fdab/c/d_0000 nodes to status = "okay", rebuild the DTB, reboot, modprobe rocket, expect /dev/accel/accel0..2 to appear and corresponding dmesg probe lines. That's a Phase-2 task.
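A post-reboot check for the unblock path could look like the sketch below. It only inspects `/proc/modules` and `/dev/accel`; the helper names are hypothetical, and how many `accel*` nodes the driver actually creates is left for the bringup to establish rather than assumed here.

```python
from pathlib import Path


def rocket_loaded(proc_modules="/proc/modules"):
    """True if a module named 'rocket' is listed in /proc/modules."""
    try:
        lines = Path(proc_modules).read_text().splitlines()
    except OSError:
        return False
    # Each line starts with the module name followed by a space.
    return any(line.split(" ", 1)[0] == "rocket" for line in lines)


def accel_nodes(dev_dir="/dev/accel"):
    """Return the sorted names of accel* entries, or [] if the dir is absent."""
    directory = Path(dev_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("accel*"))


if __name__ == "__main__":
    print("rocket loaded:", rocket_loaded())
    print("accel nodes:", accel_nodes() or "none")
```

Before the DT change both checks are expected to come back negative, matching the probe table; after `modprobe rocket` they give a quick pass/fail signal to pair with the dmesg probe lines.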

Op-coverage gotcha — direct hit on Phase-2 scope

The agent flagged a structural risk worth surfacing before Phase 2:

Op coverage is limited today. Convolutions + tensor add + fused ReLU covers MobileNet-class models but not a transformer attention block.

Rosenblatt's premise is offloading the heavy GEMM of attention + FFN to the NPU. If rocket's userspace path today is conv-centric (Teflon demos MobileNetV1/V2 + MobileDet) and the kernel-side op set doesn't yet expose a clean matmul primitive, we will need to either:

- Shoehorn matmul through conv-1×1 with appropriate tensor reshape (legit on this silicon, since the RKNPU2 vendor stack does exactly this, but it depends on what shapes the rocket uAPI lets us submit), or
- Land NPU matmul / additional op coverage upstream ourselves (a real contribution, but adds scope), or
- Drop to a thinner shim that submits raw NPU command buffers while still going through the rocket GEM + job-submit lifecycle (so we keep IOMMU + power-domain correctness for free).
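The conv-1×1 shoehorn in the first option rests on a simple identity: a matmul `X[M][K] · W[K][N]` equals a 1×1 convolution over a feature map with `K` input channels and `M` pixels, using `N` filters. The pure-Python sketch below only demonstrates that reshape; the tensor layout is an assumption for illustration and says nothing about what shapes the rocket uAPI or the hardware accept.

```python
def matmul(x, w):
    """Reference matmul: x is [M][K], w is [K][N], result is [M][N]."""
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][p] * w[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def conv1x1(fmap, kernels):
    """1x1 convolution. fmap is [C_in][H][W], kernels is [C_out][C_in]."""
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(kern[c] * fmap[c][y][x] for c in range(c_in))
              for x in range(width)]
             for y in range(height)]
            for kern in kernels]


def matmul_via_conv1x1(x, w):
    """Compute x @ w by reshaping into a 1x1 convolution and back."""
    m, k, n = len(x), len(w), len(w[0])
    # Each row of x becomes one pixel; its K entries become K input channels.
    fmap = [[[x[i][c]] for i in range(m)] for c in range(k)]   # [K][M][1]
    # Each column of w becomes one 1x1 filter over the K input channels.
    kernels = [[w[c][j] for c in range(k)] for j in range(n)]  # [N][K]
    out = conv1x1(fmap, kernels)                               # [N][M][1]
    # Undo the layout: output channel j at pixel i is result[i][j].
    return [[out[j][i][0] for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]
    w = [[1, 0], [0, 1], [2, 3]]
    assert matmul_via_conv1x1(x, w) == matmul(x, w)
    print(matmul_via_conv1x1(x, w))
```

The arithmetic is identical either way; what Phase 2 has to find out is whether the reshape cost and the submit-size limits make this path worthwhile on real attention and FFN shapes.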

This is exactly the kind of question Phase 2 was scoped to answer ("smallest-possible op-set llama.cpp can offload"). The decision now becomes "what does rocket's submit-job ioctl actually accept?" which we answer by reading drivers/accel/rocket/rocket_job.c and the Mesa Rocket Gallium driver.

IOMMU v1.0 hazard — blocks naive Phase-2 unblock

Surfaced from a linux-rockchip thread of 2026-04-03 ("Re: [PATCH] iommu/rockchip: fix page table allocation flags for v2 IOMMU", Midgy BALON → Simon Xue / Jonas Karlman). The RK356x SoCs integrate two distinct IOMMU IP versions, v1.0 and v2.0. The NPU and ISP use v1.0; VOP2 uses v2.0. Both map 40-bit physical pages, but v1.0 does not support placing the first-level page table (DTE) above 4 GB.

The current DT binding rockchip,rk3568-iommu does not distinguish the two: both IP versions share one compatible. On boltzmann all three NPU IOMMUs match it (listed as rockchip,rk3588-iommu with rockchip,rk3568-iommu as the fallback), so they will be driven by the v2.0 code path:

/sys/firmware/devicetree/base/iommu@fdab9000  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu  status: disabled
/sys/firmware/devicetree/base/iommu@fdaca000  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu  status: disabled
/sys/firmware/devicetree/base/iommu@fdada000  compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu  status: (none → okay default)
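The listing above can be reproduced by reading each node's `compatible` property, which sysfs exposes as NUL-separated strings ordered from most specific to fallback. A minimal sketch follows; the helper names are hypothetical, and `rockchip,rk3568-iommu-v1` is the compatible proposed in the thread, not one that exists upstream today.

```python
from pathlib import Path

GENERIC_COMPAT = "rockchip,rk3568-iommu"
# Proposed in the linux-rockchip thread; assumed name, not yet in mainline.
V1_COMPAT = "rockchip,rk3568-iommu-v1"


def parse_compatible(raw):
    """Split a raw DT 'compatible' property into its list of strings."""
    return [part.decode() for part in raw.split(b"\0") if part]


def read_compatible(node_path):
    """Read and parse the compatible property of one device-tree node."""
    return parse_compatible((Path(node_path) / "compatible").read_bytes())


def lacks_v1_discriminator(compat):
    """True if the node matches the generic binding without the v1 marker."""
    return GENERIC_COMPAT in compat and V1_COMPAT not in compat


if __name__ == "__main__":
    base = Path("/sys/firmware/devicetree/base")
    for node in sorted(base.glob("iommu@*")):
        compat = read_compatible(node)
        print(node.name, compat, "undiscriminated:", lacks_v1_discriminator(compat))
```

Run against boltzmann's current DT, every NPU IOMMU node would be flagged as undiscriminated, which is the condition the hazard below depends on.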

With MemTotal = 32 GB on boltzmann, two failure modes are reachable the moment we enable the NPU:

1. DTE allocated above 0x100000000 → v1.0 hardware silently truncates or errors. Page-walk fails, every NPU job faults.
2. SWIOTLB bounce-buffer PTE poisoning. With DMA_BIT_MASK(32) on the NPU device, bounce buffers land below 4 GB, phys_to_virt(bounce_addr) is used as the PTE write target, and the following dma_sync_single_for_device(DMA_TO_DEVICE) overwrites the live PTEs with zeros from the original buffer.
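Whether a given boot is exposed to either failure mode depends on memory being reachable above the 4 GiB boundary. The sketch below uses `MemTotal` from `/proc/meminfo` as a coarse proxy; this is an assumption, since the real criterion is the physical address of the allocation, and memory holes can push RAM above 4 GiB even when `MemTotal` is somewhat smaller.

```python
import re

FOUR_GIB = 4 * 1024 ** 3


def mem_total_bytes(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo text, in bytes."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) * 1024


def may_allocate_above_4g(meminfo_text):
    """Coarse check: more than 4 GiB visible means high allocations are possible."""
    return mem_total_bytes(meminfo_text) > FOUR_GIB


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        text = handle.read()
    print("MemTotal bytes:", mem_total_bytes(text))
    print("possible allocations above 4 GiB:", may_allocate_above_4g(text))
```

A guard like this at the top of a bringup script gives a cheap warning before the first NPU job is submitted on an unpatched kernel.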

Planned upstream fix (two-step, in flight per the thread):

1. Simon Xue, per-device-ops patch: <https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>
2. Midgy BALON, two-patch series on top of it:
   - [1/2] "Add rockchip,rk3568-iommu-v1 compatible for IOMMU v1.0 blocks (NPU, ISP/VICAP) on RK3568": adds ops with .gfp_flags = GFP_DMA32 and .dma_bit_mask = DMA_BIT_MASK(40).
   - [2/2] DT update: NPU + VICAP MMU nodes flipped to the new compat.

Mainline state of the fix (verified via cgit, 2026-05-19):

| Patch | Posted | Merged to torvalds master? |
|---|---|---|
| Simon Xue per-device-ops `[20260310105303.128859-1]` | 2026-03-10 | no (most recent rockchip-iommu commit is the 2025-06-27 dead-loop fix) |
| Midgy `-v1` discriminator (`[1/2]`) | not yet posted as of 2026-04-03 | no |
| DT update for `rknpu_mmu` / `vicap_mmu` (`[2/2]`) | not yet posted | no |
| Treewide context: `kmalloc_obj` / `alloc_obj GFP_KERNEL` defaults | 2026-02-21 | merged; possible aggravator of the original bug, may force Midgy's series to be rebased |

Subtlety: Midgy's earlier patch (the standalone one referenced as In-Reply-To: <20260331075010.1463-1-midgy971@gmail.com>) modified iommu_data_ops_v2 directly and was withdrawn after Simon's analysis — it would have over-constrained VOP2 and other v2 users. The discriminator-compat approach is correct but not yet upstream.

Implication for Rosenblatt Phase 2: the README's Phase-2 unblock plan ("enable DT nodes → modprobe rocket → /dev/accel/accel0..2") will almost certainly trigger one of the two failure modes on 32 GB boltzmann. Three viable workarounds:

A. Boot with mem=4G (degraded but valid Phase-2 path; lets us validate the rocket bringup end-to-end without addressing the IOMMU bug).
B. Carry Midgy's discriminator + DT changes as local patches on the marfrit kernel branch. Track upstream landing; rebase or drop.
C. Wait for upstream. Given that the last visible activity was 2026-04-03 and the -v1 discriminator series isn't yet posted, this could stall Rosenblatt arbitrarily. Not recommended.

Recommended: start with (A) for Phase-2 bringup, switch to (B) when we want to use the full 32 GB and need to benchmark realistic working sets. Track upstream as standing item.

References (Anubis-gated; fetch with a JS-capable browser or via lore.kernel.org/lkml/ mirrors):

Original Midgy patch: <https://lore.kernel.org/all/20260331075010.1463-1-midgy971@gmail.com/>
Simon Xue per-device-ops: <https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>

Vendor stack — spec source only, not a runtime

librknnrt.so: proprietary aarch64 binary, restrictive Rockchip license (no reverse-engineering, no redistribution). Do not link.
.rknn model format: the one interface we may need to understand if we want to consume pre-quantized vendor model weights. Otherwise we quantize from GGUF ourselves.
BSP drivers/rknpu/ in rockchip-linux/kernel: usable as a register-layout reference. Read; don't lift.

Vendor stack (for spec extraction only — don't lift code)

RKNPU2 SDK: airockchip/rknn-toolkit2, plus the companion airockchip/rknn-llm repo. Together they provide:
- librknnrt.so runtime (closed, vendor binary)
- rknn-convert ONNX → .rknn transcoder
- rknn-llm for LLM-specific quant + scheduling
Kernel module: vendor rknpu (BSP, drivers/rknpu/). Not mainline-compatible (uses Rockchip's iommu/dma shim). Spec-extract per feedback_megabitchip_semantic_match: read the source, don't link the binary.

Operational rule: mainline-clean from day 1. No vendor blob runtime, no vendor-kernel-only module. If we need to copy a register-layout table from BSP source to get started, that's fine; copying a .so is not.

Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):

INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
INT8 tensors with per-channel quantization scales
3D conv (mostly unused in LLM workloads)
LayerNorm + activation fusion in some configs
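To make "per-channel quantization scales" concrete, here is a minimal sketch of symmetric max-abs INT8 quantization with one scale per output channel. The symmetric scheme is an assumption chosen for illustration; the scheme the NPU and the vendor toolkit actually use should be taken from the rknn-toolkit2 documentation, not from this example.

```python
def quantize_per_channel(weights):
    """Quantize rows of floats to INT8, one scale per row (output channel).

    Returns (q_rows, scales) such that q * scale approximates the original.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        # An all-zero channel gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127 if max_abs else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows, scales):
    """Map INT8 rows back to floats using the per-row scales."""
    return [[q * scale for q in row] for row, scale in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[1.0, -1.0, 0.0], [0.02, 0.01, -0.005]]
    q, s = quantize_per_channel(w)
    print(q, s)
    print(dequantize_per_channel(q, s))
```

Giving each channel its own scale keeps a small-magnitude channel (the second row here) from being crushed to zero by a large-magnitude neighbour, which is the usual motivation for per-channel over per-tensor scaling.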

What the NPU CANNOT do that LLM needs (so stays on CPU):

fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but with throughput penalty)
Sampling (multinomial, top-k, top-p) — pure CPU
Sparse attention — no sparse-tile support in the NPU op set
Embedding lookup (rare access pattern; gather not in NPU op set)
RoPE, softmax, RMSNorm — these run on CPU because the NPU's fixed-function pipeline doesn't have these as first-class ops without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or custom kernels.

Throughput sanity check (theoretical only — not measured)

qwen2.5-1.5B Q4_K_M:

- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 1.5e9 MACs, i.e. 2 × 1.5e9 = 3 GOP (forward pass, weight×activation; one MAC counts as two ops in a TOPS rating)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC

Realistic upper bound on first cut: probably 5-15 tok/s. Memory bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls the actual CPU-only number; Phase-8 measures how close we get with NPU in the loop. The goal is "credible CPU+NPU mix that's faster than pure CPU," not "saturate the 6 TOPS rating."
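The arithmetic behind the sanity check, and the fraction of peak that the 5-15 tok/s estimate implies, can be reproduced with the short sketch below. It assumes two ops per weight per token and the 6 TOPS combined rating quoted above; the function names are illustrative only.

```python
def compute_bound_tok_per_s(n_weights, peak_ops_per_s):
    """Upper bound on tokens/s if compute were the only limit."""
    ops_per_token = 2 * n_weights  # one multiply + one add per weight
    return peak_ops_per_s / ops_per_token


def implied_utilisation(tok_per_s, n_weights, peak_ops_per_s):
    """Fraction of peak compute that a given token rate corresponds to."""
    return tok_per_s / compute_bound_tok_per_s(n_weights, peak_ops_per_s)


if __name__ == "__main__":
    weights = 1.5e9   # qwen2.5-1.5B
    peak = 6e12       # 3 cores x ~2 TOPS INT8
    print("compute-bound tok/s:", compute_bound_tok_per_s(weights, peak))
    for rate in (5, 15):
        util = implied_utilisation(rate, weights, peak)
        print(f"{rate} tok/s = {util:.2%} of peak")
```

In these terms the 5-15 tok/s estimate corresponds to well under 1% of the rated peak, which underlines that memory bandwidth and submission overhead, not MAC throughput, are the expected limits.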

What this audit unblocks

With Phase 1 complete (the table above is filled), Phase 2 can build on or settle:

- how to drive the NPU: answered above in favour of the rocket accel uAPI rather than writing directly against the MMIO regs
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork llama.cpp's CUDA-style backend pattern for ggml)
