rosenblatt/docs/npu-mainline-status.md
marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory
Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.
2026-05-19 12:41:31 +00:00

# RK3588 NPU on mainline — status audit (Phase 1)
> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** 2026-05-19

**Kernel under audit:** boltzmann, `7.0.0-rc3-ARCH+ #2 SMP PREEMPT_DYNAMIC Wed Apr 29 11:16:17 CEST 2026 aarch64` (image `linux-rk3588-marfrit-A1`)

---
## Local-host findings (boltzmann)
The custom `linux-rk3588-marfrit-A1` build is mainline-rc3-based and **already
ships Tomeu Vizoso's upstream in-tree NPU accel driver** as a loadable module:
| Probe | Result |
|---|---|
| `CONFIG_DRM_ACCEL` | `y` (DRM-accel core built-in; major 261 = `accel` in `/proc/devices`) |
| `CONFIG_DRM_ACCEL_ROCKET` | `m` (Tomeu's `rocket` driver, intree, author "Tomeu Vizoso") |
| Module file | `/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko` |
| Compat aliases | `rockchip,rk3588-rknn-core` (per-core NPU node) |
| Module deps | `gpu-sched`, `drm_shmem_helper` |
| DT nodes | `npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000` — all `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** |
| `/dev/accel/` | does not exist (no device bound — rocket isn't loaded and nodes are disabled) |
| `dmesg` rknpu/npu/accel lines | zero since boot |
**Net:** the driver-side path is already there. The blocker is the
device-tree: `pd_npu` is wired but the three `rknn-core` nodes are
`status = "disabled"`. Enabling them (DT overlay or rebuilt board DTB)
should let `rocket` probe and instantiate `/dev/accel/accel0..2`.
This decisively answers the Phase-1 exit question: **drive via the
DRM-accel uAPI through `rocket`, not a from-scratch MMIO driver.**
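The device-tree probe in the table above can be repeated with a few lines of Python. This is a minimal sketch, not part of the repo: the helper names `dt_status` and `npu_report` are hypothetical, and the only inputs taken from this audit are the three node names and the sysfs device-tree path. A node without a `status` property is reported as "okay", following the usual device-tree default.

```python
from pathlib import Path

# Paths and node names as observed in this audit.
DT_BASE = Path("/sys/firmware/devicetree/base")
NPU_NODES = ("npu@fdab0000", "npu@fdac0000", "npu@fdad0000")


def dt_status(node_dir: Path) -> str:
    """Return a node's status; a missing `status` property counts as "okay"."""
    status_file = node_dir / "status"
    if not status_file.exists():
        return "okay"
    # String properties exposed through sysfs are NUL-terminated.
    return status_file.read_bytes().rstrip(b"\x00").decode("ascii")


def npu_report(dt_base: Path = DT_BASE) -> dict:
    """Map each expected NPU node to its status, or "missing" if absent."""
    report = {}
    for name in NPU_NODES:
        node_dir = dt_base / name
        report[name] = dt_status(node_dir) if node_dir.is_dir() else "missing"
    return report
```

On boltzmann as audited, `npu_report()` would be expected to show all three nodes as `disabled`; after the Phase-2 overlay they should read `okay`.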
Display/GPU side, for completeness: `card0` + `renderD128` are bound to
`panthor` (mainline Mali G610), `card1` to `rockchip-drm` — no
contention with the NPU stack.
The "Tomeu's rknpu" framing in the original Phase-1 plan is outdated:
the work landed under the name **`rocket`** in `drivers/accel/`.
References below use both names; consider `rknpu` → `rocket` going
forward.

---
## Hardware quick-spec
- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS
INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local
copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:**
https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
quantization scheme).
---
## Upstream state (audited 2026-05-19)
Tomeu Vizoso's RK3588 NPU work landed in **Linux 6.18 (Nov 2025)** under
the codename **`rocket`** (not `rknpu`). The matching userspace shipped
in **Mesa 25.3** (Rocket Gallium driver + Teflon TFLite delegate).
Boltzmann at 7.0.0-rc3 is past the 6.18 merge, so the driver is in its tree without backports.
| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rocket/` | **in-tree since 6.18** | <https://github.com/torvalds/linux/tree/master/drivers/accel/rocket> · [kernel docs](https://docs.kernel.org/accel/rocket/index.html) | Author Tomeu Vizoso. Files: `rocket_core.c`, `_device.c`, `_drv.c`, `_gem.c`, `_job.c`, `_registers.h`. Kconfig `DRM_ACCEL_ROCKET`. |
| DT binding `rockchip,rk3588-rknn-core.yaml` | in-tree (6.18 pull) | [v8 06/10 patch](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554406.html) | Path `Documentation/devicetree/bindings/npu/`. Requires 3 reg blocks (pc/cna/core), 4 clocks, IOMMU, power-domain, resets. |
| `pd_npu` power-domain in `rk3588-base.dtsi` | in-tree (6.18 pull) | [v8 07/10](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554409.html) | Label added so board files can wire a regulator instead of leaving the domain permanently on. |
| `npu@fdab/c/d_0000` nodes in `rk3588.dtsi` | in-tree (boltzmann's DT has all 3) | observed at `/sys/firmware/devicetree/base/npu@fdab0000` etc. | All `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** in the DT boltzmann is booting. |
| Per-board enable on Rock 5 ITX+ (boltzmann) | **not done** | n/a | Three `rknn-core` nodes ship disabled; need a board overlay or DTS edit to set `status = "okay"` before `rocket` can probe. |
| Userspace consumer (Mesa Rocket + Teflon) | merged in **Mesa 25.3** | [Tomeu blog "we are in mainline"](https://blog.tomeuvizoso.net/2025/07/rockchip-npu-update-6-we-are-in-mainline.html) · [Mesa Teflon docs](https://docs.mesa3d.org/teflon.html) | TFLite delegate; the authoritative reference for how jobs are encoded and submitted over the rocket uAPI. |
| v8 series cover letter (final revision) | **merged via DRM-misc-next → 6.18** | [v8 cover](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554401.html) · [Patchew](https://patchew.org/linux/20250713-6-10-rocket-v8-0-64fa3115e910@tomeuvizoso.net/) · [LWN coverage](https://lwn.net/Articles/1029800/) | 10 patches. |
| BSP `rockchip-linux/rknpu2` (vendor) | out-of-tree, deprecated, superseded by `airockchip/rknn-toolkit2` | <https://github.com/rockchip-linux/rknpu2> · <https://github.com/rockchip-linux/kernel> | BSP `rknpu` driver under `drivers/rknpu/` in 5.10/6.1 BSP branches. Register-map partly readable from source; TRM fills gaps. Use as spec-extraction reference per `feedback_megabitchip_semantic_match`. |
| `airockchip/rknn-toolkit2` (vendor userspace) | latest **v2.3.2** (Apr 9 2025) | <https://github.com/airockchip/rknn-toolkit2/releases> | `librknnrt.so` is a closed pre-compiled aarch64 blob (~7 MB), restrictive Rockchip license. **Not usable.** `.rknn` model-format spec is the only interesting piece. |
| `airockchip/rknn-llm` (vendor LLM stack) | last release **v1.2.3** (Nov 2024) | <https://github.com/airockchip/rknn-llm/releases> | No release in >1 yr — looks stalled. Monitor only. |
---
## Decision: drive via the Rocket DRM-accel uAPI
The Phase-1 exit question — "accel uAPI vs own MMIO driver" — answers
itself with the upstream state we just inventoried:
- `rocket.ko` is already on disk on boltzmann
(`/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko`,
`intree: Y`) and aliased to the exact `rockchip,rk3588-rknn-core`
compatible string the DT uses.
- The uAPI (`rocket_accel.h`) is stable enough that Mesa is shipping
production consumers against it.
- Writing our own MMIO driver would mean bypassing IOMMU integration,
reimplementing the (already-reviewed) `pd_npu` power-domain sequencing,
and owning forward-compatibility burden with zero upstream path.
**Unblock path:** patch the boltzmann board DTS (or apply an overlay) to
flip the three `npu@fdab/c/d_0000` nodes to `status = "okay"`, rebuild
the DTB, reboot, `modprobe rocket`, expect `/dev/accel/accel0..2` to
appear and corresponding `dmesg` probe lines. That's a Phase-2 task.
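A post-reboot check for the last step of the unblock path could look like the sketch below. The function names are hypothetical, and the expected node count of three comes from the `accel0..2` expectation stated above. It is left as a parameter because the audit has not yet observed what the driver actually exposes.

```python
from pathlib import Path


def accel_nodes(dev_dir: Path = Path("/dev/accel")) -> list[str]:
    """List accelN entries under dev_dir; empty if the directory is absent."""
    if not dev_dir.is_dir():
        return []
    return sorted(p.name for p in dev_dir.glob("accel[0-9]*"))


def bringup_ok(dev_dir: Path = Path("/dev/accel"), expected_nodes: int = 3) -> bool:
    """True when the number of accel nodes matches the expectation."""
    return len(accel_nodes(dev_dir)) == expected_nodes
```

Before the DT change this returns `False` on boltzmann, since `/dev/accel/` does not exist there yet.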
---
## Op-coverage gotcha — direct hit on Phase-2 scope
The agent flagged a structural risk worth surfacing before Phase 2:
> Op coverage is limited today. Convolutions + tensor add + fused ReLU
> covers MobileNet-class models but not a transformer attention block.
Rosenblatt's premise is offloading the heavy **GEMM** of attention + FFN
to the NPU. If `rocket`'s userspace path today is conv-centric (Teflon
demos MobileNetV1/V2 + MobileDet) and the kernel-side op set doesn't
yet expose a clean matmul primitive, we will need to either:
1. **Shoehorn matmul through conv-1×1** with appropriate tensor reshape
(legit on this silicon — RKNPU2 vendor stack does exactly this — but
it depends on what shapes the rocket uAPI lets us submit), or
2. **Land NPU matmul / additional op coverage upstream** ourselves (a
real contribution, but adds scope), or
3. **Drop to a thinner shim** that submits raw NPU command buffers
while still going through the rocket GEM + job-submit lifecycle (so
we keep IOMMU + power-domain correctness for free).
This is exactly the kind of question Phase 2 was scoped to answer
("smallest-possible op-set llama.cpp can offload"). The decision now
becomes "what does rocket's submit-job ioctl actually accept?" which we
answer by reading `drivers/accel/rocket/rocket_job.c` and the Mesa
Rocket Gallium driver.
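The equivalence behind option 1 can be checked on the CPU before touching the NPU. The sketch below is pure Python and says nothing about what the rocket uAPI accepts; it only shows that an M×K by K×N matmul is the same computation as a 1×1 convolution over an M×1 feature map with K input channels and N filters. All function names are illustrative.

```python
def matmul(a, b):
    """Reference GEMM: a is M x K, b is K x N, result is M x N."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def conv1x1(x, w):
    """1x1 convolution, stride 1, no bias.

    x: feature map indexed [H][W][C_in]
    w: filters indexed [C_out][C_in]
    returns: feature map indexed [H][W][C_out]
    """
    return [[[sum(px[c] * f[c] for c in range(len(f))) for f in w]
             for px in row]
            for row in x]


def matmul_as_conv1x1(a, b):
    """Compute a @ b by reshaping into a 1x1 convolution."""
    m, k, n = len(a), len(b), len(b[0])
    # Each row of `a` becomes one pixel: H = M, W = 1, C_in = K.
    x = [[a[i]] for i in range(m)]
    # Each column of `b` becomes one filter: C_out = N, each of length K.
    w = [[b[p][j] for p in range(k)] for j in range(n)]
    y = conv1x1(x, w)
    # Drop the W = 1 axis to get back an M x N matrix.
    return [y[i][0] for i in range(m)]
```

Whether the shapes this produces for attention and FFN layers fit the limits of the rocket job format is the open question the paragraph above defers to `rocket_job.c` and the Mesa driver.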
---
## IOMMU v1.0 hazard — blocks naive Phase-2 unblock
Surfaced from a `linux-rockchip` thread of 2026-04-03 ("Re: [PATCH]
iommu/rockchip: fix page table allocation flags for v2 IOMMU", Midgy
BALON → Simon Xue / Jonas Karlman). The RK356x SoCs integrate **two
distinct IOMMU IP versions, v1.0 and v2.0**. The NPU and ISP use v1.0;
VOP2 uses v2.0. Both map 40-bit physical pages, but **v1.0 does not
support placing the first-level page table (DTE) above 4 GB**.
The current DT binding `rockchip,rk3568-iommu` does **not** distinguish
the two: both IP versions share one compat. On `boltzmann` all three
NPU IOMMUs carry `rockchip,rk3588-iommu` with `rockchip,rk3568-iommu`
as the fallback, so they will be driven by the v2.0 code path:
```
/sys/firmware/devicetree/base/iommu@fdab9000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdaca000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdada000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: (none → okay default)
```
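The `compatible` values in the listing above are stored in sysfs as a NUL-separated string list. A small decoder makes it easy to re-check, after a kernel bump, whether the proposed `rockchip,rk3568-iommu-v1` discriminator has appeared on the NPU MMU nodes. This is a sketch with hypothetical helper names; the v1 compat string is the one proposed in the thread discussed in this section and is not in mainline as of this audit.

```python
# Proposed in the linux-rockchip thread; not upstream as of the audit date.
V1_COMPAT = "rockchip,rk3568-iommu-v1"


def decode_compatible(raw: bytes) -> list[str]:
    """Split a raw device-tree `compatible` property into its strings."""
    return [s.decode("ascii") for s in raw.split(b"\x00") if s]


def has_v1_discriminator(raw: bytes) -> bool:
    """True if the node advertises the proposed IOMMU v1.0 compat."""
    return V1_COMPAT in decode_compatible(raw)
```

Typical use would be `has_v1_discriminator(Path(".../iommu@fdab9000/compatible").read_bytes())` for each of the three MMU nodes.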
With `MemTotal = 32 GB` on boltzmann, two failure modes are reachable
the moment we enable the NPU:
1. **DTE allocated above `0x100000000`** → v1.0 hardware silently
truncates or errors. Page-walk fails, every NPU job faults.
2. **SWIOTLB bounce-buffer PTE poisoning**. With `DMA_BIT_MASK(32)` on
the NPU device, bounce buffers land below 4 GB,
`phys_to_virt(bounce_addr)` is used as the PTE write target, and the
following `dma_sync_single_for_device(DMA_TO_DEVICE)` overwrites the
live PTEs with zeros from the original buffer.
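Both failure modes depend on the kernel having usable RAM above the 4 GiB boundary. One way to check this before loading `rocket` is to look at the top-level `System RAM` ranges in `/proc/iomem`. The sketch below assumes the usual `start-end : label` line format and must be fed text read as root, because unprivileged reads show zeroed addresses. The function names are hypothetical.

```python
FOUR_GIB = 1 << 32


def ram_ranges(iomem_text: str) -> list[tuple[int, int]]:
    """Return (start, end) for each top-level "System RAM" entry."""
    ranges = []
    for line in iomem_text.splitlines():
        # Nested entries are indented; only top-level ranges matter here.
        if line.startswith(" ") or " : " not in line:
            continue
        span, label = line.split(" : ", 1)
        if label.strip() != "System RAM":
            continue
        start, end = (int(part, 16) for part in span.split("-"))
        ranges.append((start, end))
    return ranges


def has_ram_above_4g(iomem_text: str) -> bool:
    """True if any System RAM range reaches 0x100000000 or higher."""
    return any(end >= FOUR_GIB for _, end in ram_ranges(iomem_text))
```

A `True` result means page-table or bounce-buffer allocations can land above `0x100000000`, which is the precondition for the faults described above.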
**Planned upstream fix** (three parts, in flight per the thread):
1. Simon Xue, per-device-ops patch
`<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`
2. Midgy BALON, "Add `rockchip,rk3568-iommu-v1` compatible for IOMMU
v1.0 blocks (NPU, ISP/VICAP) on RK3568" — adds ops with
`.gfp_flags = GFP_DMA32` and `.dma_bit_mask = DMA_BIT_MASK(40)`.
3. DT update: NPU + VICAP MMU nodes flipped to the new compat.
**Mainline state of the fix (verified via cgit, 2026-05-19):**
| Patch | Posted | Merged to torvalds master? |
|---|---|---|
| Simon Xue per-device-ops `[20260310105303.128859-1]` | 2026-03-10 | **no** (most recent rockchip-iommu commit is the 2025-06-27 dead-loop fix) |
| Midgy `-v1` discriminator (`[1/2]`) | not yet posted as of 2026-04-03 | **no** |
| DT update for `rknpu_mmu` / `vicap_mmu` (`[2/2]`) | not yet posted | **no** |
| Treewide context: `kmalloc_obj` / `alloc_obj GFP_KERNEL` defaults | 2026-02-21 | merged — possible aggravator of the original bug, may force Midgy's series to be rebased |
**Subtlety:** Midgy's earlier patch (the standalone one referenced as
`In-Reply-To: <20260331075010.1463-1-midgy971@gmail.com>`) modified
`iommu_data_ops_v2` directly and was withdrawn after Simon's analysis
— it would have over-constrained VOP2 and other v2 users. The
discriminator-compat approach is correct but not yet upstream.
**Implication for Rosenblatt Phase 2:** the README's Phase-2 unblock
plan ("enable DT nodes → `modprobe rocket` → `/dev/accel/accel0..2`")
will **almost certainly trigger one of the two failure modes** on
32 GB boltzmann. Three viable workarounds:
- **A. Boot with `mem=4G`** (degraded but valid Phase-2 path; lets us
validate the rocket bringup end-to-end without addressing the IOMMU
bug).
- **B. Carry Midgy's discriminator + DT changes as local patches** on
the marfrit kernel branch. Track upstream landing; rebase or drop.
- **C. Wait for upstream.** Last visible activity was 2026-04-03 and
Midgy's discriminator series isn't yet posted, so this could stall
Rosenblatt arbitrarily. Not recommended.
Recommended: **start with (A) for Phase-2 bringup**, switch to (B) when
we want to use the full 32 GB and need to benchmark realistic working
sets. Track upstream as standing item.
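For workaround (A), a bringup script can refuse to load `rocket` unless the running kernel was actually booted with a memory cap. The parser below is a simplified sketch: it handles only decimal values with an optional `K`, `M` or `G` suffix, keeps the last `mem=` token if several are present, and uses hypothetical function names. It also treats a cap of 4 GiB or less as sufficient. That is an assumption, because `mem=` limits the amount of RAM rather than the highest physical address, so checking the resulting physical ranges separately is prudent.

```python
import re

_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
FOUR_GIB = 4 << 30


def mem_limit_bytes(cmdline: str):
    """Return the `mem=` cap in bytes, or None if no simple cap is present."""
    limit = None
    for token in cmdline.split():
        match = re.fullmatch(r"mem=(\d+)([KMG]?)", token, flags=re.IGNORECASE)
        if match:
            # Later tokens override earlier ones in this sketch.
            limit = int(match.group(1)) * _SUFFIX[match.group(2).upper()]
    return limit


def workaround_a_active(cmdline: str) -> bool:
    """True if the command line caps memory at 4 GiB or less (assumed sufficient)."""
    limit = mem_limit_bytes(cmdline)
    return limit is not None and limit <= FOUR_GIB
```

The input would normally be the contents of `/proc/cmdline`.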
References (Anubis-gated; fetch with a JS-capable browser or via
`lore.kernel.org/lkml/` mirrors):
- Original Midgy patch: `<https://lore.kernel.org/all/20260331075010.1463-1-midgy971@gmail.com/>`
- Simon Xue per-device-ops: `<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`
---
## Vendor stack — spec source only, not a runtime
- `librknnrt.so`: proprietary aarch64 binary, restrictive Rockchip
license (no reverse-engineering, no redistribution). Do not link.
- `.rknn` model format: the one interface we may need to understand if
we want to consume pre-quantized vendor model weights. Otherwise we
quantize from GGUF ourselves.
- BSP `drivers/rknpu/` in `rockchip-linux/kernel`: usable as a
register-layout reference. Read; don't lift.
---
## Vendor stack (for spec extraction only — don't lift code)
- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - `rknn-llm` (separate `airockchip/rknn-llm` repo) for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
(uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match`:
read the source, don't link the binary.
Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
no vendor-kernel-only module. If we need to copy a register-layout table
from BSP source to get started, that's fine; copying a `.so` is not.
---
## Hardware capability bounds
What the NPU CAN do (Phase-1-relevant):
- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs
What the NPU CANNOT do that LLM needs (so stays on CPU):
- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
with throughput penalty)
- Sampling (multinomial, top-k, top-p) — pure CPU
- Sparse attention — no sparse-tile support in the NPU op set
- Embedding lookup (rare access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm — these run on CPU because the NPU's
fixed-function pipeline doesn't have these as first-class ops
without going through a generic matmul shoehorn
Phase-2 question: which of those CAN move to NPU later, with op-fusion or
custom kernels.
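The per-channel INT8 scales listed among the capabilities above are what a GGUF-to-NPU quantization path would have to produce. The sketch below shows a generic symmetric per-channel scheme, with one scale per output channel and values clamped to ±127. It is an illustration of the idea only; it is not claimed to match the vendor's `.rknn` quantization or anything the rocket stack requires, and the function names are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel.

    weights: list of rows (one per output channel), each a list of floats.
    Returns (q_rows, scales) where q_rows holds ints in [-127, 127].
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        # An all-zero channel gets a dummy scale to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Reconstruct approximate float weights from INT8 values and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

With this scheme the reconstruction error per weight is bounded by half the channel's scale, which is why channels with very different dynamic ranges benefit from separate scales.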
---
## Throughput sanity check (theoretical only — not measured)
qwen2.5-1.5B Q4_K_M:
- ~1.5e9 weights, attention + FFN dominated by matmul
- Per-token: roughly 2 × 1.5e9 = 3 GOP (1.5 GMAC, counting each multiply-accumulate as 2 ops; forward pass, weight×activation)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound
- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC
Realistic upper bound on first cut: probably 5-15 tok/s. Memory
bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
the actual CPU-only number; Phase-8 measures how close we get with NPU
in the loop. **The goal is "credible CPU+NPU mix that's faster than
pure CPU," not "saturate the 6 TOPS rating."**
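The compute-bound figure above is simple enough to keep as a reusable calculation for other model sizes. This sketch reproduces only the theoretical ceiling. It counts two operations per weight per token (one multiply-accumulate) and deliberately ignores memory bandwidth, DMA setup and CPU-side work, which the text expects to dominate. The function name is hypothetical.

```python
def compute_bound_estimate(n_weights: float, peak_ops_per_s: float):
    """Return (ops_per_token, seconds_per_token, tokens_per_second).

    Assumes every weight contributes one multiply-accumulate (2 ops) per
    token and that the NPU sustains its peak rate, which it will not.
    """
    ops_per_token = 2 * n_weights
    seconds_per_token = ops_per_token / peak_ops_per_s
    return ops_per_token, seconds_per_token, 1.0 / seconds_per_token


# qwen2.5-1.5B against the 6 TOPS combined INT8 rating.
ops, secs, tok_s = compute_bound_estimate(1.5e9, 6e12)
print(f"{ops / 1e9:.1f} GOP/token, {secs * 1e3:.2f} ms/token, {tok_s:.0f} tok/s ceiling")
```

Comparing this ceiling with the Phase-4 and Phase-8 measurements gives a rough sense of how far the real pipeline sits from the compute limit.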
---
## What this audit unblocks
With the table above filled, the scaffold's open choices for Phase 2 now stand as follows:
- uAPI vs MMIO: settled in the decision section above. Drive the NPU
through the `rocket` accel uAPI rather than writing directly against
the MMIO regs.
- which kernel branch we baseline against: `rocket` has been in mainline
since 6.18, so mainline-rc rather than Tomeu's WIP tree.
- which userspace shim we use (Mesa's Rocket/Teflon path, our own, or a
fork of llama.cpp's CUDA-style backend pattern for ggml): still open.