Rosenblatt Phase-1 closeout: rocket-driver substrate inventory

Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`). All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`, compatible
  `rockchip,rk3588-rknn-core`) in mainline since 6.18 but ship
  `status = "disabled"`; board enable is the Phase-2 unblock, not a
  driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions. Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first NPU
  job, to avoid DMA-window faults.

Files:

- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to flip
  the three rknn-core nodes to "okay" (+ matching mmu nodes), carries the
  IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local carry
  patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing upstream-watch items.
+197
-30
@@ -5,8 +5,45 @@
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** _(pending: first Phase-1 task)_
**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_
**Audit date:** 2026-05-19
**Kernel under audit:** boltzmann, `7.0.0-rc3-ARCH+ #2 SMP PREEMPT_DYNAMIC Wed Apr 29 11:16:17 CEST 2026 aarch64` (image `linux-rk3588-marfrit-A1`)

---

## Local-host findings (boltzmann)

The custom `linux-rk3588-marfrit-A1` build is based on mainline 7.0-rc3 and
**already ships Tomeu Vizoso's upstream in-tree NPU accel driver** as a
loadable module:

| Probe | Result |
|---|---|
| `CONFIG_DRM_ACCEL` | `y` (DRM-accel core built-in; major 261 = `accel` in `/proc/devices`) |
| `CONFIG_DRM_ACCEL_ROCKET` | `m` (Tomeu's `rocket` driver, in-tree, author "Tomeu Vizoso") |
| Module file | `/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko` |
| Compat aliases | `rockchip,rk3588-rknn-core` (per-core NPU node) |
| Module deps | `gpu-sched`, `drm_shmem_helper` |
| DT nodes | `npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000`: all `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** |
| `/dev/accel/` | does not exist (no device bound: `rocket` isn't loaded and the nodes are disabled) |
| `dmesg` rknpu/npu/accel lines | zero since boot |

**Net:** the driver-side path is already there. The blocker is the
device tree: `pd_npu` is wired but the three `rknn-core` nodes are
`status = "disabled"`. Enabling them (DT overlay or rebuilt board DTB)
should let `rocket` probe and instantiate `/dev/accel/accel0..2`.
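
The DT-node probe above can be repeated after each kernel or DTB change. Below is a minimal sketch, assuming the live device tree is exposed under `/sys/firmware/devicetree/base` and that a node without a `status` property counts as enabled (the usual devicetree convention). The helper names `dt_status` and `npu_status` are hypothetical, not part of any existing tool.

```python
from pathlib import Path

# Node names as observed on boltzmann in the table above.
NPU_NODES = ("npu@fdab0000", "npu@fdac0000", "npu@fdad0000")


def dt_status(node_dir: Path) -> str:
    """Return a node's DT status; a missing property is treated as "okay"."""
    status_file = node_dir / "status"
    if not status_file.exists():
        return "okay"
    # String properties are NUL-terminated when read from sysfs.
    return status_file.read_bytes().rstrip(b"\x00").decode("ascii")


def npu_status(dt_base: Path = Path("/sys/firmware/devicetree/base")) -> dict[str, str]:
    """Map each expected NPU node to its status, or "absent" if not in the DT."""
    result = {}
    for name in NPU_NODES:
        node = dt_base / name
        result[name] = dt_status(node) if node.is_dir() else "absent"
    return result


if __name__ == "__main__":
    for name, status in npu_status().items():
        print(f"{name}: {status}")
```

Taking the base directory as a parameter keeps the check usable against a copied or mocked tree, not only on the board itself.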

This settles the Phase-1 exit question: **drive via the
DRM-accel uAPI through `rocket`, not a from-scratch MMIO driver.**

Display/GPU side, for completeness: `card0` + `renderD128` are bound to
`panthor` (mainline Mali G610), `card1` to `rockchip-drm`, so there is no
contention with the NPU stack.

The "Tomeu's rknpu" framing in the original Phase-1 plan is outdated:
the work landed under the name **`rocket`** in `drivers/accel/`.
References below use both names; read `rknpu` as `rocket` going
forward.

---

---
@@ -25,38 +62,168 @@

---

## Upstream work in flight
## Upstream state (audited 2026-05-19)

To audit on Phase 1:

1. **Tomeu Vizoso's rknpu / DRM-accel series**
   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
     `linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff
     date per `reference_linux_rockchip_ml` memory.
   - State to capture: which patches merged, which are review-pending,
     which dropped on the floor.
2. **`drivers/accel/` mainline state**
   - Check `drivers/accel/` in current torvalds tree: list of in-tree
     accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
     there yet.
3. **DT bindings**
   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`:
     does it exist in mainline?
4. **Userspace bridge**
   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
     Capture repo URL + current state.

Fill the table below during Phase-1 audit:

Tomeu Vizoso's RK3588 NPU work landed in **Linux 6.18 (Nov 2025)** under
the codename **`rocket`** (not `rknpu`). The matching userspace shipped
in **Mesa 25.3** (Rocket Gallium driver + Teflon TFLite delegate).
Boltzmann at 7.0.0-rc3 is one major release past the merge.

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | _TBD_ | | |
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
| userspace uAPI consumer | _TBD_ | | |
| Tomeu's WIP branch | _TBD_ | | |
| `drivers/accel/rocket/` | **in-tree since 6.18** | <https://github.com/torvalds/linux/tree/master/drivers/accel/rocket> · [kernel docs](https://docs.kernel.org/accel/rocket/index.html) | Author Tomeu Vizoso. Files: `rocket_core.c`, `rocket_device.c`, `rocket_drv.c`, `rocket_gem.c`, `rocket_job.c`, `rocket_registers.h`. Kconfig `DRM_ACCEL_ROCKET`. |
| DT binding `rockchip,rk3588-rknn-core.yaml` | in-tree (6.18 pull) | [v8 06/10 patch](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554406.html) | Path `Documentation/devicetree/bindings/npu/`. Requires 3 reg blocks (pc/cna/core), 4 clocks, IOMMU, power-domain, resets. |
| `pd_npu` power-domain in `rk3588-base.dtsi` | in-tree (6.18 pull) | [v8 07/10](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554409.html) | Label added so board files can wire a regulator instead of leaving the domain permanently on. |
| `npu@fdab/c/d_0000` nodes in `rk3588.dtsi` | in-tree (boltzmann's DT has all 3) | observed at `/sys/firmware/devicetree/base/npu@fdab0000` etc. | All `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** in the DT boltzmann is booting. |
| Per-board enable on Rock 5 ITX+ (boltzmann) | **not done** | n/a | Three `rknn-core` nodes ship disabled; need a board overlay or DTS edit to set `status = "okay"` before `rocket` can probe. |
| Userspace consumer (Mesa Rocket + Teflon) | merged in **Mesa 25.3** | [Tomeu blog "we are in mainline"](https://blog.tomeuvizoso.net/2025/07/rockchip-npu-update-6-we-are-in-mainline.html) · [Mesa Teflon docs](https://docs.mesa3d.org/teflon.html) | TFLite delegate; the authoritative reference for how jobs are encoded and submitted over the rocket uAPI. |
| v8 series cover letter (final revision) | **merged via DRM-misc-next → 6.18** | [v8 cover](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554401.html) · [Patchew](https://patchew.org/linux/20250713-6-10-rocket-v8-0-64fa3115e910@tomeuvizoso.net/) · [LWN coverage](https://lwn.net/Articles/1029800/) | 10 patches. |
| BSP `rockchip-linux/rknpu2` (vendor) | out-of-tree, deprecated, superseded by `airockchip/rknn-toolkit2` | <https://github.com/rockchip-linux/rknpu2> · <https://github.com/rockchip-linux/kernel> | BSP `rknpu` driver under `drivers/rknpu/` in 5.10/6.1 BSP branches. Register-map partly readable from source; TRM fills gaps. Use as spec-extraction reference per `feedback_megabitchip_semantic_match`. |
| `airockchip/rknn-toolkit2` (vendor userspace) | latest **v2.3.2** (Apr 9 2025) | <https://github.com/airockchip/rknn-toolkit2/releases> | `librknnrt.so` is a closed pre-compiled aarch64 blob (~7 MB), restrictive Rockchip license. **Not usable.** `.rknn` model-format spec is the only interesting piece. |
| `airockchip/rknn-llm` (vendor LLM stack) | last release **v1.2.3** (Nov 2024) | <https://github.com/airockchip/rknn-llm/releases> | No release in over a year; looks stalled. Monitor only. |

---

## Decision: drive via the Rocket DRM-accel uAPI

The Phase-1 exit question ("accel uAPI vs own MMIO driver") answers
itself given the upstream state inventoried above:

- `rocket.ko` is already on disk on boltzmann
  (`/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko`,
  `intree: Y`) and aliased to the exact `rockchip,rk3588-rknn-core`
  compatible string the DT uses.
- The uAPI (`rocket_accel.h`) is stable enough that Mesa is shipping
  production consumers against it.
- Writing our own MMIO driver would mean bypassing IOMMU integration,
  reimplementing the (already-reviewed) `pd_npu` power-domain sequencing,
  and owning the forward-compatibility burden with zero upstream path.

**Unblock path:** patch the boltzmann board DTS (or apply an overlay) to
flip the three `npu@fdab/c/d_0000` nodes to `status = "okay"`, rebuild
the DTB, reboot, `modprobe rocket`, then expect `/dev/accel/accel0..2` to
appear along with the corresponding `dmesg` probe lines. That's a Phase-2 task.
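
The last step of the unblock path (checking that accel nodes appeared) could be scripted along these lines. This is only a sketch: it lists whatever `accelN` entries exist under a directory and leaves the judgment about how many to expect to the caller, since the node count is something to confirm at bringup. The function name `accel_nodes` is hypothetical.

```python
import re
from pathlib import Path


def accel_nodes(dev_root: Path = Path("/dev/accel")) -> list[str]:
    """Return the accelN entries under dev_root, sorted by index."""
    if not dev_root.is_dir():
        # Matches the pre-bringup state: /dev/accel/ does not exist at all.
        return []
    names = [p.name for p in dev_root.iterdir()
             if re.fullmatch(r"accel\d+", p.name)]
    return sorted(names, key=lambda name: int(name[len("accel"):]))


if __name__ == "__main__":
    found = accel_nodes()
    print("accel nodes:", ", ".join(found) if found else "none")
```

An empty result after `modprobe rocket` would point back at the DT status or at the `dmesg` probe output.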

---

## Op-coverage gotcha: direct hit on Phase-2 scope

The agent flagged a structural risk worth surfacing before Phase 2:

> Op coverage is limited today. Convolutions + tensor add + fused ReLU
> covers MobileNet-class models but not a transformer attention block.

Rosenblatt's premise is offloading the heavy **GEMM** of attention + FFN
to the NPU. If `rocket`'s userspace path today is conv-centric (Teflon
demos MobileNetV1/V2 + MobileDet) and the kernel-side op set doesn't
yet expose a clean matmul primitive, we will need to either:

1. **Shoehorn matmul through conv-1×1** with appropriate tensor reshape
   (legitimate on this silicon, since the RKNPU2 vendor stack does exactly
   this, but it depends on what shapes the rocket uAPI lets us submit), or
2. **Land NPU matmul / additional op coverage upstream** ourselves (a
   real contribution, but adds scope), or
3. **Drop to a thinner shim** that submits raw NPU command buffers
   while still going through the rocket GEM + job-submit lifecycle (so
   we keep IOMMU + power-domain correctness for free).

This is exactly the kind of question Phase 2 was scoped to answer
("smallest-possible op-set llama.cpp can offload"). The decision now
becomes "what does rocket's submit-job ioctl actually accept?", which we
answer by reading `drivers/accel/rocket/rocket_job.c` and the Mesa
Rocket Gallium driver.
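
The conv-1×1 option rests on a simple identity: an `M × K` by `K × N` matrix product equals a 1×1 convolution over a `1 × M × K` feature map with `N` output channels. The pure-Python sketch below only illustrates that arithmetic. It says nothing about the tensor layouts, quantization, or shape limits the rocket uAPI or the hardware impose, and all function names are illustrative.

```python
def matmul(a, b):
    """Reference GEMM: a is M x K, b is K x N, result is M x N."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def conv1x1(feature_map, kernels):
    """1x1 convolution over an H x W x C_in feature map.

    kernels[c_out][c_in] holds one scalar weight per channel pair, so each
    output pixel is a dot product over the input channels of that pixel.
    """
    return [[[sum(pixel[c] * w[c] for c in range(len(pixel))) for w in kernels]
             for pixel in row]
            for row in feature_map]


def matmul_via_conv1x1(a, b):
    """Compute a @ b by viewing a as a 1 x M x K feature map."""
    k, n = len(b), len(b[0])
    feature_map = [a]  # H = 1, W = M, C_in = K
    # One kernel per output column of b: N kernels of K weights (b transposed).
    kernels = [[b[p][j] for p in range(k)] for j in range(n)]
    return conv1x1(feature_map, kernels)[0]  # drop the H = 1 axis


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
assert matmul_via_conv1x1(a, b) == matmul(a, b)
```

The rows of the left operand become pixels and the reduction dimension becomes the input-channel axis, which is why the question of which shapes a job may carry decides whether this route is practical.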

---

## IOMMU v1.0 hazard: blocks naive Phase-2 unblock

Surfaced from a `linux-rockchip` thread of 2026-04-03 ("Re: [PATCH]
iommu/rockchip: fix page table allocation flags for v2 IOMMU", Midgy
BALON → Simon Xue / Jonas Karlman). The RK356x SoCs integrate **two
distinct IOMMU IP versions, v1.0 and v2.0**. The NPU and ISP use v1.0;
VOP2 uses v2.0. Both map 40-bit physical pages, but **v1.0 does not
support placing the first-level page table (DTE) above 4 GB**.

The current DT binding `rockchip,rk3568-iommu` does **not** distinguish
the two: both IP versions share the same compatible string. On `boltzmann`
all three NPU IOMMUs carry it as the fallback behind
`rockchip,rk3588-iommu`, so they will be driven by the v2.0 code path:

```
/sys/firmware/devicetree/base/iommu@fdab9000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdaca000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdada000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: (none → okay default)
```

With `MemTotal = 32 GB` on boltzmann, two failure modes are reachable
the moment we enable the NPU:

1. **DTE allocated at or above `0x100000000`** → v1.0 hardware silently
   truncates or errors. Page-walk fails, every NPU job faults.
2. **SWIOTLB bounce-buffer PTE poisoning**. With `DMA_BIT_MASK(32)` on
   the NPU device, bounce buffers land below 4 GB,
   `phys_to_virt(bounce_addr)` is used as the PTE write target, and the
   following `dma_sync_single_for_device(DMA_TO_DEVICE)` overwrites the
   live PTEs with zeros from the original buffer.
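
The address arithmetic behind the first failure mode is small enough to write down. The sketch below assumes a 4 KiB first-level table and treats "reachable by v1.0" as "the whole table lies below 4 GiB". The helper names are illustrative, and `dma_bit_mask` simply mirrors the value of the kernel's `DMA_BIT_MASK(n)` macro.

```python
FOUR_GIB = 1 << 32  # 0x100000000, the first address a 32-bit field cannot hold


def dma_bit_mask(bits: int) -> int:
    """Value of the kernel's DMA_BIT_MASK(bits): the low `bits` bits set."""
    return (1 << bits) - 1


def dte_ok_for_v1(dte_phys: int, table_size: int = 4096) -> bool:
    """True if the first-level table [dte_phys, dte_phys + table_size) is below 4 GiB.

    table_size defaults to 4 KiB, which is an assumption of this sketch.
    """
    return dte_phys + table_size <= FOUR_GIB


# A 32-bit mask tops out one byte short of the 4 GiB boundary.
assert dma_bit_mask(32) == FOUR_GIB - 1
# The last 4 KiB page below 4 GiB is fine; the first page above is not.
assert dte_ok_for_v1(0xFFFF_F000)
assert not dte_ok_for_v1(0x1_0000_0000)
```

On a 32 GB host most of RAM sits above that boundary, so an allocation made without a below-4-GiB constraint is likely to fail this check.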

**Planned upstream fix** (three parts, in flight per the thread):

1. Simon Xue, per-device-ops patch:
   `<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`
2. Midgy BALON, "Add `rockchip,rk3568-iommu-v1` compatible for IOMMU
   v1.0 blocks (NPU, ISP/VICAP) on RK3568": adds ops with
   `.gfp_flags = GFP_DMA32` and `.dma_bit_mask = DMA_BIT_MASK(40)`.
3. DT update: NPU + VICAP MMU nodes flipped to the new compat.

**Mainline state of the fix (verified via cgit, 2026-05-19):**

| Patch | Posted | Merged to torvalds master? |
|---|---|---|
| Simon Xue per-device-ops `[20260310105303.128859-1]` | 2026-03-10 | **no** (most recent rockchip-iommu commit is the 2025-06-27 dead-loop fix) |
| Midgy `-v1` discriminator (`[1/2]`) | not yet posted as of 2026-04-03 | **no** |
| DT update for `rknpu_mmu` / `vicap_mmu` (`[2/2]`) | not yet posted | **no** |
| Treewide context: `kmalloc_obj` / `alloc_obj GFP_KERNEL` defaults | 2026-02-21 | merged; possible aggravator of the original bug, may force Midgy's series to be rebased |

**Subtlety:** Midgy's earlier patch (the standalone one referenced as
`In-Reply-To: <20260331075010.1463-1-midgy971@gmail.com>`) modified
`iommu_data_ops_v2` directly and was withdrawn after Simon's analysis,
because it would have over-constrained VOP2 and other v2 users. The
discriminator-compat approach is correct but not yet upstream.

**Implication for Rosenblatt Phase 2:** the README's Phase-2 unblock
plan ("enable DT nodes → `modprobe rocket` → `/dev/accel/accel0..2`")
will **almost certainly trigger one of the two failure modes** on
32 GB boltzmann. Three possible workarounds:

- **A. Boot with `mem=4G`** (degraded but valid Phase-2 path; lets us
  validate the rocket bringup end-to-end without addressing the IOMMU
  bug).
- **B. Carry Midgy's discriminator + DT changes as local patches** on
  the marfrit kernel branch. Track upstream landing; rebase or drop.
- **C. Wait for upstream.** Given that the last visible activity was
  2026-04-03 and the discriminator series isn't yet posted, this could
  stall Rosenblatt arbitrarily. Not recommended.

Recommended: **start with (A) for Phase-2 bringup**, switch to (B) when
we want to use the full 32 GB and need to benchmark realistic working
sets. Track upstream as a standing item.
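
Before the first NPU job under workaround (A), it may be worth confirming that the running kernel was actually booted with the cap. The sketch below parses a command line (for example the contents of `/proc/cmdline`) for a plain decimal `mem=<n>[KMGT]` token. It is a simplification of the kernel's own parsing, the function names are illustrative, and it only inspects the parameter: where the remaining RAM sits in the physical address map still has to be checked separately.

```python
import re

_SUFFIX_SHIFT = {"": 0, "k": 10, "m": 20, "g": 30, "t": 40}


def mem_limit_bytes(cmdline: str):
    """Return the last `mem=<size>` limit on the command line in bytes, or None."""
    limit = None
    for token in cmdline.split():
        match = re.fullmatch(r"mem=(\d+)([KkMmGgTt]?)", token)
        if match:
            limit = int(match.group(1)) << _SUFFIX_SHIFT[match.group(2).lower()]
    return limit


def capped_at_4g(cmdline: str) -> bool:
    """True if a mem= limit is present and does not exceed 4 GiB."""
    limit = mem_limit_bytes(cmdline)
    return limit is not None and limit <= (4 << 30)


if __name__ == "__main__":
    example = "console=ttyS2,1500000 mem=4G root=/dev/nvme0n1p2 rw"
    print(capped_at_4g(example))
```

A bringup script could refuse to load `rocket` when this returns false and no local IOMMU patches are applied.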

References (Anubis-gated; fetch with a JS-capable browser or via
`lore.kernel.org/lkml/` mirrors):
- Original Midgy patch: `<https://lore.kernel.org/all/20260331075010.1463-1-midgy971@gmail.com/>`
- Simon Xue per-device-ops: `<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`

---

## Vendor stack: spec source only, not a runtime

- `librknnrt.so`: proprietary aarch64 binary, restrictive Rockchip
  license (no reverse-engineering, no redistribution). Do not link.
- `.rknn` model format: the one interface we may need to understand if
  we want to consume pre-quantized vendor model weights. Otherwise we
  quantize from GGUF ourselves.
- BSP `drivers/rknpu/` in `rockchip-linux/kernel`: usable as a
  register-layout reference. Read; don't lift.

---