Rosenblatt Phase-1 closeout: rocket-driver substrate inventory
Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`). All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`, compatible
  `rockchip,rk3588-rknn-core`) in mainline since 6.18 but ship
  `status = "disabled"`: board enable is the Phase-2 unblock, not a
  driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions. Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first NPU
  job, to avoid DMA-window faults.

Files:

- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to flip
  the three rknn-core nodes to "okay" (+ matching mmu nodes), carries
  the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local carry
  patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing upstream-watch items.
@@ -103,5 +103,12 @@ We're writing the second half.
|
|||||||
| Phase | State | Date |
|
| Phase | State | Date |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| 0 — bootstrap | done | 2026-05-19 |
|
| 0 — bootstrap | done | 2026-05-19 |
|
||||||
| 1 — substrate audit | open | |
|
| 1 — substrate audit | done | 2026-05-19 |
|
||||||
| 2..10 | pending | |
|
| 2 — formulate | open | |
|
||||||
|
| 3..10 | pending | |
|
||||||
|
|
||||||
|
Phase-1 closeout: `docs/phases.md` + `docs/npu-mainline-status.md`.
|
||||||
|
Headline: mainline driver name is **`rocket`** (not `rknpu`); it's
|
||||||
|
already shipped in boltzmann's kernel as a built module. Phase-2
|
||||||
|
unblock is small (DT enable + IOMMU v1.0 mitigation + modprobe),
|
||||||
|
not a driver port.
|
||||||
|
|||||||
@@ -64,12 +64,57 @@ clear answer to "do we drive via accel uAPI or write our own MMIO driver."
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Cross-phase / standing items
|
## Phase-2 unblock — prerequisites (NEW, Phase-1 outflow)
|
||||||
|
|
||||||
- [ ] Mirror Tomeu's WIP branch into a local clone for kernel hacking
|
Phase-1 audit (2026-05-19) reframed Phase-2 from "design rknpu backend
|
||||||
- [ ] Set up serial console on boltzmann for kernel-panic recovery (Quark
|
interface" to a concrete bringup sequence. See
|
||||||
umbrella; check current state)
|
`docs/npu-mainline-status.md` for full context.
|
||||||
- [ ] Add `project_rosenblatt.md` to claude-memory once Phase 1 closes (so
|
|
||||||
future sessions don't re-discover the campaign)
|
- [ ] Patch boltzmann board DTS / overlay to flip
|
||||||
- [ ] Decide repo home: marfrit/rosenblatt on git.reauktion.de (probably yes,
|
`npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000` from
|
||||||
after Phase-1 substrate is captured and the README isn't embarrassing)
|
`status = "disabled"` → `"okay"`. Rebuild DTB.
|
||||||
|
- [ ] **Mitigate IOMMU v1.0 hazard before first NPU job** (32 GB host).
|
||||||
|
Pick one:
|
||||||
|
- (A) Boot with `mem=4G` for first-bringup validation, OR
|
||||||
|
- (B) Carry local patches: Simon Xue per-device-ops
|
||||||
|
(`<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`)
|
||||||
|
+ Midgy `rockchip,rk3568-iommu-v1` discriminator compat
|
||||||
|
+ DT update for `rknpu_mmu` to the new compat.
|
||||||
|
- [ ] `modprobe rocket` and confirm `/dev/accel/accel0..2` appear, no
|
||||||
|
probe errors in dmesg, IOMMU faults absent.
|
||||||
|
- [ ] Read `drivers/accel/rocket/rocket_job.c` + Mesa Rocket Gallium to
|
||||||
|
determine submit-job uAPI capabilities — specifically whether
|
||||||
|
we can express a transformer matmul as a tile/op the NPU pipeline
|
||||||
|
accepts, or whether we need additional op coverage upstream.
|
||||||
|
- [ ] Decide matmul strategy (Phase-2 deliverable):
|
||||||
|
conv-1×1 shoehorn / extend rocket op set / thinner submit shim.
|
||||||
|
|
||||||
|
## Standing items — track upstream
|
||||||
|
|
||||||
|
- [ ] Watch `drivers/iommu/rockchip-iommu.c` for the discriminator
|
||||||
|
`rockchip,rk3568-iommu-v1` compat to land; drop local patch (B)
|
||||||
|
when it does.
|
||||||
|
- [ ] Watch `linux-rockchip` for the next iteration of Midgy / Simon's
|
||||||
|
thread (last visible activity 2026-04-03).
|
||||||
|
- [ ] Watch `drivers/accel/rocket/` for matmul / GEMM op additions.
|
||||||
|
|
||||||
|
## Cross-phase / standing items (older)
|
||||||
|
|
||||||
|
- [ ] Mirror Tomeu's branch — superseded: code is now in-tree.
|
||||||
|
Keep `git.kernel.org/.../torvalds/linux.git` checkout pinned to
|
||||||
|
the boltzmann kernel rev for in-tree reading.
|
||||||
|
- [ ] Set up serial console on boltzmann for kernel-panic recovery
|
||||||
|
(Quark umbrella; check current state) — **becomes load-bearing
|
||||||
|
once we start poking IOMMU code.**
|
||||||
|
- [x] Add `project_rosenblatt_overview.md` + `project_rocket_upstream_state.md`
|
||||||
|
to claude-memory — done 2026-05-19.
|
||||||
|
- [ ] Decide repo home: marfrit/rosenblatt on git.reauktion.de
|
||||||
|
(probably yes, after Phase-1 substrate is captured).
|
||||||
|
- [ ] **Resolve board-name discrepancy.** README and
|
||||||
|
`fleet/boltzmann.yaml` say boltzmann is a "Rock 5 ITX+" /
|
||||||
|
`rock-5-itx-plus`; the running DT reports
|
||||||
|
`model = "Radxa ROCK 5 ITX"`, `compatible = "radxa,rock-5-itx"`.
|
||||||
|
Confirm physical board model (Radxa sells both SKUs) and
|
||||||
|
either correct the README + manifest, or note that we boot
|
||||||
|
the plain-ITX DT on ITX+ hardware (likely fine; ITX+ is mostly
|
||||||
|
a connectivity-refresh, same SoC + same NPU silicon).
|
||||||
|
|||||||
+197
-30
@@ -5,8 +5,45 @@
|
|||||||
> the audit date. Re-audit on each kernel-version bump or when a
|
> the audit date. Re-audit on each kernel-version bump or when a
|
||||||
> blocker hits.
|
> blocker hits.
|
||||||
|
|
||||||
**Audit date:** _(pending — first Phase-1 task)_
|
**Audit date:** 2026-05-19
|
||||||
**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_
|
**Kernel under audit:** boltzmann, `7.0.0-rc3-ARCH+ #2 SMP PREEMPT_DYNAMIC Wed Apr 29 11:16:17 CEST 2026 aarch64` (image `linux-rk3588-marfrit-A1`)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Local-host findings (boltzmann)
|
||||||
|
|
||||||
|
The custom `linux-rk3588-marfrit-A1` build is mainline-rc3-based and **already
|
||||||
|
ships Tomeu Vizoso's upstream in-tree NPU accel driver** as a loadable module:
|
||||||
|
|
||||||
|
| Probe | Result |
|---|---|
| `CONFIG_DRM_ACCEL` | `y` (DRM-accel core built-in; major 261 = `accel` in `/proc/devices`) |
| `CONFIG_DRM_ACCEL_ROCKET` | `m` (Tomeu's `rocket` driver, intree, author "Tomeu Vizoso") |
| Module file | `/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko` |
| Compat aliases | `rockchip,rk3588-rknn-core` (per-core NPU node) |
| Module deps | `gpu-sched`, `drm_shmem_helper` |
| DT nodes | `npu@fdab0000`, `npu@fdac0000`, `npu@fdad0000` — all `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** |
| `/dev/accel/` | does not exist (no device bound — rocket isn't loaded and nodes are disabled) |
| `dmesg` rknpu/npu/accel lines | zero since boot |
|
||||||
|
|
||||||
|
**Net:** the driver-side path is already there. The blocker is the
|
||||||
|
device-tree: `pd_npu` is wired but the three `rknn-core` nodes are
|
||||||
|
`status = "disabled"`. Enabling them (DT overlay or rebuilt board DTB)
|
||||||
|
should let `rocket` probe and instantiate `/dev/accel/accel0..2`.
|
||||||
|
|
||||||
|
This decisively answers the Phase-1 exit question: **drive via the
|
||||||
|
DRM-accel uAPI through `rocket`, not a from-scratch MMIO driver.**
|
||||||
|
|
||||||
|
Display/GPU side, for completeness: `card0` + `renderD128` are bound to
|
||||||
|
`panthor` (mainline Mali G610), `card1` to `rockchip-drm` — no
|
||||||
|
contention with the NPU stack.
|
||||||
|
|
||||||
|
The "Tomeu's rknpu" framing in the original Phase-1 plan is outdated:
|
||||||
|
the work landed under the name **`rocket`** in `drivers/accel/`.
|
||||||
|
References below use both names; read `rknpu` as `rocket` going forward.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -25,38 +62,168 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Upstream work in flight
|
## Upstream state (audited 2026-05-19)
|
||||||
|
|
||||||
To audit on Phase 1:
|
Tomeu Vizoso's RK3588 NPU work landed in **Linux 6.18 (Nov 2025)** under
|
||||||
|
the codename **`rocket`** (not `rknpu`). The matching userspace shipped
|
||||||
1. **Tomeu Vizoso's rknpu / DRM-accel series**
|
in **Mesa 25.3** (Rocket Gallium driver + Teflon TFLite delegate).
|
||||||
- Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
|
Boltzmann at 7.0.0-rc3 is one major release past the merge.
|
||||||
`linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff
|
|
||||||
date per `reference_linux_rockchip_ml` memory.
|
|
||||||
- State to capture: which patches merged, which are review-pending,
|
|
||||||
which dropped on the floor.
|
|
||||||
2. **`drivers/accel/` mainline state**
|
|
||||||
- Check `drivers/accel/` in current torvalds tree — list of in-tree
|
|
||||||
accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
|
|
||||||
there yet.
|
|
||||||
3. **DT bindings**
|
|
||||||
- `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`
|
|
||||||
— does it exist in mainline?
|
|
||||||
4. **Userspace bridge**
|
|
||||||
- Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
|
|
||||||
Capture repo URL + current state.
|
|
||||||
|
|
||||||
Fill the table below during Phase-1 audit:
|
|
||||||
|
|
||||||
| Component | Mainline state | URL | Notes |
|
| Component | Mainline state | URL | Notes |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| `drivers/accel/rknpu/` | _TBD_ | | |
|
| `drivers/accel/rocket/` | **in-tree since 6.18** | <https://github.com/torvalds/linux/tree/master/drivers/accel/rocket> · [kernel docs](https://docs.kernel.org/accel/rocket/index.html) | Author Tomeu Vizoso. Files: `rocket_core.c`, `_device.c`, `_drv.c`, `_gem.c`, `_job.c`, `_registers.h`. Kconfig `DRM_ACCEL_ROCKET`. |
|
||||||
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
|
| DT binding `rockchip,rk3588-rknn-core.yaml` | in-tree (6.18 pull) | [v8 06/10 patch](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554406.html) | Path `Documentation/devicetree/bindings/npu/`. Requires 3 reg blocks (pc/cna/core), 4 clocks, IOMMU, power-domain, resets. |
|
||||||
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
|
| `pd_npu` power-domain in `rk3588-base.dtsi` | in-tree (6.18 pull) | [v8 07/10](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554409.html) | Label added so board files can wire a regulator instead of leaving the domain permanently on. |
|
||||||
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
|
| `npu@fdab/c/d_0000` nodes in `rk3588.dtsi` | in-tree (boltzmann's DT has all 3) | observed at `/sys/firmware/devicetree/base/npu@fdab0000` etc. | All `compatible = "rockchip,rk3588-rknn-core"`, **`status = "disabled"`** in the DT boltzmann is booting. |
|
||||||
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
|
| Per-board enable on Rock 5 ITX+ (boltzmann) | **not done** | n/a | Three `rknn-core` nodes ship disabled; need a board overlay or DTS edit to set `status = "okay"` before `rocket` can probe. |
|
||||||
| userspace uAPI consumer | _TBD_ | | |
|
| Userspace consumer (Mesa Rocket + Teflon) | merged in **Mesa 25.3** | [Tomeu blog "we are in mainline"](https://blog.tomeuvizoso.net/2025/07/rockchip-npu-update-6-we-are-in-mainline.html) · [Mesa Teflon docs](https://docs.mesa3d.org/teflon.html) | TFLite delegate; the authoritative reference for how jobs are encoded and submitted over the rocket uAPI. |
|
||||||
| Tomeu's WIP branch | _TBD_ | | |
|
| v8 series cover letter (final revision) | **merged via DRM-misc-next → 6.18** | [v8 cover](https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg554401.html) · [Patchew](https://patchew.org/linux/20250713-6-10-rocket-v8-0-64fa3115e910@tomeuvizoso.net/) · [LWN coverage](https://lwn.net/Articles/1029800/) | 10 patches. |
|
||||||
|
| BSP `rockchip-linux/rknpu2` (vendor) | out-of-tree, deprecated, superseded by `airockchip/rknn-toolkit2` | <https://github.com/rockchip-linux/rknpu2> · <https://github.com/rockchip-linux/kernel> | BSP `rknpu` driver under `drivers/rknpu/` in 5.10/6.1 BSP branches. Register-map partly readable from source; TRM fills gaps. Use as spec-extraction reference per `feedback_megabitchip_semantic_match`. |
|
||||||
|
| `airockchip/rknn-toolkit2` (vendor userspace) | latest **v2.3.2** (Apr 9 2025) | <https://github.com/airockchip/rknn-toolkit2/releases> | `librknnrt.so` is a closed pre-compiled aarch64 blob (~7 MB), restrictive Rockchip license. **Not usable.** `.rknn` model-format spec is the only interesting piece. |
|
||||||
|
| `airockchip/rknn-llm` (vendor LLM stack) | last release **v1.2.3** (Nov 2024) | <https://github.com/airockchip/rknn-llm/releases> | No release in >1 yr — looks stalled. Monitor only. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision: drive via the Rocket DRM-accel uAPI
|
||||||
|
|
||||||
|
The Phase-1 exit question — "accel uAPI vs own MMIO driver" — answers
|
||||||
|
itself with the upstream state we just inventoried:
|
||||||
|
|
||||||
|
- `rocket.ko` is already on disk on boltzmann
|
||||||
|
(`/lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko`,
|
||||||
|
`intree: Y`) and aliased to the exact `rockchip,rk3588-rknn-core`
|
||||||
|
compatible string the DT uses.
|
||||||
|
- The uAPI (`rocket_accel.h`) is stable enough that Mesa is shipping
|
||||||
|
production consumers against it.
|
||||||
|
- Writing our own MMIO driver would mean bypassing IOMMU integration,
|
||||||
|
reimplementing the (already-reviewed) `pd_npu` power-domain sequencing,
|
||||||
|
and owning forward-compatibility burden with zero upstream path.
|
||||||
|
|
||||||
|
**Unblock path:** patch the boltzmann board DTS (or apply an overlay) to
|
||||||
|
flip the three `npu@fdab/c/d_0000` nodes to `status = "okay"`, rebuild
|
||||||
|
the DTB, reboot, `modprobe rocket`, expect `/dev/accel/accel0..2` to
|
||||||
|
appear and corresponding `dmesg` probe lines. That's a Phase-2 task.
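
The check after reboot can be scripted. Below is a minimal sketch, assuming Python 3 is available on boltzmann; `decode_status` and `npu_report` are illustrative helper names, not part of any tool in this repo. It relies only on two facts stated in this audit: the three node names, and that a node with no `status` property defaults to "okay".

```python
from pathlib import Path
from typing import Optional

# Node names as observed in boltzmann's running device tree.
NPU_NODES = ("npu@fdab0000", "npu@fdac0000", "npu@fdad0000")


def decode_status(raw: Optional[bytes]) -> str:
    """Decode a DT `status` property; an absent property means "okay"."""
    if raw is None:
        return "okay"
    # Device-tree string properties are NUL-terminated.
    return raw.rstrip(b"\x00").decode("ascii")


def npu_report(dt_base: Path, accel_dir: Path) -> dict:
    """Collect per-node DT status plus whatever exists under /dev/accel."""
    nodes = {}
    for name in NPU_NODES:
        node = dt_base / name
        if not node.is_dir():
            nodes[name] = "missing"
            continue
        status_file = node / "status"
        raw = status_file.read_bytes() if status_file.exists() else None
        nodes[name] = decode_status(raw)
    accel = sorted(p.name for p in accel_dir.iterdir()) if accel_dir.is_dir() else []
    return {"nodes": nodes, "accel": accel}


if __name__ == "__main__":
    report = npu_report(Path("/sys/firmware/devicetree/base"), Path("/dev/accel"))
    for name, status in report["nodes"].items():
        print(f"{name}: {status}")
    print("accel devices:", report["accel"] or "none")
```

The script only reports state; it makes no claim about how many `accel` entries a healthy system should show, so the output still has to be read against `dmesg`.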
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Op-coverage gotcha — direct hit on Phase-2 scope
|
||||||
|
|
||||||
|
The agent flagged a structural risk worth surfacing before Phase 2:
|
||||||
|
|
||||||
|
> Op coverage is limited today. Convolutions + tensor add + fused ReLU
|
||||||
|
> covers MobileNet-class models but not a transformer attention block.
|
||||||
|
|
||||||
|
Rosenblatt's premise is offloading the heavy **GEMM** of attention + FFN
|
||||||
|
to the NPU. If `rocket`'s userspace path today is conv-centric (Teflon
|
||||||
|
demos MobileNetV1/V2 + MobileDet) and the kernel-side op set doesn't
|
||||||
|
yet expose a clean matmul primitive, we will need to either:
|
||||||
|
|
||||||
|
1. **Shoehorn matmul through conv-1×1** with appropriate tensor reshape
|
||||||
|
(legit on this silicon — RKNPU2 vendor stack does exactly this — but
|
||||||
|
it depends on what shapes the rocket uAPI lets us submit), or
|
||||||
|
2. **Land NPU matmul / additional op coverage upstream** ourselves (a
|
||||||
|
real contribution, but adds scope), or
|
||||||
|
3. **Drop to a thinner shim** that submits raw NPU command buffers
|
||||||
|
while still going through the rocket GEM + job-submit lifecycle (so
|
||||||
|
we keep IOMMU + power-domain correctness for free).
|
||||||
|
|
||||||
|
This is exactly the kind of question Phase 2 was scoped to answer
|
||||||
|
("smallest-possible op-set llama.cpp can offload"). The decision now
|
||||||
|
becomes "what does rocket's submit-job ioctl actually accept?" which we
|
||||||
|
answer by reading `drivers/accel/rocket/rocket_job.c` and the Mesa
|
||||||
|
Rocket Gallium driver.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## IOMMU v1.0 hazard — blocks naive Phase-2 unblock
|
||||||
|
|
||||||
|
Surfaced from a `linux-rockchip` thread of 2026-04-03 ("Re: [PATCH]
|
||||||
|
iommu/rockchip: fix page table allocation flags for v2 IOMMU", Midgy
|
||||||
|
BALON → Simon Xue / Jonas Karlman). The RK356x SoCs integrate **two
|
||||||
|
distinct IOMMU IP versions, v1.0 and v2.0**. The NPU and ISP use v1.0;
|
||||||
|
VOP2 uses v2.0. Both map 40-bit physical pages, but **v1.0 does not
|
||||||
|
support placing the first-level page table (DTE) above 4 GB**.
|
||||||
|
|
||||||
|
The current DT binding `rockchip,rk3568-iommu` does **not** distinguish
the two: both IP versions share one compat. On `boltzmann` all three
NPU IOMMUs list `rockchip,rk3588-iommu` first with
`rockchip,rk3568-iommu` as the fallback, so they will be driven by the
v2.0 code path:
|
||||||
|
|
||||||
|
```
/sys/firmware/devicetree/base/iommu@fdab9000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdaca000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: disabled
/sys/firmware/devicetree/base/iommu@fdada000 compatible: rockchip,rk3588-iommu rockchip,rk3568-iommu status: (none → okay default)
```
|
||||||
|
|
||||||
|
With `MemTotal = 32 GB` on boltzmann, two failure modes are reachable
|
||||||
|
the moment we enable the NPU:
|
||||||
|
|
||||||
|
1. **DTE allocated above `0x100000000`** → v1.0 hardware silently
|
||||||
|
truncates or errors. Page-walk fails, every NPU job faults.
|
||||||
|
2. **SWIOTLB bounce-buffer PTE poisoning**. With `DMA_BIT_MASK(32)` on
|
||||||
|
the NPU device, bounce buffers land below 4 GB,
|
||||||
|
`phys_to_virt(bounce_addr)` is used as the PTE write target, and the
|
||||||
|
following `dma_sync_single_for_device(DMA_TO_DEVICE)` overwrites the
|
||||||
|
live PTEs with zeros from the original buffer.
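
Whether a host is obviously exposed can be checked before enabling anything. This is a coarse sketch under stated assumptions: it only compares `MemTotal` against the 4 GB boundary (`0x100000000`) named above, and the function names are hypothetical. It does not inspect the physical memory map, so a `False` result is not proof of safety.

```python
FOUR_GIB = 1 << 32  # 0x100000000, the boundary named in failure mode 1


def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def dte_hazard_reachable(meminfo_text: str) -> bool:
    """True when RAM clearly extends past 4 GiB, so allocations may too."""
    return mem_total_bytes(meminfo_text) > FOUR_GIB
```

On boltzmann the input would be the contents of `/proc/meminfo`; after booting with `mem=4G` the same check should flip to `False`.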
|
||||||
|
|
||||||
|
**Planned upstream fix** (three pieces across two series, in flight per the thread):
|
||||||
|
|
||||||
|
1. Simon Xue, per-device-ops patch
|
||||||
|
— `<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`
|
||||||
|
2. Midgy BALON, "Add `rockchip,rk3568-iommu-v1` compatible for IOMMU
|
||||||
|
v1.0 blocks (NPU, ISP/VICAP) on RK3568" — adds ops with
|
||||||
|
`.gfp_flags = GFP_DMA32` and `.dma_bit_mask = DMA_BIT_MASK(40)`.
|
||||||
|
3. DT update: NPU + VICAP MMU nodes flipped to the new compat.
|
||||||
|
|
||||||
|
**Mainline state of the fix (verified via cgit, 2026-05-19):**
|
||||||
|
|
||||||
|
| Patch | Posted | Merged to torvalds master? |
|---|---|---|
| Simon Xue per-device-ops `[20260310105303.128859-1]` | 2026-03-10 | **no** (most recent rockchip-iommu commit is the 2025-06-27 dead-loop fix) |
| Midgy `-v1` discriminator (`[1/2]`) | not yet posted as of 2026-04-03 | **no** |
| DT update for `rknpu_mmu` / `vicap_mmu` (`[2/2]`) | not yet posted | **no** |
| Treewide context: `kmalloc_obj` / `alloc_obj GFP_KERNEL` defaults | 2026-02-21 | merged — possible aggravator of the original bug, may force Midgy's series to be rebased |
|
||||||
|
|
||||||
|
**Subtlety:** Midgy's earlier patch (the standalone one referenced as
|
||||||
|
`In-Reply-To: <20260331075010.1463-1-midgy971@gmail.com>`) modified
|
||||||
|
`iommu_data_ops_v2` directly and was withdrawn after Simon's analysis
|
||||||
|
— it would have over-constrained VOP2 and other v2 users. The
|
||||||
|
discriminator-compat approach is correct but not yet upstream.
|
||||||
|
|
||||||
|
**Implication for Rosenblatt Phase 2:** the README's Phase-2 unblock
|
||||||
|
plan ("enable DT nodes → `modprobe rocket` → `/dev/accel/accel0..2`")
|
||||||
|
will **almost certainly trigger one of the two failure modes** on
|
||||||
|
32 GB boltzmann. Three viable workarounds:
|
||||||
|
|
||||||
|
- **A. Boot with `mem=4G`** (degraded but valid Phase-2 path; lets us
|
||||||
|
validate the rocket bringup end-to-end without addressing the IOMMU
|
||||||
|
bug).
|
||||||
|
- **B. Carry Midgy's discriminator + DT changes as local patches** on
|
||||||
|
the marfrit kernel branch. Track upstream landing; rebase or drop.
|
||||||
|
- **C. Wait for upstream.** Given that the last visible activity was
  2026-04-03 and Midgy's discriminator series isn't yet posted, this
  could stall Rosenblatt arbitrarily. Not recommended.
|
||||||
|
|
||||||
|
Recommended: **start with (A) for Phase-2 bringup**, switch to (B) when
|
||||||
|
we want to use the full 32 GB and need to benchmark realistic working
|
||||||
|
sets. Track upstream as standing item.
|
||||||
|
|
||||||
|
References (Anubis-gated; fetch with a JS-capable browser or via
|
||||||
|
`lore.kernel.org/lkml/` mirrors):
|
||||||
|
- Original Midgy patch: `<https://lore.kernel.org/all/20260331075010.1463-1-midgy971@gmail.com/>`
|
||||||
|
- Simon Xue per-device-ops: `<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Vendor stack — spec source only, not a runtime
|
||||||
|
|
||||||
|
- `librknnrt.so`: proprietary aarch64 binary, restrictive Rockchip
|
||||||
|
license (no reverse-engineering, no redistribution). Do not link.
|
||||||
|
- `.rknn` model format: the one interface we may need to understand if
|
||||||
|
we want to consume pre-quantized vendor model weights. Otherwise we
|
||||||
|
quantize from GGUF ourselves.
|
||||||
|
- BSP `drivers/rknpu/` in `rockchip-linux/kernel`: usable as a
|
||||||
|
register-layout reference. Read; don't lift.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,257 @@
|
|||||||
|
# Op coverage — what the rocket NPU pipeline actually does
|
||||||
|
|
||||||
|
> **Phase-2 deliverable.** Captures the rocket uAPI surface, the NPU's
|
||||||
|
> sub-block topology, and the implication for how llama.cpp's matmul
|
||||||
|
> can be expressed on this silicon. Read alongside
|
||||||
|
> `npu-mainline-status.md`.
|
||||||
|
|
||||||
|
**Reference sources** (all in-tree as of mainline 2026-05-19):
|
||||||
|
|
||||||
|
- `include/uapi/drm/rocket_accel.h` — userspace ABI
|
||||||
|
- `drivers/accel/rocket/rocket_drv.c` — module init, of-match, ioctl table
|
||||||
|
- `drivers/accel/rocket/rocket_job.c` — `rocket_job_hw_submit()` is the
|
||||||
|
authoritative description of the per-task hardware sequence
|
||||||
|
- `drivers/accel/rocket/rocket_registers.h` — autogenerated from Mesa's
|
||||||
|
`src/gallium/drivers/rocket/registers.xml` (60 KB; **the regcmd
|
||||||
|
format is owned by Mesa, not the kernel**)
|
||||||
|
- `Documentation/accel/rocket/index.rst` — one-paragraph driver summary
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## uAPI: four ioctls, one device per fd
|
||||||
|
|
||||||
|
| Ioctl | Direction | Purpose |
|---|---|---|
| `DRM_ROCKET_CREATE_BO` | IOWR | Allocate a GEM buffer, get NPU DMA address + mmap offset |
| `DRM_ROCKET_PREP_BO` | IOW | Take CPU ownership; wait for NPU fences; cache-invalidate |
| `DRM_ROCKET_FINI_BO` | IOW | Release CPU ownership; cache-flush for NPU |
| `DRM_ROCKET_SUBMIT` | IOW | Submit an array of jobs (each = array of tasks) |
|
||||||
|
|
||||||
|
Submission unit hierarchy:
|
||||||
|
|
||||||
|
```
drm_rocket_submit
└── drm_rocket_job[]            (each tied to specific in/out BOs)
    └── drm_rocket_task[]       (executes sequentially on ONE core)
        ├── regcmd              DMA address of register-command buffer
        └── regcmd_count        number of 32-bit reg/value entries
```
|
||||||
|
|
||||||
|
Key kernel-side invariant from `rocket_job_hw_submit()`:
|
||||||
|
|
||||||
|
> "All tasks in the same job will be executed sequentially on the same
|
||||||
|
> core, to benefit from memory residency in SRAM."
|
||||||
|
|
||||||
|
Per-core SRAM is 2 MB. Bundling matmul tiles into one job buys weight
|
||||||
|
reuse across tasks for free — important Phase-2 perf knob.
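
To get a feel for which weight tiles could stay resident, here is a back-of-the-envelope sketch. It assumes the 2 MB per-core figure quoted above, one byte per INT8 weight, and (unrealistically) that the whole SRAM is available for weights, ignoring activations, outputs, and any hardware alignment rules. The helper names are illustrative.

```python
SRAM_BYTES = 2 * 1024 * 1024  # per-core SRAM figure quoted above


def int8_tile_bytes(k: int, n: int) -> int:
    """Footprint of a K x N INT8 weight tile, one byte per element."""
    return k * n


def fits_in_sram(k: int, n: int, budget: int = SRAM_BYTES) -> bool:
    """True if the weight tile alone fits within the assumed budget."""
    return int8_tile_bytes(k, n) <= budget


def max_n_for_k(k: int, budget: int = SRAM_BYTES) -> int:
    """Largest output width N whose K x N INT8 tile fits the budget."""
    return budget // k
```

Under these assumptions a 4096 × 4096 projection (16 MiB of INT8 weights) has to be split, for example into slices at most 512 columns wide for K = 4096.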
|
||||||
|
|
||||||
|
The driver exposes a **single `/dev/accel/accel0` facade** that
|
||||||
|
schedules across all probed RKNN cores. Three DT nodes →
|
||||||
|
`rdev->num_cores = 3`, but userspace sees one device.
|
||||||
|
|
||||||
|
## What "regcmd" actually is
|
||||||
|
|
||||||
|
The kernel's `rocket_job_hw_submit()` does **not** know what op runs.
|
||||||
|
For each task it:
|
||||||
|
|
||||||
|
1. Writes the task's `regcmd` DMA address to `PC_BASE_ADDRESS`.
|
||||||
|
2. Writes `PC_REGISTER_AMOUNTS = (regcmd_count + 1) / 2 - 1`.
|
||||||
|
3. Configures S_POINTER ping-pong (CNA + Core, with a per-core "extra
|
||||||
|
bit" `0x10000000 * core_index` lifted from BSP rknpu — undocumented
|
||||||
|
in TRM, kernel comments call it out).
|
||||||
|
4. Enables DPU_0/DPU_1 interrupts.
|
||||||
|
5. Pulls the trigger: `PC_OPERATION_ENABLE.OP_EN = 1`.
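
The two arithmetic expressions in steps 2 and 3 are easy to sanity-check in isolation. The sketch below transcribes them as written here; the function names are illustrative, and the guard against a non-positive count is an added assumption, not kernel behaviour.

```python
def pc_register_amounts(regcmd_count: int) -> int:
    """Value for PC_REGISTER_AMOUNTS: (regcmd_count + 1) / 2 - 1."""
    if regcmd_count < 1:
        # Assumption: an empty regcmd buffer is treated as a caller error.
        raise ValueError("regcmd_count must be at least 1")
    return (regcmd_count + 1) // 2 - 1  # integer division, as in C


def s_pointer_extra_bit(core_index: int) -> int:
    """Per-core extra bit folded into S_POINTER: 0x10000000 * core_index."""
    return 0x10000000 * core_index
```

For example, a 10-entry regcmd buffer yields 4, and cores 0, 1, 2 get extra bits `0x0`, `0x10000000`, `0x20000000`.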
|
||||||
|
|
||||||
|
The NPU's Program Controller then plays back the regcmd buffer —
|
||||||
|
**a sequence of {register address, register value} pairs the userspace
|
||||||
|
emitted**. The kernel is fundamentally a power/IOMMU/scheduler shim.
|
||||||
|
**All op coverage lives in userspace regcmd construction.** That's
|
||||||
|
why "what ops does rocket support?" is a Mesa question, not a kernel
|
||||||
|
question.
|
||||||
|
|
||||||
|
## NPU sub-blocks (from PC_INTERRUPT_MASK)
|
||||||
|
|
||||||
|
The interrupt-mask register names the discrete blocks the PC drives:
|
||||||
|
|
||||||
|
| Block | Channels | Likely role |
|---|---|---|
| **CNA** — Convolution Neural-net Accelerator | FEATURE_0/1, WEIGHT_0/1, CSC_0/1 | Loads features + weights, color-space convert (legacy ISP path) |
| **CORE** | 0/1 | MAC arrays — INT8 multiply-accumulate datapath |
| **DPU** — Data Processing Unit | 0/1 | Per-channel post-process (likely accumulation / type convert) |
| **PPU** — Post-Processing Unit | 0/1 | Activation / pooling / requantization |
| **PC** — Program Controller | — | Reads regcmd, sequences the pipeline |
| **DMA** | read/write error flags | Pulls features+weights, writes outputs through IOMMU |
|
||||||
|
|
||||||
|
Each per-core block has two parallel channels, the same hardware the
RKNPU2 vendor stack relies on for double-buffering input feature maps
and weights to keep the MAC arrays fed.
|
||||||
|
|
||||||
|
## Matmul on conv-shaped silicon — confirmed conv-only path
|
||||||
|
|
||||||
|
Skimming Mesa's `src/gallium/drivers/rocket/rkt_regcmd.c` (only ~700
|
||||||
|
lines, single conv emit path) **confirms**: the rocket hardware has no
|
||||||
|
matmul-shaped instruction. Every register in the emit path is named
|
||||||
|
`CNA_CONV_*`, `CNA_DATA_*`, `CNA_WEIGHT_*` — and the function that
|
||||||
|
builds a task's regcmd buffer is literally called `fill_first_regcmd`
|
||||||
|
with no alternate path for matmul. **Any matmul we run on this silicon
|
||||||
|
goes through the conv pipeline.**
|
||||||
|
|
||||||
|
### Regcmd encoding (from `rkt_regcmd.c::emit_raw`)
|
||||||
|
|
||||||
|
The userspace regcmd buffer is an array of **64-bit packed entries**:
|
||||||
|
|
||||||
|
```
packed = (target << 48) | (value << 16) | reg
```
|
||||||
|
|
||||||
|
- `target` — block destination ID (CNA, CORE, DPU, PC, …) + a `+0x1`
|
||||||
|
arming bit. `rkt_get_target(reg)` returns the block per register.
|
||||||
|
- `reg` — 16-bit register offset within the block.
|
||||||
|
- `value` — 32-bit register write value.
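
The packing rule can be exercised directly. This is a minimal sketch of the bit layout only: `pack_regcmd` and `unpack_regcmd` are illustrative names (Mesa's helper is `emit_raw`), and the register offset used in the usage note is a made-up placeholder, not a real rocket register.

```python
def pack_regcmd(target: int, reg: int, value: int) -> int:
    """Pack one regcmd entry as (target << 48) | (value << 16) | reg."""
    if not 0 <= target <= 0xFFFF:
        raise ValueError("target must fit in 16 bits")
    if not 0 <= reg <= 0xFFFF:
        raise ValueError("reg must fit in 16 bits")
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return (target << 48) | (value << 16) | reg


def unpack_regcmd(packed: int) -> tuple:
    """Split a 64-bit entry back into (target, reg, value)."""
    target = (packed >> 48) & 0xFFFF
    value = (packed >> 16) & 0xFFFFFFFF
    reg = packed & 0xFFFF
    return target, reg, value
```

For instance, `pack_regcmd(0x81, 0x0008, 0x1D)` gives `0x00810000001D0008`, and unpacking it returns the original triple. A whole buffer of such entries could then be serialised as little-endian 64-bit words, which is an assumption about byte order that should be confirmed against Mesa before use.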
|
||||||
|
|
||||||
|
Each task's regcmd ends with the TRM-mandated trigger sequence:
|
||||||
|
|
||||||
|
```
0x0041000000000000 /* TRM: must precede op_en */
emit_raw(regs, 0x81, REG_PC_OPERATION_ENABLE,
         PC_OPERATION_ENABLE_RESERVED_0(14) | PC_OPERATION_ENABLE_OP_EN(1));
```
|
||||||
|
|
||||||
|
That trailing PC.OPERATION_ENABLE write is what makes the NPU pipeline
|
||||||
|
start consuming the regcmd above it.
|
||||||
|
|
||||||
|
### Matmul-as-conv-1×1 — concrete mapping
|
||||||
|
|
||||||
|
For an INT8 matmul `Y[M, N] = X[M, K] @ W[K, N]` (the shape attention
|
||||||
|
and FFN projections take):
|
||||||
|
|
||||||
|
| Conv axis | Matmul mapping | Register setting |
|---|---|---|
| `DATAIN_WIDTH` | 1 | `CNA_DATA_SIZE0.DATAIN_WIDTH = 1` |
| `DATAIN_HEIGHT` | M (rows) | `CNA_DATA_SIZE0.DATAIN_HEIGHT = M` |
| `DATAIN_CHANNEL` | K (input dim) | `CNA_DATA_SIZE1.DATAIN_CHANNEL = K`, `DATAIN_CHANNEL_REAL = K-1` |
| `WEIGHT_WIDTH/HEIGHT` | 1, 1 | `CNA_WEIGHT_SIZE2.{WIDTH, HEIGHT} = 1` |
| `WEIGHT_KERNELS` | N (output dim) | `CNA_WEIGHT_SIZE2.WEIGHT_KERNELS = N` |
| `CONV_X/Y_STRIDE` | 1, 1 | `CNA_CONV_CON3.{CONV_X_STRIDE, CONV_Y_STRIDE} = 1` |
| `PAD_*` | 0, 0, 0, 0 | `CNA_PAD_CON0.{PAD_LEFT, PAD_TOP} = 0` |
| `DATAOUT_WIDTH` | 1 | `CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH = 0` (encoded as W-1) |
| `DATAOUT_HEIGHT` | M | `CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT = M-1` |
| `DATAOUT_CHANNEL` | N | `CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL = N-1` |
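
The mapping in the table can be written down as a function of `(M, K, N)`, together with a pure-Python check that a 1×1 convolution over an `M × 1 × K` feature map reproduces the matmul. This is a sketch: the dictionary keys are descriptive labels echoing the table, not verified macro names from `rocket_registers.h`, and the reference convolution says nothing about how the hardware lays out data in memory.

```python
def conv1x1_fields(m: int, k: int, n: int) -> dict:
    """Field values from the table for Y[M, N] = X[M, K] @ W[K, N]."""
    return {
        "CNA_DATA_SIZE0.DATAIN_WIDTH": 1,
        "CNA_DATA_SIZE0.DATAIN_HEIGHT": m,
        "CNA_DATA_SIZE1.DATAIN_CHANNEL": k,
        "CNA_DATA_SIZE1.DATAIN_CHANNEL_REAL": k - 1,
        "CNA_WEIGHT_SIZE2.WEIGHT_WIDTH": 1,
        "CNA_WEIGHT_SIZE2.WEIGHT_HEIGHT": 1,
        "CNA_WEIGHT_SIZE2.WEIGHT_KERNELS": n,
        "CNA_CONV_CON3.CONV_X_STRIDE": 1,
        "CNA_CONV_CON3.CONV_Y_STRIDE": 1,
        "CNA_PAD_CON0.PAD_LEFT": 0,
        "CNA_PAD_CON0.PAD_TOP": 0,
        "CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH": 0,       # encoded as W-1
        "CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT": m - 1,  # encoded as H-1
        "CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL": n - 1, # encoded as C-1
    }


def matmul(x, w):
    """Reference Y = X @ W on nested lists: x is M x K, w is K x N."""
    return [[sum(x[i][c] * w[c][j] for c in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]


def conv1x1(feature, kernels):
    """1x1, stride-1, unpadded conv: feature[h][w][c], kernels[n][c]."""
    return [[[sum(px[c] * kern[c] for c in range(len(kern))) for kern in kernels]
             for px in row] for row in feature]


def matmul_via_conv1x1(x, w):
    """Reshape X to H=M, W=1, C=K and W to N kernels of 1x1xK, then convolve."""
    feature = [[row] for row in x]
    kernels = [list(col) for col in zip(*w)]
    out = conv1x1(feature, kernels)  # shape M x 1 x N
    return [row[0] for row in out]
```

With `x = [[1, 2], [3, 4]]` and `w = [[5, 6, 7], [8, 9, 10]]`, both `matmul(x, w)` and `matmul_via_conv1x1(x, w)` return `[[21, 24, 27], [47, 54, 61]]`.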
|
||||||
|
|
||||||
|
This is exactly what the RKNPU2 vendor stack does — its public op
|
||||||
|
list has `Conv2D` and not `MatMul`. We're aligned with the silicon's
|
||||||
|
intended op surface, just from a clean uAPI.
|
||||||
|
|
||||||
|
### Quantization handoff
|
||||||
|
|
||||||
|
`rkt_regcmd.c` has a clean (non-add-tensor) requantization path:
|
||||||
|
|
||||||
|
```
conv_scale = (input_scale * weights_scale) / output_scale;
shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
scale = ((scale_bits >> 9) & 0x7fff) + 1;
```
|
||||||
|
|
||||||
|
Mirrors PyTorch QNNPACK requantization. For our **Q4_K_M weights** the
|
||||||
|
dequant-to-INT8 step is on CPU (NEON): walk the 32-element groups in
|
||||||
|
the block, scale per-group, write an INT8 weight tile of `K×N`. NPU
|
||||||
|
gets:
|
||||||
|
|
||||||
|
- INT8 activation tensor `X` with per-tensor `input_scale`,
|
||||||
|
`input_zero_point`
|
||||||
|
- INT8 weight tile `W` with per-tensor `weights_scale` (after we fold
|
||||||
|
per-group into per-channel — the Q4_K_M scaling is finer-grained
|
||||||
|
than what this conv requantization expects)
|
||||||
|
- One `output_scale`, `output_zero_point` per matmul
|
||||||
|
|
||||||
|
The Q4_K_M → conv-quant fold is the **non-trivial CPU work** that
sits between Q4_K_M GGUF and a regcmd. We need to choose between:
|
||||||
|
|
||||||
|
(a) per-output-channel scales (NPU-friendly, lossy vs Q4_K_M)
|
||||||
|
(b) per-block scales applied at runtime via multiple smaller matmul
|
||||||
|
submits with different `output_scale` each (CPU overhead per
|
||||||
|
submit but mathematically faithful to Q4_K_M)
|
||||||
|
|
||||||
|
Phase-3 question. (a) is likely good enough for "credible" tok/s; (b)
|
||||||
|
preserves Q4_K_M accuracy but tanks throughput.
|
||||||
|
|
||||||
|
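To make option (a) concrete, here is a minimal sketch of symmetric per-output-channel INT8 quantization. It assumes the Q4_K_M blocks have already been dequantized to floats on the CPU, and it uses a plain max-abs scale per column, which is our illustrative choice and not something the Mesa path prescribes. `quantize_per_channel` is a hypothetical name.

```python
def quantize_per_channel(weights):
    """Quantize a K x N float weight tile to INT8 with one scale per column.

    weights: list of K rows, each a list of N floats (already dequantized).
    returns: (tile, scales) where tile[i][j] * scales[j] approximates weights[i][j].
    """
    k, n = len(weights), len(weights[0])
    scales = []
    for j in range(n):
        peak = max(abs(weights[i][j]) for i in range(k))
        # An all-zero column gets a dummy scale so we never divide by zero.
        scales.append(peak / 127.0 if peak > 0.0 else 1.0)
    tile = [[max(-127, min(127, round(weights[i][j] / scales[j])))
             for j in range(n)]
            for i in range(k)]
    return tile, scales
```

The loss relative to Q4_K_M comes from collapsing every 32-element group scale in a column into that single per-column value; measuring that error on real tensors is part of the Phase-3 question.
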
### What rocket today does NOT exercise (matmul-relevant blocks)

The Mesa emit path uses **CNA + CORE + DPU**. It does **not** touch:

- **PPU** — Post-Processing Unit. Per `rocket_registers.h` interrupt
  flags, PPU has 2 channels. Likely activation / pooling. For pure
  matmul we don't need it.
- **CSC** — color-space conversion submodule of CNA. Vestigial for
  ML; bypassed in non-image paths.

That's fine — we won't need to wake those for transformer matmul.

What we need to learn from Mesa's regcmd encoding (Phase-2 follow-up):

1. **Max tile shape per CNA invocation** — what (K × N) tile does the
   hardware natively prefer? Pulls from `rocket_registers.h` CNA_*
   geometry registers.
2. **INT8 quantization parameter layout** — per-channel scales live
   where in the regcmd / weight blob?
3. **Output type** — INT32 accumulator → INT8 quant? FP16 output? llama.cpp's
   ggml graph needs to know what comes back to NEON.
4. **CSC block usage** — color-space conversion is meaningless for LLM,
   but the FEATURE pipeline may force going through it. If so, it's an
   identity transform we configure once.
5. **Per-core split for transformer attention** — three cores × two
   channels each = 6 parallel matmul lanes. Worth the orchestration
   complexity only if one core can't saturate the path on its own.

## Op coverage gap inventory

Looking at what a transformer decode step needs vs. what the rocket
pipeline can plausibly do today (per-op confidence in the last column):

| llama.cpp op | NPU target | Confidence |
|---|---|---|
| INT8 matmul (Q/K/V projections, FFN) | CNA conv-1×1 | **medium-high** — Mesa already does conv in INT8 |
| INT8 matmul with INT4 weights (Q4_K_M) | dequant on CPU, INT8 matmul on NPU | medium — Q4_K_M's per-group scales add a dequant pass; first cut keeps it on NEON |
| RoPE (rotary position embeddings) | CPU (NEON) | n/a — wrong op shape for CNA |
| Softmax (attention) | CPU (NEON) | n/a — needs exp/sum, no direct support |
| RMSNorm | CPU (NEON) | n/a — sqrt + element-wise mul |
| Embedding lookup | CPU | n/a — gather, no NPU support |
| Sampling (top-k, top-p, multinomial) | CPU | n/a |
| Residual add + activation | DPU/PPU? unknown | low — needs Mesa-side check |

The Phase-4 baseline run will tell us how much wall-time matmul
actually owns; if it's >70% on CPU baseline, then offloading just
matmul (the medium-high row) gets us most of the speedup.

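The 70% threshold can be put in numbers with Amdahl's law. This is a back-of-envelope sketch, assuming the non-matmul ops stay on the CPU at unchanged cost and ignoring submit and DMA overhead; the fractions and speedups in the loop are illustrative inputs, not measurements.

```python
def overall_speedup(matmul_fraction, matmul_speedup):
    """Amdahl's law: end-to-end speedup when only the matmul share is accelerated.

    matmul_fraction: share of baseline wall time spent in matmul (0..1).
    matmul_speedup:  how much faster the offloaded matmul runs (> 0).
    """
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


if __name__ == "__main__":
    for fraction in (0.5, 0.7, 0.9):
        for speedup in (2.0, 5.0, float("inf")):
            print(f"matmul={fraction:.0%} npu={speedup}x -> "
                  f"{overall_speedup(fraction, speedup):.2f}x overall")
```

At a 70% matmul share the ceiling is 1 / 0.3, roughly 3.3×, however fast the NPU is, which is why the baseline measurement comes before any tuning.
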
## Open questions — kicked to Phase-3

1. **regcmd format spec.** Pull `registers.xml` from Mesa and trace one
   conv invocation end-to-end in `src/gallium/drivers/rocket/`.
2. **Memory budget per task.** 2 MB SRAM per core × 3 cores = 6 MB
   working set. qwen-1.5B Q4_K_M has ~750 MB of weights at a nominal
   4 bits/weight (1.5 B parameters × 0.5 bytes). Need to chunk by
   FFN-hidden / attention-head and stream.
3. **DMA address space layout.** Per-fd IOMMU domain (per
   `rocket_drv.c::rocket_open`), so we can keep weights resident across
   submits within one llama.cpp process. Cross-process sharing would
   need dmabuf import.
4. **Power-domain bring-up time.** `pm_runtime_get_sync(core->dev)` per
   job (see `rocket_job_run`) — if the autosuspend timeout fires
   between decode steps, we pay the wake cost each token. Worth
   measuring before fighting it.
5. **Job-timeout = 500 ms hard-coded** in `rocket_job.c::JOB_TIMEOUT_MS`.
   Long matmul tiles must stay under that. Cap tile size accordingly
   or split into smaller tasks.

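For open question 2, a rough sizing sketch. It assumes, without confirmation from the register spec, that an INT8 weight chunk of `K × cols` bytes has to fit inside some share of one core's 2 MiB SRAM, and it splits the `N` output columns accordingly. The half-SRAM budget and the shapes in the usage line are illustrative guesses, and `column_chunks` is a hypothetical helper.

```python
import math

SRAM_PER_CORE_BYTES = 2 * 1024 * 1024  # 2 MiB per core, per the fleet manifest


def column_chunks(k, n, weight_budget_bytes=SRAM_PER_CORE_BYTES // 2):
    """Split a K x N INT8 weight matrix into column chunks that fit the budget.

    Returns (columns_per_chunk, number_of_chunks). One byte per weight.
    """
    cols_per_chunk = min(weight_budget_bytes // k, n)
    if cols_per_chunk == 0:
        raise ValueError("a single K-long column does not fit; K must be split too")
    return cols_per_chunk, math.ceil(n / cols_per_chunk)


if __name__ == "__main__":
    # Illustrative FFN-like shape; real shapes come from the model's GGUF metadata.
    print(column_chunks(1536, 8960))
```

Each chunk would become its own task or submit, so the chunk count also feeds into the per-job timeout in question 5.
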
## Recommendation for Phase-3 entry

When Phase-2 closes, Phase-3 ("Analyze" — read RKNPU2 SDK + Tomeu's
uAPI + BSP rockchip-npu for the spec) becomes mostly:

- Read `src/gallium/drivers/rocket/` end-to-end in Mesa main.
- Pick one conv invocation from a known model (Teflon MobileNetV1
  first layer is the simplest), trace it down to the regcmd buffer
  bytes, and document the field layout for matmul reconstruction.
- Read the BSP `drivers/rknpu/` for any quant-scale plumbing the Mesa
  side doesn't expose (especially for INT4 → INT8 dequant alignment).
- The vendor-blob `librknnrt.so` is **off-limits** (closed license);
  do not link, do not extract symbols.
+111
@@ -0,0 +1,111 @@
# phases — Rosenblatt per-phase log

One entry per phase as it closes. Stores the *findings* (what we
learned that future-us shouldn't have to rediscover) and the *next
gate* (what Phase N+1 needs from us). Lives alongside the
campaign-level README; this file is the durable journal.

---

## Phase 0 — bootstrap

**Closed:** 2026-05-19
**Deliverable:** repo scaffold (README, TODO, docs/, kernel/, userspace/,
fleet/, benchmarks/), one initial commit `Rosenblatt: project scaffold
for RK3588 NPU on mainline`.

---

## Phase 1 — substrate audit

**Closed:** 2026-05-19
**Deliverable:** `docs/npu-mainline-status.md` table fully populated;
`fleet/boltzmann.yaml` kernel/NPU-driver block filled with live data;
clear answer to the "accel uAPI vs. own MMIO driver" question.

### Findings

1. **`rocket` driver is in mainline** — Tomeu Vizoso's NPU work merged
   to torvalds in Linux 6.18 (Nov 2025) as `drivers/accel/rocket/`,
   Kconfig `DRM_ACCEL_ROCKET`. Driver author + history kept the same
   shape, but the **upstream name is `rocket`, not `rknpu`** —
   searching for `rknpu` in mainline misses everything.
2. **Boltzmann already ships the driver** — `linux-rk3588-marfrit-A1`
   (7.0.0-rc3-ARCH+) is post-6.18 and contains `rocket.ko` at
   `/lib/modules/.../drivers/accel/rocket/rocket.ko`, marked
   `intree: Y`. Module aliases to `rockchip,rk3588-rknn-core`,
   matching the DT compatibles on the box.
3. **DT nodes present but disabled** — `rk3588-base.dtsi` defines
   `rknn_core_0/1/2` at `0xfdab/c/d_0000` with compat
   `rockchip,rk3588-rknn-core`; all three boot with `status =
   "disabled"`. No board file enables them. Per-core IOMMUs
   `rknn_mmu_0/1/2` at `0xfdab9/aca/ada_000` also disabled.
4. **Userspace is Mesa Rocket Gallium + Teflon** — shipped in Mesa
   25.3. `src/gallium/drivers/rocket/` in mesa3d main is the
   authoritative reference for regcmd construction. `rkt_regcmd.c`
   is ~700 lines, single-conv emit path. No matmul-specific code.
5. **Kernel is a thin shim** — `drivers/accel/rocket/rocket_drv.c`
   exposes a single facade `/dev/accel/accel0` for all probed RKNN
   cores. uAPI is 4 ioctls (CREATE_BO, SUBMIT, PREP_BO, FINI_BO).
   `rocket_job_hw_submit()` powers the NPU, points
   `PC_BASE_ADDRESS` at the regcmd, sets `PC_REGISTER_AMOUNTS`,
   and sets `PC_OPERATION_ENABLE.OP_EN = 1`. Everything else is the
   userspace-built regcmd buffer.
6. **NPU sub-blocks** identified from `rocket_registers.h` interrupt
   masks: **PC** (program controller), **CNA** (conv neural-net
   accel; FEATURE / WEIGHT / CSC channels), **CORE** (MAC array),
   **DPU** (data processing unit, requant), **PPU**
   (post-processing — not exercised by Mesa today). Each per-core
   block has 2 parallel channels for double-buffering.
7. **Matmul-as-conv-1×1 is the only viable path** — confirmed by
   reading Mesa's emit path. INT8 matmul `Y[M,N] = X[M,K] @ W[K,N]`
   maps cleanly to a conv with width=height=1 kernel,
   `DATAIN_CHANNEL=K`, `WEIGHT_KERNELS=N`. The vendor RKNPU2 stack
   does the same shoehorn.
8. **IOMMU v1.0 hazard surfaced from `linux-rockchip` thread**
   (Midgy BALON / Simon Xue, 2026-04-03). The NPU IOMMU is v1.0 IP
   bound to generic `rockchip,rk3568-iommu` — driven via the v2.0
   code path. v1.0 can't allocate its DTE above 4 GB. Boltzmann
   has 32 GB. Naive enable will silently fault. Discriminator-compat
   patch series planned but **not landed in mainline master as of
   2026-05-19** (verified via cgit on
   `drivers/iommu/rockchip-iommu.c`).
9. **Vendor stack is off-limits** — `librknnrt.so` is a closed
   binary blob under restrictive Rockchip license. BSP
   `rockchip-linux/kernel` `drivers/rknpu/` source is permitted as
   a spec-extraction reference only.

### Phase exit decision

**Drive via the `rocket` DRM-accel uAPI.** Writing our own MMIO
driver would mean re-implementing IOMMU integration, power-domain
sequencing, and fence/sched plumbing that's already in-tree and
production-validated by Mesa Teflon consumers. The Phase-2 unblock
list is short: DT enable + IOMMU mitigation + `modprobe rocket`.

### Phase outflow → TODO

Captured in `TODO.md` "Phase-2 unblock" section. Highlights:

- Apply `kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso` (or
  equivalent board-DTS patch) to boltzmann.
- Mitigate IOMMU v1.0 hazard before first NPU job: `mem=4G` boot or
  local discriminator-compat carry.
- `modprobe rocket`, confirm `/dev/accel/accel0`, no IOMMU faults.
- Read `rkt_regcmd.c`, `rkt_ml.c`, `rkt_task.c`, `rkt_coefs.c` from
  Mesa for the conv-1×1 matmul encoding details (op-coverage.md
  has the first cut).

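Before the first `modprobe rocket`, a small pre-flight check can flag the IOMMU hazard. This sketch only looks at visible RAM and the `mem=` boot parameter; it cannot tell whether a discriminator-compat carry is present in the running kernel, so a `True` result means "check by hand". The helper names are hypothetical, and the inputs are the text contents of `/proc/meminfo` and `/proc/cmdline`.

```python
import re

FOUR_GIB_KIB = 4 * 1024 * 1024  # 4 GiB expressed in KiB, the unit /proc/meminfo uses


def mem_total_kib(meminfo_text):
    """Extract MemTotal (in kB) from the contents of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def mem_cap_param(cmdline_text):
    """Return the value of a `mem=` kernel parameter, or None if absent."""
    for token in cmdline_text.split():
        if token.startswith("mem="):
            return token[len("mem="):]
    return None


def iommu_hazard_likely(meminfo_text, cmdline_text):
    """True when more than 4 GiB is visible and no `mem=` cap is on the command line."""
    over_4g = mem_total_kib(meminfo_text) > FOUR_GIB_KIB
    return over_4g and mem_cap_param(cmdline_text) is None
```

On the target this would be fed `open("/proc/meminfo").read()` and `open("/proc/cmdline").read()`; keeping the parsing separate from the file reads makes it easy to test off-box.
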
### Memory persisted

- `project_rosenblatt_overview.md`
- `project_rocket_upstream_state.md` (note: name is `rocket`, not
  `rknpu`)
- `project_iommu_v1_hazard.md`

---

## Phase 2 — formulate (open)

**Status:** open as of 2026-05-19. See `TODO.md` and
`docs/op-coverage.md` for current state of the formulation.
+25
-4
@@ -21,11 +21,32 @@ hardware:
|
|||||||
local_sram_per_core_mib: 2
|
local_sram_per_core_mib: 2
|
||||||
power_domain: pd_npu
|
power_domain: pd_npu
|
||||||
|
|
||||||
# Phase-1 audit fills these (pending boltzmann inspection)
|
# Phase-1 audit (captured 2026-05-19)
|
||||||
kernel:
|
kernel:
|
||||||
running_version: TBD # uname -r snapshot at audit time
|
running_version: 7.0.0-rc3-ARCH+ # `uname -r`; built Apr 29 2026, custom marfrit-A1
|
||||||
source: TBD # mainline torvalds / mmind-rockchip / custom
|
image: linux-rk3588-marfrit-A1 # from BOOT_IMAGE in /proc/cmdline
|
||||||
npu_driver: TBD # vendor rockchip-npu / mainline rknpu / none
|
source: marfrit/neutron (mainline-rc3 base, RK3588 patches)
|
||||||
|
npu_driver:
|
||||||
|
name: rocket # drivers/accel/rocket, author Tomeu Vizoso
|
||||||
|
state: built-as-module-not-loaded
|
||||||
|
module_path: /lib/modules/7.0.0-rc3-ARCH+/kernel/drivers/accel/rocket/rocket.ko
|
||||||
|
config:
|
||||||
|
CONFIG_DRM_ACCEL: 'y'
|
||||||
|
CONFIG_DRM_ACCEL_ROCKET: m
|
||||||
|
binds_compatible: rockchip,rk3588-rknn-core
|
||||||
|
depends: [gpu-sched, drm_shmem_helper]
|
||||||
|
dt_npu_nodes:
|
||||||
|
- addr: 0xfdab0000
|
||||||
|
compatible: rockchip,rk3588-rknn-core
|
||||||
|
status: disabled
|
||||||
|
- addr: 0xfdac0000
|
||||||
|
compatible: rockchip,rk3588-rknn-core
|
||||||
|
status: disabled
|
||||||
|
- addr: 0xfdad0000
|
||||||
|
compatible: rockchip,rk3588-rknn-core
|
||||||
|
status: disabled
|
||||||
|
dev_accel: /dev/accel # not present — major 261 registered in /proc/devices but no device bound
|
||||||
|
dmesg_npu_lines: 0 # zero npu/accel/rknpu lines since boot
|
||||||
|
|
||||||
userspace:
|
userspace:
|
||||||
rknn_vendor_runtime_installed: false # commitment: stay mainline-clean
|
rknn_vendor_runtime_installed: false # commitment: stay mainline-clean
|
||||||
|
|||||||
+25
-17
@@ -1,24 +1,32 @@
 # kernel/

-Mainline-bound kernel patches: DT bindings, rknpu driver tweaks,
-power-domain wiring for boltzmann's `rk3588-rock-5-itx-plus.dts`.
+Phase-1 audit closed the "do we need a driver?" question: Tomeu
+Vizoso's `rocket` driver is in mainline since Linux 6.18, and
+boltzmann's `linux-rk3588-marfrit-A1` build (7.0.0-rc3-ARCH+)
+already ships it as a module. See `docs/npu-mainline-status.md` for
+the full audit.

-Empty until Phase 1 audit identifies what's actually missing in mainline.
+Current contents:

-If Tomeu Vizoso's rknpu series is far enough along to use as-is, this
-directory may stay nearly empty — we'd just carry a small DT-overlay
-patch for boltzmann's board file.
-
-If we end up writing a driver from scratch (worst case), the structure
-will mirror an upstream submission layout:
 ```
-0001-dt-bindings-npu-add-rockchip-rk3588-npu.patch
-0002-arm64-dts-rockchip-rk3588-add-npu-node.patch
-0003-arm64-dts-rockchip-rock-5-itx-plus-enable-npu.patch
-0004-accel-rknpu-add-rockchip-rk3588-driver.patch
-...
+kernel/
+└── dt-overlays/
+    └── rk3588-rosenblatt-npu-enable.dtso
+        # Flips rknn_core_0/1/2 + rknn_mmu_0/1/2 to status="okay".
+        # Do NOT apply on >4 GB-RAM hosts without IOMMU v1.0
+        # mitigation (mem=4G or local discriminator-compat patches).
 ```

-Tracking-wise these go through `marfrit/kernel-agent` as scope
-`patches/driver/accel/rknpu/`, with `fleet/boltzmann-rosenblatt.yaml`
-as the consuming manifest.
+Anticipated next additions:
+
+- `iommu/0001-iommu-rockchip-add-rk3568-iommu-v1-compatible.patch`
+  — local carry of Midgy BALON's discriminator-compat patch when it
+  appears upstream, plus the matching DT update for `rknn_mmu_*`.
+  Drop when the upstream series lands.
+- Patches against marfrit's `linux-rk3588-marfrit-*` branch only if
+  we need fixes beyond what mainline rocket provides. Kept minimal
+  per the "mainline-clean from day 1" rule.
+
+Patch handling routes through the `kernel-agent` subagent (per
+`feedback_agent_routing.md`); board / host wiring lives in
+`fleet/boltzmann.yaml`.
@@ -0,0 +1,54 @@
// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
/*
 * Rosenblatt: enable all three RK3588 RKNN cores + their IOMMUs.
 *
 * Apply on top of any rk3588 board DT that uses the mainline
 * rk3588-base.dtsi labels. Verified against boltzmann
 * (model "Radxa ROCK 5 ITX", compatible "radxa,rock-5-itx").
 *
 * Pre-conditions before applying on a system with >4 GB RAM
 * (boltzmann has 32 GB):
 *   - The IOMMU v1.0 hazard MUST be mitigated first.
 *     See docs/npu-mainline-status.md "IOMMU v1.0 hazard".
 *     Either boot with `mem=4G`, OR carry the discriminator-compat
 *     patch series (Simon Xue per-device-ops +
 *     Midgy BALON `rockchip,rk3568-iommu-v1`).
 *
 * After applying:
 *   - `modprobe rocket`
 *   - expect `/dev/accel/accel0` (single facade, schedules across cores)
 *   - dmesg: rocket-driven probe per `npu@fdab0000`, `..fdac0000`,
 *     `..fdad0000`
 *
 * Keep the assigned-clock-rate at the rk3588-base.dtsi default of
 * 200 MHz for first-bringup. Lift to 1 GHz (`SCMI_CLK_NPU` max) only
 * after the rocket bringup, IOMMU, and a baseline matmul are
 * verified stable.
 */

/dts-v1/;
/plugin/;

&rknn_core_0 {
	status = "okay";
};

&rknn_mmu_0 {
	status = "okay";
};

&rknn_core_1 {
	status = "okay";
};

&rknn_mmu_1 {
	status = "okay";
};

&rknn_core_2 {
	status = "okay";
};

&rknn_mmu_2 {
	status = "okay";
};