marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory
Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.
2026-05-19 12:41:31 +00:00


TODO — Rosenblatt

Rolling punch-list. Older items at bottom (move done → DONE.md when noisy).


Phase 1 — substrate audit

  • On boltzmann: uname -r → record in fleet/boltzmann.yaml:kernel.running_version
  • find / -path '*accel*' -name '*.ko' 2>/dev/null — check if accel framework is built
  • ls /dev/accel/ /dev/dri/ — what's exposed?
  • lsmod | grep -iE 'rknpu|accel' — what's loaded?
  • dmesg | grep -iE 'rknpu|npu|accel' since boot — driver bringup log
  • Tomeu's rknpu series — find on lore.kernel.org/dri-devel, capture latest patch-set version + state (merged / in-review / dropped) → fill table in docs/npu-mainline-status.md
  • Check drivers/accel/ in current torvalds tree — list in-tree accelerators, confirm rknpu's mainline state
  • Check DT bindings: Documentation/devicetree/bindings/npu/rockchip,*.yaml
  • Inspect arch/arm64/boot/dts/rockchip/rk3588.dtsi for npu node
  • If a userspace shim exists (rkneural?), capture repo URL + try hello-world against the running kernel
  • Spec-extract from BSP vendor rockchip-npu source — register map, DMA descriptor format, irq handling. No code lift; spec only.

Phase exit criteria: docs/npu-mainline-status.md table fully populated; clear answer to "do we drive via accel uAPI or write our own MMIO driver."
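
The probes in the list above can be gathered into one report. This is a minimal sketch, not the project's actual tooling: the function names `module_state` and `phase1_audit` are hypothetical, and the `*.ko*` glob is an assumption that modules may be compressed on disk (e.g. `.ko.zst`), which a plain `*.ko` pattern would miss.

```shell
#!/usr/bin/env bash
# Sketch: collect the Phase-1 substrate facts into one text report.

# Print "loaded" or "absent" for module $1, reading lsmod-style text on stdin.
# The first line is skipped because lsmod prints a header row.
module_state() {
    if awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'; then
        echo loaded
    else
        echo absent
    fi
}

# Gather the audit facts; every probe tolerates failure so the report is complete.
phase1_audit() {
    echo "kernel: $(uname -r)"
    echo "accel nodes: $(ls /dev/accel/ 2>/dev/null | tr '\n' ' ')"
    echo "dri nodes: $(ls /dev/dri/ 2>/dev/null | tr '\n' ' ')"
    local m
    for m in rocket rknpu; do
        echo "module $m: $(lsmod | module_state "$m")"
    done
    echo "accel modules on disk:"
    find "/lib/modules/$(uname -r)" -path '*accel*' -name '*.ko*' 2>/dev/null
}
```

Running `phase1_audit > audit.txt` on boltzmann would give one file to transcribe into fleet/boltzmann.yaml; the output file name is only an example.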


Phase 2 — formulate

  • List llama.cpp ops by wallclock %, profiling qwen-1.5B Q4_K_M on CPU (use llama.cpp's built-in perf-timer or perf record)
  • Pick the exact INT8 matmul tile size the NPU prefers (read from BSP source)
  • Spec out the smallest backend interface: which ops we MUST handle, which the framework falls back to CPU
  • Write docs/op-coverage.md
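
Ranking ops by wallclock share is simple arithmetic once per-op timings exist. The sketch below assumes a hypothetical input format of one `OP_NAME microseconds` pair per line; how those timings are extracted (llama.cpp's perf timer or `perf record`) is left open, as in the list above. The function name `op_share` is illustrative.

```shell
#!/usr/bin/env bash
# Read "OP_NAME microseconds" lines on stdin (assumed format) and print each
# op's percentage of the total time, largest share first.
op_share() {
    LC_ALL=C awk '{ t[$1] += $2; total += $2 }
        END {
            if (total > 0)
                for (op in t) printf "%s %.1f\n", op, 100 * t[op] / total
        }' | LC_ALL=C sort -k2,2nr
}
```

For example, `printf 'MUL_MAT 900\nSOFT_MAX 100\n' | op_share` prints `MUL_MAT 90.0` followed by `SOFT_MAX 10.0`. Ops at the top of that list are the candidates the backend MUST handle; the tail can fall back to CPU.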

Phase 3 — analyze

  • RKNPU2 SDK: trace through librknnrt.so user-API → kernel ioctl shapes (objdump + strings, no actual reverse-engineering of vendor blob — just the syscall surface)
  • Tomeu's accel uAPI: read driver source, understand:
    • submit-job ioctl shape
    • dmabuf import path
    • fence-wait mechanism
    • error reporting
  • BSP vendor rockchip-npu source: register layout, DMA descriptor struct, irq handling sequence

Phase 4 — baseline

  • Build vanilla llama.cpp on boltzmann (mainline branch)
  • Pull qwen2.5-1.5b-instruct Q4_K_M GGUF
  • llama-bench -m qwen2.5-1.5b -p 512 -n 128 -r 3 (three repetitions)
  • Capture JSON to benchmarks/$(date +%F)_boltzmann_qwen1.5b_cpu_baseline.json
  • Record into fleet/boltzmann.yaml:baseline_measurement
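
The baseline steps above could be wrapped roughly as follows. This is a sketch under assumptions: `baseline_path` and `run_baseline` are hypothetical helper names, the GGUF location is passed in by the caller rather than fixed here, and `llama-bench` is expected to be on `PATH`. The `-r 3` and `-o json` flags map to the "3 runs" and "Capture JSON" items.

```shell
#!/usr/bin/env bash
# Build the output path used in the TODO; the date defaults to today (ISO format).
baseline_path() {
    local day="${1:-$(date +%F)}"
    echo "benchmarks/${day}_boltzmann_qwen1.5b_cpu_baseline.json"
}

# Run the CPU baseline. $1 is the path to the GGUF model file.
run_baseline() {
    local model="$1" out
    out="$(baseline_path)"
    mkdir -p "$(dirname "$out")"
    # 512-token prompt, 128 generated tokens, 3 repetitions, JSON on stdout.
    llama-bench -m "$model" -p 512 -n 128 -r 3 -o json > "$out"
    echo "$out"
}
```

Keeping the path construction in its own function means the same name can be reused when the result is recorded under fleet/boltzmann.yaml:baseline_measurement.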

Phase-2 unblock — prerequisites (NEW, Phase-1 outflow)

Phase-1 audit (2026-05-19) reframed Phase-2 from "design rknpu backend interface" to a concrete bringup sequence. See docs/npu-mainline-status.md for full context.

  • Patch boltzmann board DTS / overlay to flip npu@fdab0000, npu@fdac0000, npu@fdad0000 from status = "disabled" to status = "okay". Rebuild DTB.
  • Mitigate IOMMU v1.0 hazard before first NPU job (32 GB host). Pick one:
    • (A) Boot with mem=4G for first-bringup validation, OR
    • (B) Carry local patches: Simon Xue per-device-ops (<https://lore.kernel.org/all/20260310105303.128859-1-xxm@rock-chips.com/>) + Midgy rockchip,rk3568-iommu-v1 discriminator compat + DT update for rknpu_mmu to the new compat.
  • modprobe rocket and confirm /dev/accel/accel0..2 appear, no probe errors in dmesg, IOMMU faults absent.
  • Read drivers/accel/rocket/rocket_job.c + Mesa Rocket Gallium to determine submit-job uAPI capabilities — specifically whether we can express a transformer matmul as a tile/op the NPU pipeline accepts, or whether we need additional op coverage upstream.
  • Decide matmul strategy (Phase-2 deliverable): conv-1×1 shoehorn / extend rocket op set / thinner submit shim.
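
The first-bringup checks above (memory cap in place, three accel nodes, no IOMMU faults) can be sketched as a small script. Everything here is illustrative: the helper names are hypothetical, and the fault pattern is an assumption about what the kernel log will contain, so it should be tuned against real dmesg output. `bringup_check` needs root for `modprobe` and `dmesg`.

```shell
#!/usr/bin/env bash
# Succeed if the kernel command line on stdin carries a mem= cap (option A).
has_mem_cap() {
    grep -qE '(^|[[:space:]])mem=[0-9]+[KMGkmg]?([[:space:]]|$)'
}

# Count accel device nodes under directory $1 (default /dev/accel).
accel_node_count() {
    local dir="${1:-/dev/accel}" n=0 f
    for f in "$dir"/accel*; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Count log lines that look like IOMMU faults; reads kernel log text on stdin.
# The pattern is a guess, not a confirmed driver message format.
iommu_fault_count() {
    grep -ciE 'iommu.*fault|page fault' || true
}

# Load the driver and report; succeeds only with 3 nodes and 0 suspected faults.
bringup_check() {
    has_mem_cap < /proc/cmdline || echo "warning: no mem= cap on this boot"
    modprobe rocket || return 1
    local nodes faults
    nodes="$(accel_node_count)"
    faults="$(dmesg | iommu_fault_count)"
    echo "accel nodes: $nodes (expect 3), suspected IOMMU faults: $faults (expect 0)"
    [ "$nodes" -eq 3 ] && [ "$faults" -eq 0 ]
}
```

The missing-cap case is only a warning because option (B), the local patches, is an equally valid mitigation that leaves the command line untouched.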

Standing items — track upstream

  • Watch drivers/iommu/rockchip-iommu.c for the discriminator rockchip,rk3568-iommu-v1 compat to land; drop local patch (B) when it does.
  • Watch linux-rockchip for the next iteration of Midgy / Simon's thread (last visible activity 2026-04-03).
  • Watch drivers/accel/rocket/ for matmul / GEMM op additions.
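
The first watch item can be partly automated by grepping a kernel checkout for the compat string. A minimal sketch, assuming a local tree is available; `compat_landed` is a hypothetical name and the tree path in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# Succeed if the rockchip IOMMU driver in kernel tree $1 contains the quoted
# compat string $2 (defaults to the discriminator compat from the TODO).
compat_landed() {
    local tree="$1" compat="${2:-rockchip,rk3568-iommu-v1}"
    grep -qF -- "\"$compat\"" "$tree/drivers/iommu/rockchip-iommu.c"
}
```

Usage after updating the checkout: `compat_landed ~/src/linux && echo "compat is in-tree; local patch (B) can be dropped"`. A match only shows the string is present, so the surrounding driver change still deserves a read before dropping the carry patch.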

Cross-phase / standing items (older)

  • Mirror Tomeu's branch — superseded: code is now in-tree. Keep git.kernel.org/.../torvalds/linux.git checkout pinned to the boltzmann kernel rev for in-tree reading.
  • Set up serial console on boltzmann for kernel-panic recovery (Quark umbrella; check current state) — becomes load-bearing once we start poking IOMMU code.
  • Add project_rosenblatt_overview.md + project_rocket_upstream_state.md to claude-memory — done 2026-05-19.
  • Decide repo home: marfrit/rosenblatt on git.reauktion.de (probably yes, after Phase-1 substrate is captured).
  • Resolve board-name discrepancy. README and fleet/boltzmann.yaml say boltzmann is a "Rock 5 ITX+" / rock-5-itx-plus; the running DT reports model = "Radxa ROCK 5 ITX", compatible = "radxa,rock-5-itx". Confirm physical board model (Radxa sells both SKUs) and either correct the README + manifest, or note that we boot the plain-ITX DT on ITX+ hardware (likely fine; ITX+ is mostly a connectivity-refresh, same SoC + same NPU silicon).