Rosenblatt Phase-1 closeout: rocket-driver substrate inventory

Phase-1 audit closes with a substantively different picture than the original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under codename `rocket` (NOT `rknpu`). All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`, compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but ship `status = "disabled"`; board enable is the Phase-2 unblock, not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket op-set additions. Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local `rockchip,rk3568-iommu-v1` discriminator patches before the first NPU job, to avoid DMA-window faults.

Files:

- docs/npu-mainline-status.md: full audit table with upstream pointers (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to flip the three rknn-core nodes to "okay" (+ matching mmu nodes), carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing upstream-watch items.
2026-05-19 12:41:31 +00:00
parent 24adc74812
commit c9a3f5c600
8 changed files with 731 additions and 61 deletions
@@ -0,0 +1,257 @@
+# Op coverage — what the rocket NPU pipeline actually does
+
+> **Phase-2 deliverable.** Captures the rocket uAPI surface, the NPU's
+> sub-block topology, and the implication for how llama.cpp's matmul
+> can be expressed on this silicon. Read alongside
+> `npu-mainline-status.md`.
+
+**Reference sources** (all in-tree as of mainline 2026-05-19):
+
+- `include/uapi/drm/rocket_accel.h` — userspace ABI
+- `drivers/accel/rocket/rocket_drv.c` — module init, of-match, ioctl table
+- `drivers/accel/rocket/rocket_job.c` — `rocket_job_hw_submit()` is the
+  authoritative description of the per-task hardware sequence
+- `drivers/accel/rocket/rocket_registers.h` — autogenerated from Mesa's
+  `src/gallium/drivers/rocket/registers.xml` (60 KB; **the regcmd
+  format is owned by Mesa, not the kernel**)
+- `Documentation/accel/rocket/index.rst` — one-paragraph driver summary
+
+---
+
+## uAPI: four ioctls, one device per fd
+
+| Ioctl | Direction | Purpose |
+|---|---|---|
+| `DRM_ROCKET_CREATE_BO` | IOWR | Allocate a GEM buffer, get NPU DMA address + mmap offset |
+| `DRM_ROCKET_PREP_BO`   | IOW  | Take CPU ownership; wait for NPU fences; cache-invalidate |
+| `DRM_ROCKET_FINI_BO`   | IOW  | Release CPU ownership; cache-flush for NPU |
+| `DRM_ROCKET_SUBMIT`    | IOW  | Submit an array of jobs (each = array of tasks) |
+
+Submission unit hierarchy:
+
+```
+drm_rocket_submit
+└── drm_rocket_job[]                       (each tied to specific in/out BOs)
+    └── drm_rocket_task[]                  (executes sequentially on ONE core)
+        ├── regcmd        DMA address of register-command buffer
+        └── regcmd_count  number of 64-bit packed reg/value entries
+```
+
+Key kernel-side invariant from `rocket_job_hw_submit()`:
+
+> "All tasks in the same job will be executed sequentially on the same
+> core, to benefit from memory residency in SRAM."
+
+Per-core SRAM is 2 MB. Bundling matmul tiles into one job buys weight
+reuse across tasks for free, which makes it an important Phase-2 perf knob.
+
+The driver exposes a **single `/dev/accel/accel0` facade** that
+schedules across all probed RKNN cores. Three DT nodes →
+`rdev->num_cores = 3`, but userspace sees one device.
+
+## What "regcmd" actually is
+
+The kernel's `rocket_job_hw_submit()` does **not** know what op runs.
+For each task it:
+
+1. Writes the task's `regcmd` DMA address to `PC_BASE_ADDRESS`.
+2. Writes `PC_REGISTER_AMOUNTS = (regcmd_count + 1) / 2 - 1`.
+3. Configures S_POINTER ping-pong (CNA + Core, with a per-core "extra
+   bit" `0x10000000 * core_index` lifted from BSP rknpu — undocumented
+   in TRM, kernel comments call it out).
+4. Enables DPU_0/DPU_1 interrupts.
+5. Pulls the trigger: `PC_OPERATION_ENABLE.OP_EN = 1`.
+
+The NPU's Program Controller then plays back the regcmd buffer:
+**a sequence of {register address, register value} pairs that userspace
+emitted**. The kernel is fundamentally a power/IOMMU/scheduler shim.
+**All op coverage lives in userspace regcmd construction.** That's
+why "what ops does rocket support?" is a Mesa question, not a kernel
+question.
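The arithmetic in steps 2 and 3 above can be sketched in a few lines of C. This is only an illustration of the two formulas quoted from `rocket_job_hw_submit()`; the function names are hypothetical and the three-core bound is an assumption taken from the RK3588 DT nodes mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Step 2: value written to PC_REGISTER_AMOUNTS for one task.
 * Integer division, as quoted: (regcmd_count + 1) / 2 - 1. */
uint32_t pc_register_amounts(uint32_t regcmd_count)
{
    assert(regcmd_count > 0);   /* an empty regcmd would underflow */
    return (regcmd_count + 1) / 2 - 1;
}

/* Step 3: the per-core "extra bit" used in the S_POINTER setup. */
uint32_t s_pointer_extra_bit(uint32_t core_index)
{
    assert(core_index < 3);     /* assumption: three cores (RK3588) */
    return UINT32_C(0x10000000) * core_index;
}
```

Because the division rounds down after adding one, an odd and the next even `regcmd_count` produce the same register value (3 and 4 both give 1).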
+
+## NPU sub-blocks (from PC_INTERRUPT_MASK)
+
+The interrupt-mask register names the discrete blocks the PC drives:
+
+| Block | Channels | Likely role |
+|---|---|---|
+| **CNA** — Convolution Neural-net Accelerator | FEATURE_0/1, WEIGHT_0/1, CSC_0/1 | Loads features + weights, color-space convert (legacy ISP path) |
+| **CORE** | 0/1 | MAC arrays — INT8 multiply-accumulate datapath |
+| **DPU** — Data Processing Unit | 0/1 | Per-channel post-process (likely accumulation / type convert) |
+| **PPU** — Post-Processing Unit | 0/1 | Activation / pooling / requantization |
+| **PC** — Program Controller | — | Reads regcmd, sequences the pipeline |
+| **DMA** | read/write error flags | Pulls features+weights, writes outputs through IOMMU |
+
+Each per-core block has two parallel channels. This is the same hardware
+the RKNPU2 vendor stack relies on to double-buffer input feature maps
+and weights so the MAC arrays stay fed.
+
+## Matmul on conv-shaped silicon — confirmed conv-only path
+
+Skimming Mesa's `src/gallium/drivers/rocket/rkt_regcmd.c` (only ~700
+lines, single conv emit path) **confirms**: the rocket hardware has no
+matmul-shaped instruction. Every register in the emit path is named
+`CNA_CONV_*`, `CNA_DATA_*`, `CNA_WEIGHT_*` — and the function that
+builds a task's regcmd buffer is literally called `fill_first_regcmd`
+with no alternate path for matmul. **Any matmul we run on this silicon
+goes through the conv pipeline.**
+
+### Regcmd encoding (from `rkt_regcmd.c::emit_raw`)
+
+The userspace regcmd buffer is an array of **64-bit packed entries**:
+
+```
+packed = (target << 48) | (value << 16) | reg
+```
+
+- `target` — block destination ID (CNA, CORE, DPU, PC, …) + a `+0x1`
+  arming bit. `rkt_get_target(reg)` returns the block per register.
+- `reg` — 16-bit register offset within the block.
+- `value` — 32-bit register write value.
+
+Each task's regcmd ends with the TRM-mandated trigger sequence:
+
+```
+0x0041000000000000   /* TRM: must precede op_en */
+emit_raw(regs, 0x81, REG_PC_OPERATION_ENABLE,
+         PC_OPERATION_ENABLE_RESERVED_0(14) | PC_OPERATION_ENABLE_OP_EN(1));
+```
+
+That trailing PC.OPERATION_ENABLE write is what makes the NPU pipeline
+start consuming the regcmd above it.
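A minimal sketch of the 64-bit entry layout described in this subsection, with matching field extractors. The helper names are hypothetical (only `emit_raw` is named in the source), and the register offset and value used in any example are arbitrary placeholders, not real rocket registers.

```c
#include <assert.h>
#include <stdint.h>

/* One regcmd entry: target in bits 63..48, value in bits 47..16,
 * register offset in bits 15..0. */
uint64_t regcmd_pack(uint16_t target, uint16_t reg, uint32_t value)
{
    return ((uint64_t)target << 48) | ((uint64_t)value << 16) | (uint64_t)reg;
}

/* Inverse helpers, handy when dumping a captured regcmd buffer. */
uint16_t regcmd_target(uint64_t entry) { return (uint16_t)(entry >> 48); }
uint32_t regcmd_value(uint64_t entry)  { return (uint32_t)(entry >> 16); }
uint16_t regcmd_reg(uint64_t entry)    { return (uint16_t)entry; }
```

Read through the same layout, the raw constant `0x0041000000000000` that precedes the trigger decodes as target `0x41` with a zero register offset and a zero value.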
+
+### Matmul-as-conv-1×1 — concrete mapping
+
+For an INT8 matmul `Y[M, N] = X[M, K] @ W[K, N]` (the shape attention
+and FFN projections take):
+
+| Conv axis | Matmul axis | Concrete |
+|---|---|---|
+| `DATAIN_WIDTH`         | 1                  | `CNA_DATA_SIZE0.DATAIN_WIDTH = 1` |
+| `DATAIN_HEIGHT`        | M (rows)           | `CNA_DATA_SIZE0.DATAIN_HEIGHT = M` |
+| `DATAIN_CHANNEL`       | K (input dim)      | `CNA_DATA_SIZE1.DATAIN_CHANNEL = K`, `DATAIN_CHANNEL_REAL = K-1` |
+| `WEIGHT_WIDTH/HEIGHT`  | 1, 1               | `CNA_WEIGHT_SIZE2.{WIDTH, HEIGHT} = 1` |
+| `WEIGHT_KERNELS`       | N (output dim)     | `CNA_WEIGHT_SIZE2.WEIGHT_KERNELS = N` |
+| `CONV_X/Y_STRIDE`      | 1, 1               | `CNA_CONV_CON3.{CONV_X_STRIDE, CONV_Y_STRIDE} = 1` |
+| `PAD_*`                | 0, 0, 0, 0         | `CNA_PAD_CON0.{PAD_LEFT, PAD_TOP} = 0` |
+| `DATAOUT_WIDTH`        | 1                  | `CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH = 0` (encoded as W-1) |
+| `DATAOUT_HEIGHT`       | M                  | `CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT = M-1` |
+| `DATAOUT_CHANNEL`      | N                  | `CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL = N-1` |
+
+This is exactly what the RKNPU2 vendor stack does — its public op
+list has `Conv2D` and not `MatMul`. We're aligned with the silicon's
+intended op surface, just from a clean uAPI.
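The mapping table above can be written down as a small helper that turns matmul dimensions into the field values to program. This is a sketch under assumptions: the struct and function are hypothetical containers that mirror the table's column names, not Mesa types, and register field-width limits are not checked because they are not documented here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the field values in the mapping table. */
struct conv1x1_fields {
    uint32_t datain_width, datain_height;
    uint32_t datain_channel, datain_channel_real;
    uint32_t weight_width, weight_height, weight_kernels;
    uint32_t conv_x_stride, conv_y_stride;
    uint32_t pad_left, pad_top;
    uint32_t dataout_width, dataout_height, dataout_channel;
};

/* Y[M, N] = X[M, K] @ W[K, N] expressed as a 1x1 convolution. */
struct conv1x1_fields matmul_as_conv1x1(uint32_t m, uint32_t k, uint32_t n)
{
    assert(m > 0 && k > 0 && n > 0);
    struct conv1x1_fields f = {
        .datain_width = 1,      .datain_height = m,
        .datain_channel = k,    .datain_channel_real = k - 1,
        .weight_width = 1,      .weight_height = 1,
        .weight_kernels = n,
        .conv_x_stride = 1,     .conv_y_stride = 1,
        .pad_left = 0,          .pad_top = 0,
        /* DATAOUT_* fields are stored minus one, per the table. */
        .dataout_width = 0,     .dataout_height = m - 1,
        .dataout_channel = n - 1,
    };
    return f;
}
```

Keeping the minus-one encodings in one place avoids repeating the off-by-one in every call site that builds a regcmd.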
+
+### Quantization handoff
+
+`rkt_regcmd.c` has a clean (non-add-tensor) requantization path:
+
+```
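+conv_scale = (input_scale * weights_scale) / output_scale;
+/* scale_bits: raw IEEE-754 bit pattern of the float conv_scale */
+shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;   /* >> 23 yields the biased exponent */
+scale = ((scale_bits >> 9) & 0x7fff) + 1;          /* 15 bits taken from bit 9 upward */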
+```
+
+Mirrors PyTorch QNNPACK requantization. For our **Q4_K_M weights** the
+dequant-to-INT8 step is on CPU (NEON): walk the 32-element groups in
+the block, scale per-group, write an INT8 weight tile of `K×N`. NPU
+gets:
+
+- INT8 activation tensor `X` with per-tensor `input_scale`,
+  `input_zero_point`
+- INT8 weight tile `W` with per-tensor `weights_scale` (after we fold
+  per-group into per-channel — the Q4_K_M scaling is finer-grained
+  than what this conv requantization expects)
+- One `output_scale`, `output_zero_point` per matmul
+
+The Q4_K_M → conv-quant fold is the **non-trivial CPU work** that
+sits between a Q4_K_M GGUF and a regcmd. There are two options:
+
+- (a) per-output-channel scales (NPU-friendly, lossy vs Q4_K_M)
+- (b) per-block scales applied at runtime via multiple smaller matmul
+  submits, each with its own `output_scale` (CPU overhead per
+  submit, but mathematically faithful to Q4_K_M)
+
+Phase-3 question. (a) is likely good enough for "credible" tok/s; (b)
+preserves Q4_K_M accuracy but tanks throughput.
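For experimenting with scale choices on the CPU side, the three-line requantization excerpt earlier in this section can be transcribed into standalone C. This is a sketch, not Mesa's code: the struct and function names are hypothetical, `scale_bits` is assumed to be the IEEE-754 bit pattern of `conv_scale`, and any adjustment Mesa applies after those three lines is not reproduced.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pair of values derived from the three scales. */
struct requant_params {
    uint32_t shift;
    uint32_t scale;
};

struct requant_params requant_from_scales(float input_scale,
                                          float weights_scale,
                                          float output_scale)
{
    assert(input_scale > 0.0f && weights_scale > 0.0f && output_scale > 0.0f);
    float conv_scale = (input_scale * weights_scale) / output_scale;

    /* Reinterpret the float as its raw 32-bit pattern without
     * violating strict aliasing. */
    uint32_t scale_bits;
    memcpy(&scale_bits, &conv_scale, sizeof scale_bits);

    struct requant_params p;
    p.shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
    p.scale = ((scale_bits >> 9) & 0x7fff) + 1;
    return p;
}
```

Feeding candidate per-channel scales from option (a) through a helper like this is one way to see how coarsely a given `conv_scale` is represented before anything is submitted to the NPU.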
+
+### What rocket today does NOT exercise (matmul-relevant blocks)
+
+The Mesa emit path uses **CNA + CORE + DPU**. It does **not** touch:
+
+- **PPU** — Post-Processing Unit. Per `rocket_registers.h` interrupt
+  flags, PPU has 2 channels. Likely activation / pooling. For pure
+  matmul we don't need it.
+- **CSC** — color-space conversion submodule of CNA. Vestigial for
+  ML; bypassed in non-image paths.
+
+That's fine — we won't need to wake those for transformer matmul.
+
+What we need to learn from Mesa's regcmd encoding (Phase-2 follow-up):
+
+1. **Max tile shape per CNA invocation** — what (K × N) tile does the
+   hardware natively prefer? Pulls from `rocket_registers.h` CNA_*
+   geometry registers.
+2. **INT8 quantization parameter layout** — per-channel scales live
+   where in the regcmd / weight blob?
+3. **Output type** — INT32 accumulator → INT8 quant? FP16 output? llama.cpp's
+   ggml graph needs to know what comes back to NEON.
+4. **CSC block usage** — color-space conversion is meaningless for LLM,
+   but the FEATURE pipeline may force going through it. If so, it's an
+   identity transform we configure once.
+5. **Per-core split for transformer attention** — three cores × two
+   channels each = 6 parallel matmul lanes. Worth the orchestration
+   complexity only if one core can't saturate the path on its own.
+
+## Op coverage gap inventory
+
+Looking at what a transformer decode step needs vs. what the rocket
+pipeline can plausibly do today (per-op confidence in parens):
+
+| llama.cpp op | NPU target | Confidence |
+|---|---|---|
+| INT8 matmul (Q/K/V projections, FFN) | CNA conv-1×1 | **medium-high** — Mesa already does conv in INT8 |
+| INT8 matmul with INT4 weights (Q4_K_M) | dequant on CPU, INT8 matmul on NPU | medium — Q4_K_M's per-group scales add a dequant pass; first cut keeps it on NEON |
+| RoPE (rotary position embeddings) | CPU (NEON) | n/a — wrong op shape for CNA |
+| Softmax (attention) | CPU (NEON) | n/a — needs exp/sum, no direct support |
+| RMSNorm | CPU (NEON) | n/a — sqrt + element-wise mul |
+| Embedding lookup | CPU | n/a — gather, no NPU support |
+| Sampling (top-k, top-p, multinomial) | CPU | n/a |
+| Residual add + activation | DPU/PPU? unknown | low — needs Mesa-side check |
+
+The Phase-4 baseline run will tell us how much wall-time matmul
+actually owns; if it's >70% on CPU baseline, then offloading just
+matmul (the medium-high row) gets us most of the speedup.
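The ">70%" reasoning is Amdahl's law, and a short helper makes the ceiling concrete. The function name is hypothetical, and any acceleration factor plugged in is a placeholder until the baseline run produces real numbers.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `matmul_frac` of the
 * baseline wall-time is accelerated by a factor `accel` and the rest
 * stays on the CPU unchanged. */
double offload_speedup(double matmul_frac, double accel)
{
    assert(matmul_frac >= 0.0 && matmul_frac <= 1.0);
    assert(accel > 0.0);
    return 1.0 / ((1.0 - matmul_frac) + matmul_frac / accel);
}
```

With matmul at 70% of wall-time, an illustrative 10× matmul speedup yields about 2.7× overall, and even an infinitely fast NPU is capped at 1 / 0.3 ≈ 3.3×, because the remaining CPU ops dominate.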
+
+## Open questions — kicked to Phase-3
+
+1. **regcmd format spec.** Pull `registers.xml` from Mesa and trace one
+   conv invocation end-to-end in `src/gallium/drivers/rocket/`.
+2. **Memory budget per task.** 2 MB SRAM per core × 3 cores = 6 MB
+   working set. qwen-1.5B has ~1.5 B parameters: ~750 MB at 4
+   bits/weight (Q4_K_M adds per-block scale metadata on top) and
+   ~1.5 GB once dequantized to INT8 for the NPU. Need to chunk by
+   FFN-hidden / attention-head and stream.
+3. **DMA address space layout.** Per-fd IOMMU domain (per
+   `rocket_drv.c::rocket_open`), so we can keep weights resident across
+   submits within one llama.cpp process. Cross-process sharing would
+   need dmabuf import.
+4. **Power-domain bring-up time.** `pm_runtime_get_sync(core->dev)` per
+   job (see `rocket_job_run`) — if the autosuspend timeout fires
+   between decode steps, we pay the wake cost each token. Worth
+   measuring before fighting it.
+5. **Job-timeout = 500 ms hard-coded** in `rocket_job.c::JOB_TIMEOUT_MS`.
+   Long matmul tiles must stay under that. Cap tile size accordingly
+   or split into smaller tasks.
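The memory-budget question (item 2) reduces to two pieces of arithmetic, sketched below. Both helpers are hypothetical. The tile count optimistically assumes the whole per-core SRAM can hold weights, whereas a real budget must also leave room for features and outputs, and the matrix shapes used in any example are illustrative rather than taken from a specific model.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for `params` weights stored at `bits_per_weight`
 * (per-block scale metadata is ignored). */
uint64_t weight_bytes(uint64_t params, uint32_t bits_per_weight)
{
    return params * bits_per_weight / 8;
}

/* Number of tiles when an INT8 K x N weight matrix (one byte per
 * element) is cut into pieces of at most `tile_bytes`, rounding up. */
uint64_t weight_tiles(uint64_t k, uint64_t n, uint64_t tile_bytes)
{
    assert(tile_bytes > 0);
    return (k * n + tile_bytes - 1) / tile_bytes;
}
```

For example, a 2048 × 8192 INT8 matrix is 16 MiB, so it needs at least 8 tiles of 2 MiB each. That count is a lower bound on the number of tasks, which the 500 ms job timeout in item 5 then constrains from the other side.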
+
+## Recommendation for Phase-3 entry
+
+When Phase-2 closes, Phase-3 ("Analyze" — read RKNPU2 SDK + Tomeu's
+uAPI + BSP rockchip-npu for the spec) becomes mostly:
+
+- Read `src/gallium/drivers/rocket/` end-to-end in Mesa main.
+- Pick one conv invocation from a known model (Teflon MobileNetV1
+  first layer is the simplest), trace it down to the regcmd buffer
+  bytes, and document the field layout for matmul reconstruction.
+- Read the BSP `drivers/rknpu/` for any quant-scale plumbing the Mesa
+  side doesn't expose (especially for INT4 → INT8 dequant alignment).
+- The vendor-blob `librknnrt.so` is **off-limits** (closed license);
+  do not link, do not extract symbols.