marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory
Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.
2026-05-19 12:41:31 +00:00

# Op coverage — what the rocket NPU pipeline actually does
> **Phase-2 deliverable.** Captures the rocket uAPI surface, the NPU's
> sub-block topology, and the implication for how llama.cpp's matmul
> can be expressed on this silicon. Read alongside
> `npu-mainline-status.md`.

**Reference sources** (all in-tree as of mainline 2026-05-19):
- `include/uapi/drm/rocket_accel.h` — userspace ABI
- `drivers/accel/rocket/rocket_drv.c` — module init, of-match, ioctl table
- `drivers/accel/rocket/rocket_job.c` — `rocket_job_hw_submit()` is the
authoritative description of the per-task hardware sequence
- `drivers/accel/rocket/rocket_registers.h` — autogenerated from Mesa's
`src/gallium/drivers/rocket/registers.xml` (60 KB; **the regcmd
format is owned by Mesa, not the kernel**)
- `Documentation/accel/rocket/index.rst` — one-paragraph driver summary
---
## uAPI: four ioctls, one device per fd
| Ioctl | Direction | Purpose |
|---|---|---|
| `DRM_ROCKET_CREATE_BO` | IOWR | Allocate a GEM buffer, get NPU DMA address + mmap offset |
| `DRM_ROCKET_PREP_BO` | IOW | Take CPU ownership; wait for NPU fences; cache-invalidate |
| `DRM_ROCKET_FINI_BO` | IOW | Release CPU ownership; cache-flush for NPU |
| `DRM_ROCKET_SUBMIT` | IOW | Submit an array of jobs (each = array of tasks) |

Submission unit hierarchy:
```
drm_rocket_submit
└── drm_rocket_job[]         (each tied to specific in/out BOs)
    └── drm_rocket_task[]    (executes sequentially on ONE core)
        ├── regcmd           DMA address of register-command buffer
        └── regcmd_count     number of 64-bit reg/value entries
```
Key kernel-side invariant from `rocket_job_hw_submit()`:
> "All tasks in the same job will be executed sequentially on the same
> core, to benefit from memory residency in SRAM."

Per-core SRAM is 2 MB. Bundling matmul tiles into one job buys weight
reuse across tasks for free, which makes it an important Phase-2 perf knob.

The driver exposes a **single `/dev/accel/accel0` facade** that
schedules across all probed RKNN cores. Three DT nodes →
`rdev->num_cores = 3`, but userspace sees one device.
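
As a rough mental model of the submit / job / task nesting and the same-core invariant described above, the sketch below groups matmul tiles so that tiles sharing a weight buffer land in the same job. `Task`, `Job` and `bundle_by_weight_tile` are hypothetical names for illustration only; they do not reproduce the struct layout in `rocket_accel.h`, and the sketch assumes (as this document does) that same-job tasks actually benefit from SRAM residency.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    regcmd_addr: int   # NPU DMA address of the register-command buffer
    regcmd_count: int  # number of entries in that buffer


@dataclass
class Job:
    tasks: list = field(default_factory=list)    # run in order on ONE core
    in_bos: list = field(default_factory=list)   # BO handles the job reads
    out_bos: list = field(default_factory=list)  # BO handles the job writes


def bundle_by_weight_tile(tiles):
    """Group (weight_bo, out_bo, Task) triples into one Job per weight BO.

    Hypothetical helper: tiles that share weights end up in the same job,
    so they are guaranteed to run back to back on a single core.
    """
    jobs = {}
    for weight_bo, out_bo, task in tiles:
        job = jobs.setdefault(weight_bo, Job(in_bos=[weight_bo]))
        job.tasks.append(task)
        if out_bo not in job.out_bos:
            job.out_bos.append(out_bo)
    return list(jobs.values())
```

Jobs come back in first-seen order of their weight BO, and each job keeps its tasks in submission order. A real submit path would still have to translate these objects into the ioctl structures.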
## What "regcmd" actually is
The kernel's `rocket_job_hw_submit()` does **not** know what op runs.
For each task it:
1. Writes the task's `regcmd` DMA address to `PC_BASE_ADDRESS`.
2. Writes `PC_REGISTER_AMOUNTS = (regcmd_count + 1) / 2 - 1`.
3. Configures S_POINTER ping-pong (CNA + Core, with a per-core "extra
bit" `0x10000000 * core_index` lifted from BSP rknpu; it is undocumented
in the TRM, and the kernel comments call that out).
4. Enables DPU_0/DPU_1 interrupts.
5. Pulls the trigger: `PC_OPERATION_ENABLE.OP_EN = 1`.

The NPU's Program Controller then plays back the regcmd buffer:
**a sequence of {register address, register value} pairs that userspace
emitted**. The kernel is fundamentally a power/IOMMU/scheduler shim.

**All op coverage lives in userspace regcmd construction.** That's
why "what ops does rocket support?" is a Mesa question, not a kernel
question.
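
The two pieces of arithmetic in the per-task sequence above can be checked in isolation. This is a minimal sketch that only restates the formulas quoted in this section; the function names are hypothetical and nothing here touches hardware.

```python
def pc_register_amounts(regcmd_count: int) -> int:
    """Value written to PC_REGISTER_AMOUNTS for a task (formula from step 2)."""
    if regcmd_count < 1:
        raise ValueError("a task needs at least one regcmd entry")
    return (regcmd_count + 1) // 2 - 1


def s_pointer_extra_bit(core_index: int) -> int:
    """Per-core "extra bit" OR-ed into S_POINTER (formula from step 3)."""
    return 0x10000000 * core_index
```

The `(n + 1) // 2 - 1` shape reads as "entries rounded up to pairs, minus one". That interpretation is an inference from the arithmetic, not something the kernel source is quoted as saying here.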
## NPU sub-blocks (from PC_INTERRUPT_MASK)
The interrupt-mask register names the discrete blocks the PC drives:
| Block | Channels | Likely role |
|---|---|---|
| **CNA** — Convolution Neural-net Accelerator | FEATURE_0/1, WEIGHT_0/1, CSC_0/1 | Loads features + weights, color-space convert (legacy ISP path) |
| **CORE** | 0/1 | MAC arrays — INT8 multiply-accumulate datapath |
| **DPU** — Data Processing Unit | 0/1 | Per-channel post-process (likely accumulation / type convert) |
| **PPU** — Post-Processing Unit | 0/1 | Activation / pooling / requantization |
| **PC** — Program Controller | — | Reads regcmd, sequences the pipeline |
| **DMA** | read/write error flags | Pulls features+weights, writes outputs through IOMMU |

Each per-core block has two parallel channels, the same hardware that the
RKNPU2 vendor stack relies on for double-buffering input feature maps
and weights to keep the MAC arrays fed.
## Matmul on conv-shaped silicon — confirmed conv-only path
Skimming Mesa's `src/gallium/drivers/rocket/rkt_regcmd.c` (only ~700
lines, single conv emit path) **confirms**: the rocket hardware has no
matmul-shaped instruction. Every register in the emit path is named
`CNA_CONV_*`, `CNA_DATA_*`, `CNA_WEIGHT_*` — and the function that
builds a task's regcmd buffer is literally called `fill_first_regcmd`
with no alternate path for matmul. **Any matmul we run on this silicon
goes through the conv pipeline.**
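
Routing matmul through the conv pipeline loses nothing mathematically: a 1×1 convolution with stride 1 and no padding over an `M × 1` feature map with `K` channels is the same sum of products as `X[M, K] @ W[K, N]`. The pure-Python sketch below demonstrates that equivalence on the CPU. It says nothing about how the hardware lays out data, and the function names are illustrative.

```python
def matmul(X, W):
    """Reference Y[M][N] = X[M][K] @ W[K][N] on nested lists."""
    K, N = len(W), len(W[0])
    return [[sum(row[k] * W[k][n] for k in range(K)) for n in range(N)]
            for row in X]


def conv1x1(feature, kernels):
    """1x1 convolution, stride 1, no padding.

    feature[h][w] is a list of C channel values; kernels[n] is a list of
    C weights (one 1x1xC kernel per output channel).
    """
    return [[[sum(px[c] * kern[c] for c in range(len(px))) for kern in kernels]
             for px in row]
            for row in feature]


def matmul_via_conv(X, W):
    """Express X @ W as a 1x1 conv: height = M, width = 1, channels = K."""
    K, N = len(W), len(W[0])
    feature = [[row] for row in X]                              # H=M, W=1, C=K
    kernels = [[W[k][n] for k in range(K)] for n in range(N)]   # N kernels
    out = conv1x1(feature, kernels)                             # H=M, W=1, C=N
    return [out_row[0] for out_row in out]
```

Each row of `X` becomes one pixel of the feature map, and each column of `W` becomes one kernel, so the output channels of the conv are the columns of `Y`.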
### Regcmd encoding (from `rkt_regcmd.c::emit_raw`)
The userspace regcmd buffer is an array of **64-bit packed entries**:
```
packed = (target << 48) | (value << 16) | reg
```
- `target` — block destination ID (CNA, CORE, DPU, PC, …) + a `+0x1`
arming bit. `rkt_get_target(reg)` returns the block per register.
- `reg` — 16-bit register offset within the block.
- `value` — 32-bit register write value.

Each task's regcmd ends with the TRM-mandated trigger sequence:
```
0x0041000000000000 /* TRM: must precede op_en */
emit_raw(regs, 0x81, REG_PC_OPERATION_ENABLE,
PC_OPERATION_ENABLE_RESERVED_0(14) | PC_OPERATION_ENABLE_OP_EN(1));
```
That trailing PC.OPERATION_ENABLE write is what makes the NPU pipeline
start consuming the regcmd above it.
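
The 64-bit packing above is easy to get wrong by one shift, so a small encoder/decoder helps when reading regcmd dumps. This is a sketch of the bit layout as quoted from `emit_raw`; `pack_regcmd` and `unpack_regcmd` are hypothetical helper names, and the register offset used in the example is made up.

```python
def pack_regcmd(target: int, reg: int, value: int) -> int:
    """Pack one regcmd entry: target in bits 63..48, value in 47..16, reg in 15..0."""
    if not 0 <= target <= 0xFFFF:
        raise ValueError("target must fit in 16 bits")
    if not 0 <= reg <= 0xFFFF:
        raise ValueError("reg must fit in 16 bits")
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return (target << 48) | (value << 16) | reg


def unpack_regcmd(packed: int):
    """Inverse of pack_regcmd; returns (target, reg, value)."""
    target = (packed >> 48) & 0xFFFF
    value = (packed >> 16) & 0xFFFFFFFF
    reg = packed & 0xFFFF
    return target, reg, value


# Made-up register offset 0x1234, only to show where each field lands.
example = pack_regcmd(0x0201, 0x1234, 0xDEADBEEF)
print(hex(example))
```

Decoding the literal `0x0041000000000000` from the trigger sequence with `unpack_regcmd` gives target `0x41` with `reg` and `value` both zero, which follows directly from the shifts.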
### Matmul-as-conv-1×1 — concrete mapping
For an INT8 matmul `Y[M, N] = X[M, K] @ W[K, N]` (the shape attention
and FFN projections take):
| Conv axis | Matmul axis | Register setting |
|---|---|---|
| `DATAIN_WIDTH` | 1 | `CNA_DATA_SIZE0.DATAIN_WIDTH = 1` |
| `DATAIN_HEIGHT` | M (rows) | `CNA_DATA_SIZE0.DATAIN_HEIGHT = M` |
| `DATAIN_CHANNEL` | K (input dim) | `CNA_DATA_SIZE1.DATAIN_CHANNEL = K`, `DATAIN_CHANNEL_REAL = K-1` |
| `WEIGHT_WIDTH/HEIGHT` | 1, 1 | `CNA_WEIGHT_SIZE2.{WIDTH, HEIGHT} = 1` |
| `WEIGHT_KERNELS` | N (output dim) | `CNA_WEIGHT_SIZE2.WEIGHT_KERNELS = N` |
| `CONV_X/Y_STRIDE` | 1, 1 | `CNA_CONV_CON3.{CONV_X_STRIDE, CONV_Y_STRIDE} = 1` |
| `PAD_*` | 0, 0, 0, 0 | `CNA_PAD_CON0.{PAD_LEFT, PAD_TOP} = 0` |
| `DATAOUT_WIDTH` | 1 | `CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH = 0` (encoded as W-1) |
| `DATAOUT_HEIGHT` | M | `CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT = M-1` |
| `DATAOUT_CHANNEL` | N | `CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL = N-1` |

This is exactly what the RKNPU2 vendor stack does: its public op
list has `Conv2D` and not `MatMul`. We're aligned with the silicon's
intended op surface, just from a clean uAPI.
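
The mapping table can be turned into a small function that produces the field values for a given `(M, K, N)`. This is a sketch that only transcribes the table, including its mix of plain and minus-one encodings. The dictionary keys are labels copied from the table, not verified symbol names from `rocket_registers.h`, and `matmul_as_conv_fields` is a hypothetical helper.

```python
def matmul_as_conv_fields(M: int, K: int, N: int) -> dict:
    """Field values for Y[M, N] = X[M, K] @ W[K, N] expressed as a 1x1 conv."""
    if min(M, K, N) < 1:
        raise ValueError("M, K and N must all be >= 1")
    return {
        # Input feature map: width 1, height M, K channels.
        "CNA_DATA_SIZE0.DATAIN_WIDTH": 1,
        "CNA_DATA_SIZE0.DATAIN_HEIGHT": M,
        "CNA_DATA_SIZE1.DATAIN_CHANNEL": K,
        "CNA_DATA_SIZE1.DATAIN_CHANNEL_REAL": K - 1,
        # Weights: N kernels of size 1x1.
        "CNA_WEIGHT_SIZE2.WEIGHT_WIDTH": 1,
        "CNA_WEIGHT_SIZE2.WEIGHT_HEIGHT": 1,
        "CNA_WEIGHT_SIZE2.WEIGHT_KERNELS": N,
        # Stride 1, no padding.
        "CNA_CONV_CON3.CONV_X_STRIDE": 1,
        "CNA_CONV_CON3.CONV_Y_STRIDE": 1,
        "CNA_PAD_CON0.PAD_LEFT": 0,
        "CNA_PAD_CON0.PAD_TOP": 0,
        # Output dimensions are encoded as size minus one.
        "CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH": 0,
        "CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT": M - 1,
        "CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL": N - 1,
    }
```

Keeping the mapping in one place makes it easier to diff against a traced conv invocation later, when the real field names and encodings are confirmed.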
### Quantization handoff
`rkt_regcmd.c` has a clean (non-add-tensor) requantization path:
```
conv_scale = (input_scale * weights_scale) / output_scale;
/* scale_bits: raw IEEE-754 bit pattern of the 32-bit float conv_scale */
shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
scale = ((scale_bits >> 9) & 0x7fff) + 1;
```
Mirrors PyTorch QNNPACK requantization. For our **Q4_K_M weights** the
dequant-to-INT8 step is on CPU (NEON): walk the 32-element groups in
the block, scale per-group, write an INT8 weight tile of `K×N`. NPU
gets:
- INT8 activation tensor `X` with per-tensor `input_scale`,
`input_zero_point`
- INT8 weight tile `W` with per-tensor `weights_scale` (after we fold
per-group into per-channel — the Q4_K_M scaling is finer-grained
than what this conv requantization expects)
- One `output_scale`, `output_zero_point` per matmul

The Q4_K_M → conv-quant fold is the **non-trivial CPU work** that
sits between a Q4_K_M GGUF and a regcmd. We need to choose between:

- (a) per-output-channel scales (NPU-friendly, lossy vs Q4_K_M)
- (b) per-block scales applied at runtime via multiple smaller matmul
  submits, each with a different `output_scale` (CPU overhead per
  submit, but mathematically faithful to Q4_K_M)

That is a Phase-3 question. (a) is likely good enough for "credible" tok/s; (b)
preserves Q4_K_M accuracy but tanks throughput.
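
Whichever fold is chosen, the scales end up in the requantization arithmetic quoted earlier in this subsection. The sketch below reproduces those three lines verbatim in Python so the resulting `shift` and `scale` can be inspected for candidate scale values. It assumes a positive `conv_scale` (so the sign bit is zero and `scale_bits >> 23` is the biased exponent) and makes no claim about rounding or clamping that Mesa may apply elsewhere; `requant_params` is a hypothetical name.

```python
import struct


def float_bits(x: float) -> int:
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def requant_params(input_scale: float, weights_scale: float,
                   output_scale: float):
    """Return (shift, scale) using the formulas quoted from rkt_regcmd.c."""
    conv_scale = (input_scale * weights_scale) / output_scale
    if conv_scale <= 0.0:
        raise ValueError("sketch assumes a positive combined scale")
    scale_bits = float_bits(conv_scale)
    shift = 127 + 31 - 32 - (scale_bits >> 23) + 16
    scale = ((scale_bits >> 9) & 0x7FFF) + 1
    return shift, scale
```

Because `scale` is taken from the top mantissa bits, two fold strategies whose combined scales differ only in the low mantissa bits produce identical register values. That is worth keeping in mind when comparing option (a) against option (b).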
### What rocket today does NOT exercise (matmul-relevant blocks)
The Mesa emit path uses **CNA + CORE + DPU**. It does **not** touch:
- **PPU** — Post-Processing Unit. Per `rocket_registers.h` interrupt
flags, PPU has 2 channels. Likely activation / pooling. For pure
matmul we don't need it.
- **CSC** — color-space conversion submodule of CNA. Vestigial for
ML; bypassed in non-image paths.

That's fine: we won't need to wake those for transformer matmul.

What we need to learn from Mesa's regcmd encoding (Phase-2 follow-up):
1. **Max tile shape per CNA invocation** — what (K × N) tile does the
hardware natively prefer? Pulls from `rocket_registers.h` CNA_*
geometry registers.
2. **INT8 quantization parameter layout** — per-channel scales live
where in the regcmd / weight blob?
3. **Output type** — INT32 accumulator → INT8 quant? FP16 output? llama.cpp's
ggml graph needs to know what comes back to NEON.
4. **CSC block usage** — color-space conversion is meaningless for LLM,
but the FEATURE pipeline may force going through it. If so, it's an
identity transform we configure once.
5. **Per-core split for transformer attention** — three cores × two
channels each = 6 parallel matmul lanes. Worth the orchestration
complexity only if one core can't saturate the path on its own.
## Op coverage gap inventory
Looking at what a transformer decode step needs vs. what the rocket
pipeline can plausibly do today (per-op confidence in parens):
| llama.cpp op | NPU target | Confidence |
|---|---|---|
| INT8 matmul (Q/K/V projections, FFN) | CNA conv-1×1 | **medium-high** — Mesa already does conv in INT8 |
| INT8 matmul with INT4 weights (Q4_K_M) | dequant on CPU, INT8 matmul on NPU | medium — Q4_K_M's per-group scales add a dequant pass; first cut keeps it on NEON |
| RoPE (rotary position embeddings) | CPU (NEON) | n/a — wrong op shape for CNA |
| Softmax (attention) | CPU (NEON) | n/a — needs exp/sum, no direct support |
| RMSNorm | CPU (NEON) | n/a — sqrt + element-wise mul |
| Embedding lookup | CPU | n/a — gather, no NPU support |
| Sampling (top-k, top-p, multinomial) | CPU | n/a |
| Residual add + activation | DPU/PPU? unknown | low — needs Mesa-side check |

The Phase-4 baseline run will tell us how much wall-time matmul
actually owns; if it's >70% on the CPU baseline, then offloading just
matmul (the medium-high row) gets us most of the speedup.
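
How much "most of the speedup" can be is bounded by Amdahl's law: if a fraction `p` of wall-time is matmul and the NPU runs that part `s` times faster, the whole decode step speeds up by `1 / ((1 - p) + p / s)`. The sketch below evaluates that bound. The 70% share and the example speedups are illustrative assumptions, not measurements.

```python
def overall_speedup(matmul_fraction: float, matmul_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only matmul is accelerated."""
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be within [0, 1]")
    if matmul_speedup <= 0.0:
        raise ValueError("matmul_speedup must be positive")
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


# Assumed 70% matmul share at a few hypothetical NPU speedups.
for s in (2.0, 4.0, 8.0, float("inf")):
    print(f"matmul {s}x faster -> overall {overall_speedup(0.7, s):.2f}x")
```

With a 70% matmul share the ceiling is 1/0.3, roughly 3.3×, no matter how fast the NPU is. Everything left on NEON (RoPE, softmax, RMSNorm, sampling) sets that limit.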
## Open questions — kicked to Phase-3
1. **regcmd format spec.** Pull `registers.xml` from Mesa and trace one
conv invocation end-to-end in `src/gallium/drivers/rocket/`.
2. **Memory budget per task.** 2 MB SRAM per core × 3 cores = 6 MB
working set. qwen-1.5B has ~1.5 B parameters, so its weights take
~750 MB at 4 bits each (Q4_K_M averages slightly more) and roughly
twice that once dequantized to INT8. Need to chunk by FFN-hidden /
attention-head and stream.
3. **DMA address space layout.** Per-fd IOMMU domain (per
`rocket_drv.c::rocket_open`), so we can keep weights resident across
submits within one llama.cpp process. Cross-process sharing would
need dmabuf import.
4. **Power-domain bring-up time.** `pm_runtime_get_sync(core->dev)` per
job (see `rocket_job_run`) — if the autosuspend timeout fires
between decode steps, we pay the wake cost each token. Worth
measuring before fighting it.
5. **Job-timeout = 500 ms hard-coded** in `rocket_job.c::JOB_TIMEOUT_MS`.
Long matmul tiles must stay under that. Cap tile size accordingly
or split into smaller tasks.
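
For the memory-budget question, a first-order tile size can be estimated before any measurement. The sketch below assumes the whole 2 MB of per-core SRAM is usable, that activations, weights and outputs are all INT8, and that there are no alignment or double-buffering requirements. None of these are confirmed by the sources above, and the function names and example dimensions are illustrative.

```python
SRAM_BYTES = 2 * 1024 * 1024  # per-core SRAM figure used in this document


def max_n_tile(M: int, K: int, sram_bytes: int = SRAM_BYTES) -> int:
    """Largest N_tile such that X (M*K) + W tile (K*N_tile) + Y tile (M*N_tile)
    fits in sram_bytes, with one byte per element. Returns 0 if X alone
    does not fit."""
    budget = sram_bytes - M * K
    if budget <= 0:
        return 0
    return budget // (K + M)


def tiles_needed(N: int, n_tile: int) -> int:
    """Number of column tiles needed to cover N outputs (ceiling division)."""
    if n_tile < 1:
        raise ValueError("n_tile must be >= 1")
    return -(-N // n_tile)


# Single-token decode (M = 1) with illustrative K and N.
n_tile = max_n_tile(M=1, K=1536)
print(n_tile, tiles_needed(8960, n_tile))
```

The same split also bounds how long each task runs, which is the lever for staying under the 500 ms job timeout. Actual tile limits will come from the CNA geometry registers rather than from this byte count.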
## Recommendation for Phase-3 entry
When Phase-2 closes, Phase-3 ("Analyze" — read RKNPU2 SDK + Tomeu's
uAPI + BSP rockchip-npu for the spec) becomes mostly:
- Read `src/gallium/drivers/rocket/` end-to-end in Mesa main.
- Pick one conv invocation from a known model (Teflon MobileNetV1
first layer is the simplest), trace it down to the regcmd buffer
bytes, and document the field layout for matmul reconstruction.
- Read the BSP `drivers/rknpu/` for any quant-scale plumbing the Mesa
side doesn't expose (especially for INT4 → INT8 dequant alignment).
- The vendor-blob `librknnrt.so` is **off-limits** (closed license);
do not link, do not extract symbols.