# Op coverage — what the rocket NPU pipeline actually does

> **Phase-2 deliverable.** Captures the rocket uAPI surface, the NPU's
> sub-block topology, and the implication for how llama.cpp's matmul
> can be expressed on this silicon. Read alongside
> `npu-mainline-status.md`.

**Reference sources** (all in-tree as of mainline 2026-05-19):

- `include/uapi/drm/rocket_accel.h` — userspace ABI
- `drivers/accel/rocket/rocket_drv.c` — module init, of-match, ioctl table
- `drivers/accel/rocket/rocket_job.c` — `rocket_job_hw_submit()` is the
  authoritative description of the per-task hardware sequence
- `drivers/accel/rocket/rocket_registers.h` — autogenerated from Mesa's
  `src/gallium/drivers/rocket/registers.xml` (60 KB; **the regcmd format
  is owned by Mesa, not the kernel**)
- `Documentation/accel/rocket/index.rst` — one-paragraph driver summary

---

## uAPI: four ioctls, one device per fd

| Ioctl | Direction | Purpose |
|---|---|---|
| `DRM_ROCKET_CREATE_BO` | IOWR | Allocate a GEM buffer, get NPU DMA address + mmap offset |
| `DRM_ROCKET_PREP_BO` | IOW | Take CPU ownership; wait for NPU fences; cache-invalidate |
| `DRM_ROCKET_FINI_BO` | IOW | Release CPU ownership; cache-flush for NPU |
| `DRM_ROCKET_SUBMIT` | IOW | Submit an array of jobs (each = array of tasks) |

Submission unit hierarchy:

```
drm_rocket_submit
└── drm_rocket_job[]          (each tied to specific in/out BOs)
    └── drm_rocket_task[]     (executes sequentially on ONE core)
        ├── regcmd            DMA address of register-command buffer
        └── regcmd_count      number of 32-bit reg/value entries
```

Key kernel-side invariant from `rocket_job_hw_submit()`:

> "All tasks in the same job will be executed sequentially on the same
> core, to benefit from memory residency in SRAM."

Per-core SRAM is 2 MB. Bundling matmul tiles into one job buys weight
reuse across tasks for free — important Phase-2 perf knob.

The driver exposes a **single `/dev/accel/accel0` facade** that
schedules across all probed RKNN cores.
Three DT nodes → `rdev->num_cores = 3`, but userspace sees one device.

## What "regcmd" actually is

The kernel's `rocket_job_hw_submit()` does **not** know what op runs.
For each task it:

1. Writes the task's `regcmd` DMA address to `PC_BASE_ADDRESS`.
2. Writes `PC_REGISTER_AMOUNTS = (regcmd_count + 1) / 2 - 1`.
3. Configures S_POINTER ping-pong (CNA + Core, with a per-core "extra
   bit" `0x10000000 * core_index` lifted from BSP rknpu — undocumented
   in TRM, kernel comments call it out).
4. Enables DPU_0/DPU_1 interrupts.
5. Pulls the trigger: `PC_OPERATION_ENABLE.OP_EN = 1`.

The NPU's Program Controller then plays back the regcmd buffer — **a
sequence of {register address, register value} pairs the userspace
emitted**. The kernel is fundamentally a power/IOMMU/scheduler shim.

**All op coverage lives in userspace regcmd construction.** That's why
"what ops does rocket support?" is a Mesa question, not a kernel
question.

## NPU sub-blocks (from PC_INTERRUPT_MASK)

The interrupt-mask register names the discrete blocks the PC drives:

| Block | Channels | Likely role |
|---|---|---|
| **CNA** — Convolution Neural-net Accelerator | FEATURE_0/1, WEIGHT_0/1, CSC_0/1 | Loads features + weights, color-space convert (legacy ISP path) |
| **CORE** | 0/1 | MAC arrays — INT8 multiply-accumulate datapath |
| **DPU** — Data Processing Unit | 0/1 | Per-channel post-process (likely accumulation / type convert) |
| **PPU** — Post-Processing Unit | 0/1 | Activation / pooling / requantization |
| **PC** — Program Controller | — | Reads regcmd, sequences the pipeline |
| **DMA** | read/write error flags | Pulls features+weights, writes outputs through IOMMU |

Each per-core block has two parallel channels — the same hardware the
RKNPU2 vendor stack relies on for double-buffering input feature maps
and weights to keep the MAC arrays fed.

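
The two pieces of arithmetic in the per-task sequence listed under "What "regcmd" actually is" can be sketched in C. This is a minimal illustration, not the kernel's code: the helper names `pc_register_amounts` and `s_pointer_extra_bit` are hypothetical, and the `core_index < 3` bound is an assumption taken from the three-core configuration described above. With C integer division, the first formula amounts to `ceil(regcmd_count / 2) - 1`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value written to PC_REGISTER_AMOUNTS for one
 * task, using the formula quoted in this document:
 *     (regcmd_count + 1) / 2 - 1
 * Integer division truncates, so an odd count is rounded up to the next
 * pair before the final "- 1". */
static inline uint32_t pc_register_amounts(uint32_t regcmd_count)
{
    assert(regcmd_count > 0); /* an empty regcmd buffer would underflow */
    return (regcmd_count + 1) / 2 - 1;
}

/* Hypothetical helper: the per-core "extra bit" added to the S_POINTER
 * configuration, 0x10000000 * core_index. The bound of 3 cores is an
 * assumption based on the three DT nodes mentioned in this document. */
static inline uint32_t s_pointer_extra_bit(uint32_t core_index)
{
    assert(core_index < 3);
    return UINT32_C(0x10000000) * core_index;
}
```

For example, a task with 130 regcmd entries would yield `pc_register_amounts(130) == 64`, and core 2 would get an extra bit of `0x20000000`.
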
## Matmul on conv-shaped silicon — confirmed conv-only path

Skimming Mesa's `src/gallium/drivers/rocket/rkt_regcmd.c` (only ~700
lines, single conv emit path) **confirms**: the rocket hardware has no
matmul-shaped instruction. Every register in the emit path is named
`CNA_CONV_*`, `CNA_DATA_*`, `CNA_WEIGHT_*` — and the function that
builds a task's regcmd buffer is literally called `fill_first_regcmd`
with no alternate path for matmul. **Any matmul we run on this silicon
goes through the conv pipeline.**

### Regcmd encoding (from `rkt_regcmd.c::emit_raw`)

The userspace regcmd buffer is an array of **64-bit packed entries**:

```
packed = (target << 48) | (value << 16) | reg
```

- `target` — block destination ID (CNA, CORE, DPU, PC, …) + a `+0x1`
  arming bit. `rkt_get_target(reg)` returns the block per register.
- `reg` — 16-bit register offset within the block.
- `value` — 32-bit register write value.

Each task's regcmd ends with the TRM-mandated trigger sequence:

```
0x0041000000000000  /* TRM: must precede op_en */
emit_raw(regs, 0x81, REG_PC_OPERATION_ENABLE,
         PC_OPERATION_ENABLE_RESERVED_0(14) | PC_OPERATION_ENABLE_OP_EN(1));
```

That trailing PC.OPERATION_ENABLE write is what makes the NPU pipeline
start consuming the regcmd above it.

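
The packed layout described in the regcmd encoding section can be sketched as a pack/unpack pair in C. This is an illustration of the bit layout only, not a copy of Mesa's `emit_raw`: the function names are hypothetical, and the register offset `0x0008` used in the usage note is a made-up value, not a real rocket register.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one regcmd entry using the layout stated in this document:
 *   bits 63..48  target (block destination ID plus arming bit)
 *   bits 47..16  value  (32-bit register write value)
 *   bits 15..0   reg    (16-bit register offset within the block)
 * The casts to uint64_t happen before the shifts so no bits are lost. */
static inline uint64_t regcmd_pack(uint16_t target, uint32_t value, uint16_t reg)
{
    return ((uint64_t)target << 48) | ((uint64_t)value << 16) | (uint64_t)reg;
}

/* Field extractors: the inverse of regcmd_pack(). */
static inline uint16_t regcmd_target(uint64_t packed)
{
    return (uint16_t)(packed >> 48);
}

static inline uint32_t regcmd_value(uint64_t packed)
{
    return (uint32_t)(packed >> 16);
}

static inline uint16_t regcmd_reg(uint64_t packed)
{
    return (uint16_t)packed;
}
```

Under this layout, `regcmd_pack(0x81, 0x12345678, 0x0008)` gives `0x0081123456780008`, and the literal `0x0041000000000000` from the trigger sequence decodes as target `0x41`, value `0`, register offset `0`.
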
### Matmul-as-conv-1×1 — concrete mapping

For an INT8 matmul `Y[M, N] = X[M, K] @ W[K, N]` (the shape attention
and FFN projections take):

| Conv axis | Matmul axis | Concrete |
|---|---|---|
| `DATAIN_WIDTH` | 1 | `CNA_DATA_SIZE0.DATAIN_WIDTH = 1` |
| `DATAIN_HEIGHT` | M (rows) | `CNA_DATA_SIZE0.DATAIN_HEIGHT = M` |
| `DATAIN_CHANNEL` | K (input dim) | `CNA_DATA_SIZE1.DATAIN_CHANNEL = K`, `DATAIN_CHANNEL_REAL = K-1` |
| `WEIGHT_WIDTH/HEIGHT` | 1, 1 | `CNA_WEIGHT_SIZE2.{WIDTH, HEIGHT} = 1` |
| `WEIGHT_KERNELS` | N (output dim) | `CNA_WEIGHT_SIZE2.WEIGHT_KERNELS = N` |
| `CONV_X/Y_STRIDE` | 1, 1 | `CNA_CONV_CON3.{CONV_X_STRIDE, CONV_Y_STRIDE} = 1` |
| `PAD_*` | 0, 0, 0, 0 | `CNA_PAD_CON0.{PAD_LEFT, PAD_TOP} = 0` |
| `DATAOUT_WIDTH` | 1 | `CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH = 0` (encoded as W-1) |
| `DATAOUT_HEIGHT` | M | `CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT = M-1` |
| `DATAOUT_CHANNEL` | N | `CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL = N-1` |

This is exactly what the RKNPU2 vendor stack does — its public op list
has `Conv2D` and not `MatMul`. We're aligned with the silicon's intended
op surface, just from a clean uAPI.

### Quantization handoff

`rkt_regcmd.c` has a clean (non-add-tensor) requantization path:

```
conv_scale = (input_scale * weights_scale) / output_scale;
/* scale_bits is the IEEE-754 bit pattern of conv_scale (float read as uint32) */
shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
scale = ((scale_bits >> 9) & 0x7fff) + 1;
```

Mirrors PyTorch QNNPACK requantization.

For our **Q4_K_M weights** the dequant-to-INT8 step is on CPU (NEON):
walk the 32-element groups in the block, scale per-group, write an INT8
weight tile of `K×N`.

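
The requantization arithmetic quoted in this section can be made concrete with a small C sketch. It is an illustration under stated assumptions, not Mesa's code: `struct requant`, `float_bits` and `conv_requant` are hypothetical names, and `float_bits` stands in for whatever float-to-bits helper the driver actually uses. The sketch only reproduces the three formulas as written above; how the hardware consumes `scale` and `shift` is not covered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the two values derived from conv_scale. */
struct requant {
    uint32_t scale;
    uint32_t shift;
};

/* Reinterpret a float as its IEEE-754 bit pattern (memcpy avoids
 * strict-aliasing problems). Assumed stand-in for the driver's helper. */
static inline uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Apply the formulas quoted in this document:
 *   conv_scale = (input_scale * weights_scale) / output_scale
 *   shift      = 127 + 31 - 32 - (scale_bits >> 23) + 16
 *   scale      = ((scale_bits >> 9) & 0x7fff) + 1
 * All three scales are assumed positive, so the sign bit is zero and
 * (scale_bits >> 23) is just the biased exponent. */
static inline struct requant conv_requant(float input_scale,
                                          float weights_scale,
                                          float output_scale)
{
    float conv_scale = (input_scale * weights_scale) / output_scale;
    assert(conv_scale > 0.0f);

    uint32_t scale_bits = float_bits(conv_scale);
    struct requant r;
    r.shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
    r.scale = ((scale_bits >> 9) & 0x7fff) + 1;
    return r;
}
```

As a worked case, `conv_requant(0.75f, 1.0f, 1.0f)` has `scale_bits = 0x3F400000`, a biased exponent of 126, and so `shift = 16` and `scale = 0x2001`.
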
NPU gets:

- INT8 activation tensor `X` with per-tensor `input_scale`,
  `input_zero_point`
- INT8 weight tile `W` with per-tensor `weights_scale` (after we fold
  per-group into per-channel — the Q4_K_M scaling is finer-grained than
  what this conv requantization expects)
- One `output_scale`, `output_zero_point` per matmul

The Q4_K_M → conv-quant fold is the **non-trivial CPU work** that sits
between Q4_K_M GGUF and a regcmd. We need to choose between:

- (a) per-output-channel scales (NPU-friendly, lossy vs Q4_K_M)
- (b) per-block scales applied at runtime via multiple smaller matmul
  submits with different `output_scale` each (CPU overhead per submit
  but mathematically faithful to Q4_K_M)

Phase-3 question. (a) is likely good enough for "credible" tok/s; (b)
preserves Q4_K_M accuracy but tanks throughput.

### What rocket today does NOT exercise (matmul-relevant blocks)

The Mesa emit path uses **CNA + CORE + DPU**. It does **not** touch:

- **PPU** — Post-Processing Unit. Per `rocket_registers.h` interrupt
  flags, PPU has 2 channels. Likely activation / pooling. For pure
  matmul we don't need it.
- **CSC** — color-space conversion submodule of CNA. Vestigial for ML;
  bypassed in non-image paths.

That's fine — we won't need to wake those for transformer matmul.

What we need to learn from Mesa's regcmd encoding (Phase-2 follow-up):

1. **Max tile shape per CNA invocation** — what (K × N) tile does the
   hardware natively prefer? Pulls from `rocket_registers.h` CNA_*
   geometry registers.
2. **INT8 quantization parameter layout** — per-channel scales live
   where in the regcmd / weight blob?
3. **Output type** — INT32 accumulator → INT8 quant? FP16 output?
   llama.cpp's ggml graph needs to know what comes back to NEON.
4. **CSC block usage** — color-space conversion is meaningless for LLM,
   but the FEATURE pipeline may force going through it. If so, it's an
   identity transform we configure once.
5. **Per-core split for transformer attention** — three cores × two
   channels each = 6 parallel matmul lanes. Worth the orchestration
   complexity only if one core can't saturate the path on its own.

## Op coverage gap inventory

Looking at what a transformer decode step needs vs. what the rocket
pipeline can plausibly do today (per-op confidence in parens):

| llama.cpp op | NPU target | Confidence |
|---|---|---|
| INT8 matmul (Q/K/V projections, FFN) | CNA conv-1×1 | **medium-high** — Mesa already does conv in INT8 |
| INT8 matmul with INT4 weights (Q4_K_M) | dequant on CPU, INT8 matmul on NPU | medium — Q4_K_M's per-group scales add a dequant pass; first cut keeps it on NEON |
| RoPE (rotary position embeddings) | CPU (NEON) | n/a — wrong op shape for CNA |
| Softmax (attention) | CPU (NEON) | n/a — needs exp/sum, no direct support |
| RMSNorm | CPU (NEON) | n/a — sqrt + element-wise mul |
| Embedding lookup | CPU | n/a — gather, no NPU support |
| Sampling (top-k, top-p, multinomial) | CPU | n/a |
| Residual add + activation | DPU/PPU? unknown | low — needs Mesa-side check |

The Phase-4 baseline run will tell us how much wall-time matmul actually
owns; if it's >70% on CPU baseline, then offloading just matmul (the
medium-high row) gets us most of the speedup.

## Open questions — kicked to Phase-3

1. **regcmd format spec.** Pull `registers.xml` from Mesa and trace one
   conv invocation end-to-end in `src/gallium/drivers/rocket/`.
2. **Memory budget per task.** 2 MB SRAM per core × 3 cores = 6 MB
   working set. qwen-1.5B Q4_K_M has ~750 MB of weights (Q4 → ~375 MB
   actual). Need to chunk by FFN-hidden / attention-head and stream.
3. **DMA address space layout.** Per-fd IOMMU domain (per
   `rocket_drv.c::rocket_open`), so we can keep weights resident across
   submits within one llama.cpp process. Cross-process sharing would
   need dmabuf import.
4. **Power-domain bring-up time.** `pm_runtime_get_sync(core->dev)` per
   job (see `rocket_job_run`) — if the autosuspend timeout fires between
   decode steps, we pay the wake cost each token. Worth measuring before
   fighting it.
5. **Job-timeout = 500 ms hard-coded** in
   `rocket_job.c::JOB_TIMEOUT_MS`. Long matmul tiles must stay under
   that. Cap tile size accordingly or split into smaller tasks.

## Recommendation for Phase-3 entry

When Phase-2 closes, Phase-3 ("Analyze" — read RKNPU2 SDK + Tomeu's uAPI
+ BSP rockchip-npu for the spec) becomes mostly:

- Read `src/gallium/drivers/rocket/` end-to-end in Mesa main.
- Pick one conv invocation from a known model (Teflon MobileNetV1 first
  layer is the simplest), trace it down to the regcmd buffer bytes, and
  document the field layout for matmul reconstruction.
- Read the BSP `drivers/rknpu/` for any quant-scale plumbing the Mesa
  side doesn't expose (especially for INT4 → INT8 dequant alignment).
- The vendor-blob `librknnrt.so` is **off-limits** (closed license); do
  not link, do not extract symbols.