marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory

Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.

2026-05-19 12:41:31 +00:00

12 KiB


Op coverage — what the rocket NPU pipeline actually does

Phase-2 deliverable. Captures the rocket uAPI surface, the NPU's sub-block topology, and the implication for how llama.cpp's matmul can be expressed on this silicon. Read alongside npu-mainline-status.md.

Reference sources (all in-tree as of mainline 2026-05-19):

include/uapi/drm/rocket_accel.h — userspace ABI
drivers/accel/rocket/rocket_drv.c — module init, of-match, ioctl table
drivers/accel/rocket/rocket_job.c — rocket_job_hw_submit() is the authoritative description of the per-task hardware sequence
drivers/accel/rocket/rocket_registers.h — autogenerated from Mesa's src/gallium/drivers/rocket/registers.xml (60 KB; the regcmd format is owned by Mesa, not the kernel)
Documentation/accel/rocket/index.rst — one-paragraph driver summary

uAPI: four ioctls, one device per fd

| Ioctl | Direction | Purpose |
|---|---|---|
| `DRM_ROCKET_CREATE_BO` | IOWR | Allocate a GEM buffer, get NPU DMA address + mmap offset |
| `DRM_ROCKET_PREP_BO` | IOW | Take CPU ownership; wait for NPU fences; cache-invalidate |
| `DRM_ROCKET_FINI_BO` | IOW | Release CPU ownership; cache-flush for NPU |
| `DRM_ROCKET_SUBMIT` | IOW | Submit an array of jobs (each = array of tasks) |

Submission unit hierarchy:

drm_rocket_submit
└── drm_rocket_job[]                       (each tied to specific in/out BOs)
    └── drm_rocket_task[]                  (executes sequentially on ONE core)
        ├── regcmd        DMA address of register-command buffer
        └── regcmd_count  number of commands (packed register writes) in that buffer

Key kernel-side invariant from rocket_job_hw_submit():

"All tasks in the same job will be executed sequentially on the same core, to benefit from memory residency in SRAM."

Per-core SRAM is 2 MB. Bundling matmul tiles into one job lets consecutive tasks reuse weights that are already resident in SRAM at no extra cost, which makes job composition an important Phase-2 perf knob.

The driver exposes a single /dev/accel/accel0 facade that schedules across all probed RKNN cores. Three DT nodes → rdev->num_cores = 3, but userspace sees one device.

What "regcmd" actually is

The kernel's rocket_job_hw_submit() does not know what op runs. For each task it:

Writes the task's regcmd DMA address to PC_BASE_ADDRESS.
Writes PC_REGISTER_AMOUNTS = (regcmd_count + 1) / 2 - 1.
Configures S_POINTER ping-pong (CNA + Core, with a per-core "extra bit" 0x10000000 * core_index lifted from BSP rknpu — undocumented in TRM, kernel comments call it out).
Enables DPU_0/DPU_1 interrupts.
Pulls the trigger: PC_OPERATION_ENABLE.OP_EN = 1.

The NPU's Program Controller then plays back the regcmd buffer — a sequence of {register address, register value} pairs the userspace emitted. The kernel is fundamentally a power/IOMMU/scheduler shim. All op coverage lives in userspace regcmd construction. That's why "what ops does rocket support?" is a Mesa question, not a kernel question.
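The two pieces of arithmetic in the steps above are easy to get wrong by one, so here is a minimal sketch of them in isolation. The helper names are hypothetical (they are not kernel symbols); only the expressions `(regcmd_count + 1) / 2 - 1` and `0x10000000 * core_index` come from these notes.

```c
#include <stdint.h>

/* Per-core "extra bit" stride quoted in these notes (lifted from BSP rknpu). */
#define ROCKET_CORE_EXTRA_BIT 0x10000000u

/* Value written to PC_REGISTER_AMOUNTS: (regcmd_count + 1) / 2 - 1.
 * Assumes regcmd_count >= 1; a count of zero would wrap around. */
static uint32_t pc_register_amounts(uint32_t regcmd_count)
{
    return (regcmd_count + 1u) / 2u - 1u;
}

/* Extra S_POINTER bits for a given core: 0x10000000 * core_index. */
static uint32_t s_pointer_extra_bits(uint32_t core_index)
{
    return ROCKET_CORE_EXTRA_BIT * core_index;
}
```

The rounding-up division means an odd and the next even command count produce the same register value, e.g. 3 and 4 commands both give 1.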

NPU sub-blocks (from PC_INTERRUPT_MASK)

The interrupt-mask register names the discrete blocks the PC drives:

| Block | Channels | Likely role |
|---|---|---|
| CNA (Convolution Neural-net Accelerator) | FEATURE_0/1, WEIGHT_0/1, CSC_0/1 | Loads features + weights, color-space convert (legacy ISP path) |
| CORE | 0/1 | MAC arrays: INT8 multiply-accumulate datapath |
| DPU (Data Processing Unit) | 0/1 | Per-channel post-process (likely accumulation / type convert) |
| PPU (Post-Processing Unit) | 0/1 | Activation / pooling / requantization |
| PC (Program Controller) | n/a | Reads regcmd, sequences the pipeline |
| DMA | read/write error flags | Pulls features+weights, writes outputs through IOMMU |

Each per-core block has two parallel channels. This is the same hardware the RKNPU2 vendor stack relies on for double-buffering input feature maps and weights to keep the MAC arrays fed.

Matmul on conv-shaped silicon — confirmed conv-only path

Skimming Mesa's src/gallium/drivers/rocket/rkt_regcmd.c (only ~700 lines, a single conv emit path) shows no matmul-shaped operation anywhere in what Mesa drives today. Every register in the emit path is named CNA_CONV_*, CNA_DATA_*, or CNA_WEIGHT_*, and the function that builds a task's regcmd buffer is called fill_first_regcmd, with no alternate path for matmul. Any matmul we run on this silicon through rocket therefore goes through the conv pipeline.

Regcmd encoding (from `rkt_regcmd.c::emit_raw`)

The userspace regcmd buffer is an array of 64-bit packed entries:

packed = (target << 48) | (value << 16) | reg

target — block destination ID (CNA, CORE, DPU, PC, …) + a +0x1 arming bit. rkt_get_target(reg) returns the block per register.
reg — 16-bit register offset within the block.
value — 32-bit register write value.
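A minimal sketch of that packing, plus the inverse for reading a dumped buffer back. The function names are hypothetical; only the bit layout is taken from the expression above. The casts matter: shifting a 32-bit `target` left by 48 without widening first is undefined behaviour in C.

```c
#include <stdint.h>

/* Pack one regcmd entry:
 * target in bits 63..48, value in bits 47..16, reg in bits 15..0. */
static uint64_t regcmd_pack(uint16_t target, uint16_t reg, uint32_t value)
{
    return ((uint64_t)target << 48) | ((uint64_t)value << 16) | (uint64_t)reg;
}

/* Field extractors, useful when decoding a captured regcmd buffer. */
static uint16_t regcmd_target(uint64_t entry) { return (uint16_t)(entry >> 48); }
static uint32_t regcmd_value(uint64_t entry)  { return (uint32_t)(entry >> 16); }
static uint16_t regcmd_reg(uint64_t entry)    { return (uint16_t)entry; }
```

With made-up inputs, `regcmd_pack(0x0201, 0x1010, 0xdeadbeef)` yields `0x0201deadbeef1010`, which makes the three fields easy to spot in a hex dump.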

Each task's regcmd ends with the TRM-mandated trigger sequence:

0x0041000000000000   /* TRM: must precede op_en */
emit_raw(regs, 0x81, REG_PC_OPERATION_ENABLE,
         PC_OPERATION_ENABLE_RESERVED_0(14) | PC_OPERATION_ENABLE_OP_EN(1));

That trailing PC.OPERATION_ENABLE write is what makes the NPU pipeline start consuming the regcmd above it.

Matmul-as-conv-1×1 — concrete mapping

For an INT8 matmul Y[M, N] = X[M, K] @ W[K, N] (the shape attention and FFN projections take):

| Conv axis | Matmul axis | Concrete |
|---|---|---|
| `DATAIN_WIDTH` | 1 | `CNA_DATA_SIZE0.DATAIN_WIDTH = 1` |
| `DATAIN_HEIGHT` | M (rows) | `CNA_DATA_SIZE0.DATAIN_HEIGHT = M` |
| `DATAIN_CHANNEL` | K (input dim) | `CNA_DATA_SIZE1.DATAIN_CHANNEL = K`, `DATAIN_CHANNEL_REAL = K-1` |
| `WEIGHT_WIDTH/HEIGHT` | 1, 1 | `CNA_WEIGHT_SIZE2.{WIDTH, HEIGHT} = 1` |
| `WEIGHT_KERNELS` | N (output dim) | `CNA_WEIGHT_SIZE2.WEIGHT_KERNELS = N` |
| `CONV_X/Y_STRIDE` | 1, 1 | `CNA_CONV_CON3.{CONV_X_STRIDE, CONV_Y_STRIDE} = 1` |
| `PAD_*` | 0, 0, 0, 0 | `CNA_PAD_CON0.{PAD_LEFT, PAD_TOP} = 0` |
| `DATAOUT_WIDTH` | 1 | `CORE_DATAOUT_SIZE_0.DATAOUT_WIDTH = 0` (encoded as W-1) |
| `DATAOUT_HEIGHT` | M | `CORE_DATAOUT_SIZE_0.DATAOUT_HEIGHT = M-1` |
| `DATAOUT_CHANNEL` | N | `CORE_DATAOUT_SIZE_1.DATAOUT_CHANNEL = N-1` |

This follows the RKNPU2 BSP precedent of shoehorning matmul through conv-1×1. We stay aligned with the silicon's intended op surface, just from a clean uAPI.
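To make the mapping mechanical, the table can be written down as a function from (M, K, N) to field values. This is only a sketch: the struct is a hypothetical container whose member names echo the register fields, not a Mesa or kernel type, and the padding fields are left out because they are all zero.

```c
#include <stdint.h>

/* Hypothetical container; members mirror the register fields in the table. */
struct conv1x1_geometry {
    uint32_t datain_width, datain_height;
    uint32_t datain_channel, datain_channel_real;  /* _real is stored as K - 1 */
    uint32_t weight_width, weight_height, weight_kernels;
    uint32_t conv_x_stride, conv_y_stride;
    uint32_t dataout_width, dataout_height, dataout_channel; /* stored as size - 1 */
};

/* Express Y[M, N] = X[M, K] @ W[K, N] as a 1x1 convolution.
 * Assumes m, k and n are all >= 1. */
static struct conv1x1_geometry matmul_as_conv1x1(uint32_t m, uint32_t k, uint32_t n)
{
    struct conv1x1_geometry g;

    g.datain_width = 1;
    g.datain_height = m;
    g.datain_channel = k;
    g.datain_channel_real = k - 1;
    g.weight_width = 1;
    g.weight_height = 1;
    g.weight_kernels = n;
    g.conv_x_stride = 1;
    g.conv_y_stride = 1;
    g.dataout_width = 0;        /* width 1, encoded as W - 1 */
    g.dataout_height = m - 1;
    g.dataout_channel = n - 1;
    return g;
}
```

Keeping the "minus one" encodings in one place avoids the usual mistake of applying them twice when the values are later emitted as register writes.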

Quantization handoff

rkt_regcmd.c has a clean (non-add-tensor) requantization path:

conv_scale = (input_scale * weights_scale) / output_scale;
/* scale_bits: the raw IEEE-754 bit pattern of the float conv_scale */
shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
scale = ((scale_bits >> 9) & 0x7fff) + 1;

Mirrors PyTorch QNNPACK requantization. For our Q4_K_M weights the dequant-to-INT8 step is on CPU (NEON): walk the 32-element groups in the block, scale per-group, write an INT8 weight tile of K×N. NPU gets:

INT8 activation tensor X with per-tensor input_scale, input_zero_point
INT8 weight tile W with per-tensor weights_scale (after we fold per-group into per-channel — the Q4_K_M scaling is finer-grained than what this conv requantization expects)
One output_scale, output_zero_point per matmul
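For experimenting with those three scales on the CPU side, the requantization expressions quoted above can be wrapped in a small standalone function. This is a sketch under assumptions: the struct and function names are hypothetical, `scale_bits` is taken to be the IEEE-754 bit pattern of `conv_scale`, and all scales are assumed positive so the sign bit does not leak into the exponent field.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical result pair; not a Mesa type. */
struct requant {
    uint32_t shift;
    uint32_t scale;
};

/* Transcribes the expressions from these notes. Assumes positive scales. */
static struct requant requant_from_scales(float input_scale, float weights_scale,
                                          float output_scale)
{
    float conv_scale = (input_scale * weights_scale) / output_scale;
    uint32_t scale_bits;
    struct requant r;

    /* Reinterpret the float's bits without violating strict aliasing. */
    memcpy(&scale_bits, &conv_scale, sizeof scale_bits);

    r.shift = 127 + 31 - 32 - (scale_bits >> 23) + 16;
    r.scale = ((scale_bits >> 9) & 0x7fff) + 1;
    return r;
}
```

As a sanity check on the exponent handling, a `conv_scale` of exactly 0.5 gives `shift = 16` and a `conv_scale` of exactly 1.0 gives `shift = 15`, so halving the scale costs one extra bit of right shift.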

The Q4_K_M → conv-quant fold is the non-trivial CPU work that sits between Q4_K_M GGUF and a regcmd: we need to choose between:

(a) per-output-channel scales (NPU-friendly, lossy vs Q4_K_M)
(b) per-block scales applied at runtime via multiple smaller matmul submits with different output_scale each (CPU overhead per submit but mathematically faithful to Q4_K_M)

Phase-3 question. (a) is likely good enough for "credible" tok/s; (b) preserves Q4_K_M accuracy but tanks throughput.
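Option (a) can be sketched as follows, assuming the Q4_K_M weights for one output channel have already been dequantized to a float column of K values. The function name is hypothetical, and symmetric quantization (zero point 0, range -127..127) is an assumption made here for simplicity, not something Mesa or llama.cpp prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Quantize one output channel of dequantized weights to INT8 with a single
 * symmetric scale. Returns the scale so that w[i] is roughly out[i] * scale. */
static float quantize_channel_int8(const float *w, size_t k, int8_t *out)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < k; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > max_abs)
            max_abs = a;
    }

    /* An all-zero channel gets scale 1.0 to avoid dividing by zero. */
    float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;

    for (size_t i = 0; i < k; i++) {
        float x = w[i] / scale;
        int q = (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);  /* round half away from zero */
        if (q > 127)
            q = 127;
        if (q < -127)
            q = -127;
        out[i] = (int8_t)q;
    }
    return scale;
}
```

The loss relative to Q4_K_M comes from the single `max_abs`: one large group in the column stretches the scale for every other group, which is exactly the accuracy question option (b) avoids.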

What rocket today does NOT exercise (matmul-relevant blocks)

The Mesa emit path uses CNA + CORE + DPU. It does not touch:

PPU — Post-Processing Unit. Per rocket_registers.h interrupt flags, PPU has 2 channels. Likely activation / pooling. For pure matmul we don't need it.
CSC — color-space conversion submodule of CNA. Vestigial for ML; bypassed in non-image paths.

That's fine — we won't need to wake those for transformer matmul.

What we need to learn from Mesa's regcmd encoding (Phase-2 follow-up):

Max tile shape per CNA invocation — what (K × N) tile does the hardware natively prefer? Pulls from rocket_registers.h CNA_* geometry registers.
INT8 quantization parameter layout — per-channel scales live where in the regcmd / weight blob?
Output type — INT32 accumulator → INT8 quant? FP16 output? llama.cpp's ggml graph needs to know what comes back to NEON.
CSC block usage — color-space conversion is meaningless for LLM, but the FEATURE pipeline may force going through it. If so, it's an identity transform we configure once.
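Per-core split for transformer attention: three cores × two channels each is nominally 6 matmul lanes, although the two channels per core look like ping-pong double-buffering rather than independent lanes, so the usable parallelism may be 3. Worth the orchestration complexity only if one core can't saturate the path on its own.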

Op coverage gap inventory

Looking at what a transformer decode step needs vs. what the rocket pipeline can plausibly do today (per-op confidence in parens):

| llama.cpp op | NPU target | Confidence |
|---|---|---|
| INT8 matmul (Q/K/V projections, FFN) | CNA conv-1×1 | medium-high: Mesa already does conv in INT8 |
| INT8 matmul with INT4 weights (Q4_K_M) | dequant on CPU, INT8 matmul on NPU | medium: Q4_K_M's per-group scales add a dequant pass; first cut keeps it on NEON |
| RoPE (rotary position embeddings) | CPU (NEON) | n/a: wrong op shape for CNA |
| Softmax (attention) | CPU (NEON) | n/a: needs exp/sum, no direct support |
| RMSNorm | CPU (NEON) | n/a: sqrt + element-wise mul |
| Embedding lookup | CPU | n/a: gather, no NPU support |
| Sampling (top-k, top-p, multinomial) | CPU | n/a |
| Residual add + activation | DPU/PPU? unknown | low: needs Mesa-side check |

The Phase-4 baseline run will tell us how much wall-time matmul actually owns; if it's >70% on CPU baseline, then offloading just matmul (the medium-high row) gets us most of the speedup.
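The 70% threshold is an Amdahl's-law argument, and a one-line helper makes it easy to plug in the measured fraction once the baseline exists. The function name is hypothetical, and the model ignores submit overhead and CPU/NPU overlap, so treat the result as an upper bound.

```c
/* Overall speedup when matmul_fraction of the wall time is accelerated by
 * matmul_speedup and everything else stays on the CPU unchanged. */
static double amdahl_speedup(double matmul_fraction, double matmul_speedup)
{
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup);
}
```

With a 70% matmul share, a 10× faster matmul gives about 2.7× overall, and even an infinitely fast NPU cannot exceed 1 / 0.3 ≈ 3.3×. The remaining CPU ops in the table set the ceiling.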

Open questions — kicked to Phase-3

regcmd format spec. Pull registers.xml from Mesa and trace one conv invocation end-to-end in src/gallium/drivers/rocket/.
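Memory budget per task. 2 MB SRAM per core × 3 cores = 6 MB working set. qwen-1.5B at 4 bits per weight is ~750 MB of weights (1.5 B parameters × 0.5 bytes); Q4_K_M's block scales and higher-precision tensors add to that, and the INT8 form the NPU consumes is roughly twice the 4-bit size. Need to chunk by FFN-hidden / attention-head and stream.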
DMA address space layout. Per-fd IOMMU domain (per rocket_drv.c::rocket_open), so we can keep weights resident across submits within one llama.cpp process. Cross-process sharing would need dmabuf import.
Power-domain bring-up time. pm_runtime_get_sync(core->dev) per job (see rocket_job_run) — if the autosuspend timeout fires between decode steps, we pay the wake cost each token. Worth measuring before fighting it.
Job-timeout = 500 ms hard-coded in rocket_job.c::JOB_TIMEOUT_MS. Long matmul tiles must stay under that. Cap tile size accordingly or split into smaller tasks.
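For the memory-budget question, a first-order tile bound can be computed before any hardware measurement. This sketch rests on assumptions that still need checking against Mesa: that activations, the weight tile, and the output tile must all sit in one core's SRAM at once, that every element is one byte (INT8 in and out), and that no space is reserved for double-buffering. The function name is hypothetical.

```c
#include <stdint.h>

/* Per-core SRAM figure used in these notes. */
#define ROCKET_SRAM_BYTES (2u * 1024u * 1024u)

/* Largest N such that X[M,K] + W[K,N] + Y[M,N], all INT8, fit in sram_bytes.
 * Returns 0 if the activations alone do not fit. */
static uint32_t max_tile_n(uint32_t m, uint32_t k, uint32_t sram_bytes)
{
    uint64_t act_bytes = (uint64_t)m * k;    /* X[M,K] */
    uint64_t bytes_per_n = (uint64_t)k + m;  /* one weight column plus one output column */

    if (bytes_per_n == 0 || act_bytes >= sram_bytes)
        return 0;
    return (uint32_t)((sram_bytes - act_bytes) / bytes_per_n);
}
```

For a single-token decode step (M = 1) with K = 1536 as an example, this gives a weight tile of at most 1363 output channels per task under those assumptions. The 500 ms job timeout is a separate limit that has to be checked against measured throughput.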

Recommendation for Phase-3 entry

When Phase-2 closes, Phase-3 ("Analyze" — read RKNPU2 SDK + Tomeu's uAPI + BSP rockchip-npu for the spec) becomes mostly:

Read src/gallium/drivers/rocket/ end-to-end in Mesa main.
Pick one conv invocation from a known model (Teflon MobileNetV1 first layer is the simplest), trace it down to the regcmd buffer bytes, and document the field layout for matmul reconstruction.
Read the BSP drivers/rknpu/ for any quant-scale plumbing the Mesa side doesn't expose (especially for INT4 → INT8 dequant alignment).
The vendor-blob librknnrt.so is off-limits (closed license); do not link, do not extract symbols.
