marfrit 24adc74812 Rosenblatt: project scaffold for RK3588 NPU on mainline
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network.  This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.

Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+).  Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.

Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.

Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.

Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00


Architecture — CPU + NPU mix for llama.cpp on RK3588

The split

llama.cpp's compute graph is built around ggml ops. We don't replace llama.cpp's whole engine — we register a new device backend (in ggml's ggml-backend abstraction) named rknpu and selectively offload the ops that are worth the round-trip cost.

Goes to NPU (heavy, dense, INT8-friendly)

  • MUL_MAT (matrix-matrix multiply): the workhorse, and the op that dominates wall time. The attention products (Q @ K^T, attn @ V) and the FFN projections (up_proj, down_proj, gate_proj) all have this shape.
  • MUL_MAT_ID (MoE-style mixture matmul) — when we eventually try a mixture-of-experts model. Phase-1+ scope.

Stays on CPU (small, op-specific, or per-token)

  • Embedding lookup (GET_ROWS) — random-access gather, NPU has no fast path
  • RMS_NORM / LAYER_NORM — per-token reduction + element-wise
  • ROPE — small, per-head, lots of trig
  • SOFT_MAX — small, per-head
  • Activations (SILU, GELU) — element-wise, cheap on NEON
  • SCALE, ADD, MUL (element-wise) — cheap on NEON
  • Sampling, KV cache update, tokenization — entirely host

KV cache: open question

Two options:

  1. CPU-resident: lives in normal Linux memory; NPU pulls activations from CPU and pushes results back per layer.
  2. NPU-resident: allocated in dmabuf, NPU reads K/V across layers without round-trips. Cheaper per-layer but constrains model size to NPU-accessible memory.

Phase-5 (Plan) picks between the two based on the Phase-3 (Analyze) findings on the DMA cost.
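
To put a number on what either option has to hold, here is a minimal sizing sketch. The formula (K and V per layer, per KV head, per token) is standard; the model dimensions used in the usage note are assumptions taken from the published Qwen2.5-1.5B configuration and should be checked against the GGUF metadata. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of KV cache needed to hold n_ctx tokens.
 * The leading factor 2 covers the separate K and V tensors of each layer. */
uint64_t kv_cache_bytes(uint32_t n_layers, uint32_t n_kv_heads,
                        uint32_t head_dim, uint32_t bytes_per_elem,
                        uint32_t n_ctx)
{
    assert(bytes_per_elem > 0);
    return 2ull * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx;
}
```

Assuming 28 layers, 2 KV heads, head dimension 128 and an FP16 cache, kv_cache_bytes(28, 2, 128, 2, 4096) comes to 112 MiB (28 KiB per token). If those assumptions hold, cache size alone is unlikely to decide between the two options; the per-layer DMA cost is the more interesting measurement.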


Memory mapping

ggml's ggml_backend_buffer_t abstracts the buffer pool. We implement:

  • alloc_buffer(size) → allocate a dmabuf of size bytes
  • free_buffer(buffer) → release dmabuf
  • set_tensor / get_tensor → CPU ↔ NPU memcpy (set_tensor copies host to device, get_tensor copies device to host)
  • cpy_tensor → device-internal copy
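
The four hooks above can be sketched as follows. This is a stand-in, not the real backend: plain heap memory plays the role of the dmabuf, and the type and function names (rknpu_buffer, rknpu_alloc_buffer and so on) are hypothetical rather than ggml's actual callback signatures. The point is the direction of each copy and the bounds checking.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a dmabuf-backed buffer. Here `base` is heap
 * memory; in a real backend it would be the CPU mmap of the dmabuf. */
typedef struct {
    uint8_t *base;
    size_t   size;
} rknpu_buffer;

/* Shared bounds check: does [offset, offset + n) fit inside the buffer? */
static int in_range(const rknpu_buffer *buf, size_t offset, size_t n)
{
    return offset <= buf->size && n <= buf->size - offset;
}

/* alloc_buffer: reserve `size` zeroed bytes of (simulated) device memory. */
rknpu_buffer *rknpu_alloc_buffer(size_t size)
{
    rknpu_buffer *buf = malloc(sizeof *buf);
    if (!buf) return NULL;
    buf->base = calloc(1, size);
    if (!buf->base) { free(buf); return NULL; }
    buf->size = size;
    return buf;
}

/* free_buffer: release the backing memory. */
void rknpu_free_buffer(rknpu_buffer *buf)
{
    if (!buf) return;
    free(buf->base);
    free(buf);
}

/* set_tensor: host -> device. Returns 0 on success, -1 if out of range. */
int rknpu_set_tensor(rknpu_buffer *buf, size_t offset, const void *src, size_t n)
{
    assert(buf != NULL && src != NULL);
    if (!in_range(buf, offset, n)) return -1;
    memcpy(buf->base + offset, src, n);
    return 0;
}

/* get_tensor: device -> host. Returns 0 on success, -1 if out of range. */
int rknpu_get_tensor(const rknpu_buffer *buf, size_t offset, void *dst, size_t n)
{
    assert(buf != NULL && dst != NULL);
    if (!in_range(buf, offset, n)) return -1;
    memcpy(dst, buf->base + offset, n);
    return 0;
}

/* cpy_tensor: device-internal copy; memmove tolerates dst == src. */
int rknpu_cpy_tensor(rknpu_buffer *dst, size_t dst_off,
                     const rknpu_buffer *src, size_t src_off, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (!in_range(dst, dst_off, n) || !in_range(src, src_off, n)) return -1;
    memmove(dst->base + dst_off, src->base + src_off, n);
    return 0;
}
```

With a real dmabuf the set/get paths may need cache maintenance or sync calls around the memcpy; that depends on the uAPI and is not modelled here.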

The dmabuf approach lets us share buffers between CPU producer (e.g. embedding lookup output) and NPU consumer (matmul) without an extra copy — mmap on CPU side, DMA-import on NPU side.

If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it doesn't, we go through /dev/dri/renderD* with a thin shim.


Quantization strategy

Q4_K_M is a common default quantization for ~2B-parameter GGUF models used with llama.cpp. It is a 4-bit weight quantization with per-group scale + min and no per-channel scale. The NPU expects INT8 (or INT16) tensors with per-channel scale factors, so the weights have to be converted before they can be offloaded.

Two paths:

  1. Dequantize on CPU per-layer: unpack Q4_K_M → INT8 right before the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
  2. Dequantize once at load time: unpack the entire weight tensor to INT8 + per-channel scales at model-load. Permanent ~1.6x memory cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight), but no per-layer CPU work.

Phase-1 choice: (2). It is straightforward, keeps weight unpacking out of the inference-time path, and is easier to profile. The memory cost on the 1.5B model is ~1.5 GB of INT8 weights vs ~900 MB for Q4_K_M; boltzmann has 32 GB, so memory isn't the constraint.
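
The load-time conversion in path (2) ends with a per-channel INT8 quantization step, sketched below. It assumes the Q4_K_M weights have already been unpacked to float by ggml's existing dequantization code, that one weight row equals one output channel, and that the NPU takes symmetric INT8 (no zero point); none of that is confirmed by the driver yet. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize one weight row (assumed: one output channel) to symmetric INT8.
 * Returns the per-channel scale, chosen so that w[i] ~= scale * q[i]. */
float quantize_row_int8(const float *w, int8_t *q, size_t n)
{
    assert(w != NULL && q != NULL);

    /* The largest magnitude in the row maps to +/-127. */
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > amax) amax = a;
    }

    /* All-zero row: any scale works; use 1.0 to avoid dividing by zero. */
    if (amax == 0.0f) {
        for (size_t i = 0; i < n; i++) q[i] = 0;
        return 1.0f;
    }

    const float scale = amax / 127.0f;
    for (size_t i = 0; i < n; i++) {
        float x = w[i] / scale;
        /* Round half away from zero, then clamp to the symmetric range. */
        int r = (int)(x + (x >= 0.0f ? 0.5f : -0.5f));
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}
```

Running this once per row at model load yields the INT8 tensor plus one float scale per output channel, which is the layout the text above says the NPU expects. Whether the resulting quality loss matters on small models is the open question noted for Phase-2+.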

Phase-2+: revisit (1) if we go for larger models or if INT8 turns out to be quality-loss-meaningful on the small ones.


Backend interface — concrete

Mirroring ggml's existing ggml-cuda.cu / ggml-metal.m shape, we add:

ggml-rknpu.h       — public API: ggml_backend_rknpu_init() etc.
ggml-rknpu.c       — backend implementation: device registration, op
                     dispatch table, memory management
ggml-rknpu-ops.c   — per-op kernels: matmul tiled to NPU's preferred
                     shape, INT8 quant pre-pass

These files live in llama.cpp/ggml/src/ggml-rknpu/. The backend stays out-of-tree initially; if it is upstream-acceptable, we send it for review after Phase 8.


Failure handling

llama.cpp's backend abstraction already supports falling back to CPU on a per-op basis via ggml_backend_dev_supports_op(). We declare MUL_MAT supported for INT8 / FP16 inputs with the right shape constraints, and let the framework route everything else to CPU.

If the NPU driver returns an error mid-inference (timeout, DMA fence wait failure, etc.), the strategy is to abort the inference, log the error, and return it to the caller. We don't try to silently fall back to CPU mid-stream because the state may already be corrupted (the NPU may have partially written to the dmabuf).
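
One way to enforce the abort policy is a sticky failure flag on the inference session, sketched below. The session struct, the submit callback type and the fake submit functions are all hypothetical; the sticky behaviour itself is an assumption about how "abort" could be implemented, not something the driver provides.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical submit hook: returns 0 on success, negative errno on failure. */
typedef int (*npu_submit_fn)(void *job);

typedef struct {
    npu_submit_fn submit;
    bool          failed;    /* sticky: set on the first driver error */
    int           last_err;
} npu_session;

/* Run one job. After the first failure every later call returns the same
 * error without touching the device, so a half-written dmabuf is never
 * consumed and no silent CPU fallback can happen. */
int npu_session_run(npu_session *s, void *job)
{
    assert(s != NULL && s->submit != NULL);
    if (s->failed) return s->last_err;

    int rc = s->submit(job);
    if (rc < 0) {
        s->failed = true;
        s->last_err = rc;
        fprintf(stderr, "rknpu: job failed (rc=%d), aborting inference\n", rc);
    }
    return rc;
}

/* Test doubles standing in for the real uAPI submit call. */
int fake_submit_ok(void *job)      { (void)job; return 0; }
int fake_submit_timeout(void *job) { (void)job; return -ETIMEDOUT; }
```

The caller sees the first error code on every subsequent call and has to tear the session down and start a fresh one to recover.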


Phase-1 milestone

An npu-probe userspace binary that:

  1. Opens the NPU device (whatever the mainline path is — likely /dev/accel/accelN or /dev/dri/renderD*)
  2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
  3. Submits a matmul via the uAPI
  4. Waits, reads back, compares to a CPU reference
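
The CPU reference for step 4 needs no driver, so it can be written before the uAPI is settled. The sketch below assumes row-major layout, INT8 inputs and INT32 accumulation and output; the NPU's actual output type and layout are not yet known, so the comparison may need a conversion step. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference matmul: C[m x n] = A[m x k] @ B[k x n].
 * Row-major, INT8 inputs, INT32 accumulation (assumed NPU-compatible). */
void matmul_i8_ref(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Number of elements where the NPU result differs from the reference.
 * Integer matmul should match exactly, so any nonzero count is a failure. */
size_t count_mismatches(const int32_t *got, const int32_t *want, size_t len)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; i++)
        if (got[i] != want[i]) bad++;
    return bad;
}
```

For the 64x64 case, npu-probe would fill A and B with a fixed pseudo-random pattern, run matmul_i8_ref, and require count_mismatches to return 0 against the buffer read back from the NPU.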

This proves the substrate is alive before we touch llama.cpp. If it doesn't work, we're back in kernel-driver land, not llama.cpp land.