# Architecture: CPU + NPU mix for llama.cpp on RK3588

## The split

llama.cpp's compute graph is built around ggml ops. We don't replace llama.cpp's whole engine; we register a new **device backend** (in ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload the ops that are worth the round-trip cost.

### Goes to NPU (heavy, dense, INT8-friendly)

- `MUL_MAT` (matrix-matrix multiply): the workhorse, dominates wall time. The attention products `Q @ K^T` and `attn @ V` and the FFN projections `up_proj`, `down_proj`, and `gate_proj` all have this shape.
- `MUL_MAT_ID` (MoE-style mixture matmul): for when we eventually try a mixture-of-experts model. Phase-1+ scope.

### Stays on CPU (small, op-specific, or per-token)

- Embedding lookup (`GET_ROWS`): random-access gather, NPU has no fast path
- `RMS_NORM` / `LAYER_NORM`: per-token reduction + element-wise
- `ROPE`: small, per-head, lots of trig
- `SOFT_MAX`: small, per-head
- Activations (`SILU`, `GELU`): element-wise, cheap on NEON
- `SCALE`, `ADD`, `MUL` (element-wise): cheap on NEON
- Sampling, KV cache update, tokenization: entirely host

### KV cache: open question

Two options:

1. **CPU-resident:** lives in normal Linux memory; NPU pulls activations from CPU and pushes results back per layer.
2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers without round-trips. Cheaper per-layer but constrains model size to NPU-accessible memory.

Phase 5 (Plan) picks based on Phase 3 (Analyze) findings on the DMA cost.

---

## Memory mapping

ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement:

- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes
- `free_buffer(buffer)` → release dmabuf
- `set_tensor` / `get_tensor` → CPU ↔ NPU memcpy (host to device, and device back to host)
- `cpy_tensor` → device-internal copy

The dmabuf approach lets us share buffers between a CPU producer (e.g. embedding lookup output) and an NPU consumer (matmul) without an extra copy: `mmap` on the CPU side, DMA-import on the NPU side.

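
The buffer operations listed above can be sketched as follows. This is only an illustration of the shape of the interface, not the ggml buffer API: `rknpu_buffer` and every `rknpu_*` function name are hypothetical, and plain heap memory stands in for the dmabuf so the sketch compiles without any driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer handle. In a real backend `base` would be the
 * mmap'ed CPU view of a dmabuf; here calloc() stands in for it. */
typedef struct {
    unsigned char *base;
    size_t         size;
} rknpu_buffer;

/* alloc_buffer(size): 0 on success, -1 on failure or zero size. */
int rknpu_alloc_buffer(rknpu_buffer *buf, size_t size) {
    assert(buf != NULL);
    buf->base = size ? calloc(size, 1) : NULL;
    buf->size = buf->base ? size : 0;
    return buf->base ? 0 : -1;
}

/* free_buffer(buffer): release the backing memory and reset the handle. */
void rknpu_free_buffer(rknpu_buffer *buf) {
    assert(buf != NULL);
    free(buf->base);
    buf->base = NULL;
    buf->size = 0;
}

/* Bounds check shared by all copies; written to avoid size_t overflow. */
static int in_range(const rknpu_buffer *buf, size_t off, size_t n) {
    return off <= buf->size && n <= buf->size - off;
}

/* set_tensor: host -> device copy. */
int rknpu_set_tensor(rknpu_buffer *buf, size_t off, const void *src, size_t n) {
    if (!in_range(buf, off, n)) return -1;
    memcpy(buf->base + off, src, n);
    return 0;
}

/* get_tensor: device -> host copy. */
int rknpu_get_tensor(const rknpu_buffer *buf, size_t off, void *dst, size_t n) {
    if (!in_range(buf, off, n)) return -1;
    memcpy(dst, buf->base + off, n);
    return 0;
}

/* cpy_tensor: device-internal copy (memmove tolerates overlapping ranges). */
int rknpu_cpy_tensor(rknpu_buffer *dst, size_t dst_off,
                     const rknpu_buffer *src, size_t src_off, size_t n) {
    if (!in_range(src, src_off, n) || !in_range(dst, dst_off, n)) return -1;
    memmove(dst->base + dst_off, src->base + src_off, n);
    return 0;
}
```

With a real dmabuf behind `base`, `set_tensor` and `get_tensor` would only be needed for data that does not already live in a shared buffer; a CPU producer writing straight into the mapped view skips the copy entirely.
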
If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it doesn't, we go through `/dev/dri/renderD*` with a thin shim.

---

## Quantization strategy

llama.cpp ships Q4_K_M as the default for ~2B models. Q4_K_M is a 4-bit weight quantization with per-group scale + min, no per-channel scale.

The NPU expects INT8 (or INT16) tensors with per-channel scale factors. Two paths:

1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
2. **Dequantize once at load time:** unpack the entire weight tensor to INT8 + per-channel scales at model-load. Permanent ~1.6x memory cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight), but no per-layer CPU work.

Phase-1 choice: (2). It is straightforward, makes the NPU path the only thing happening at inference time, and is easier to profile. On a 1.5B model that is ~1.5 GB of INT8 weights vs ~900 MB of Q4_K_M; boltzmann has 32 GB, so memory isn't the constraint.

Phase-2+: revisit (1) if we go for larger models or if INT8 turns out to cause meaningful quality loss on the small ones.

---

## Backend interface (concrete)

Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add:

```
ggml-rknpu.h       public API: ggml_backend_rknpu_init() etc.
ggml-rknpu.c       backend implementation: device registration,
                   op dispatch table, memory management
ggml-rknpu-ops.c   per-op kernels: matmul tiled to NPU's preferred shape,
                   INT8 quant pre-pass
```

These live in `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree initially; if upstream-acceptable, send for review after Phase 8.

---

## Failure handling

llama.cpp's backend abstraction already supports falling back to CPU on a per-op basis via `ggml_backend_dev_supports_op()`. We declare `MUL_MAT` supported for INT8 / FP16 inputs with the right shape constraints, and let the framework route everything else to CPU.

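
A minimal sketch of the kind of predicate that could sit behind that support check is shown below. It is deliberately decoupled from ggml: `op_kind`, `dtype`, `op_desc`, and `rknpu_supports_op` are hypothetical stand-ins for ggml's own types, and the alignment value `RKNPU_K_ALIGN` is a placeholder assumption, since the NPU's actual shape constraints are not established here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ggml's op and tensor-type enums, so the
 * sketch compiles on its own. */
typedef enum { OP_MUL_MAT, OP_GET_ROWS, OP_RMS_NORM, OP_ROPE, OP_SOFT_MAX } op_kind;
typedef enum { DT_INT8, DT_FP16, DT_FP32, DT_Q4_K } dtype;

/* Description of one graph node: (m x k) @ (k x n) for a matmul. */
typedef struct {
    op_kind op;
    dtype   src_type;   /* element type of the matmul inputs */
    int64_t m, n, k;
} op_desc;

/* ASSUMPTION: placeholder alignment of the reduction dimension.
 * The real constraint has to come from the driver / hardware docs. */
enum { RKNPU_K_ALIGN = 32 };

/* Returns true only for ops the NPU backend claims; everything else is
 * left for the framework to route to the CPU backend. */
bool rknpu_supports_op(const op_desc *d) {
    assert(d != NULL);
    if (d->op != OP_MUL_MAT) return false;                       /* only matmul is offloaded */
    if (d->src_type != DT_INT8 && d->src_type != DT_FP16) return false;
    if (d->m <= 0 || d->n <= 0 || d->k <= 0) return false;       /* reject degenerate shapes */
    return d->k % RKNPU_K_ALIGN == 0;                            /* placeholder shape rule */
}
```

Keeping the predicate conservative is the safer default: a matmul wrongly rejected only costs performance, while one wrongly accepted fails at submit time.
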
If the NPU driver returns an error mid-inference (timeout, DMA fence wait failure, etc.), the strategy is **abort the inference, log, return error to caller**. We don't try to silently fall back to CPU mid-stream because the state could be corrupted (the NPU may have partially written to the dmabuf).

---

## Phase-1 milestone

An `npu-probe` userspace binary that:

1. Opens the NPU device (whatever the mainline path is, likely `/dev/accel/accelN` or `/dev/dri/renderD*`)
2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
3. Submits a matmul via the uAPI
4. Waits, reads back, compares to a CPU reference

This proves the substrate is alive before we touch llama.cpp. If it doesn't work, we're back in kernel-driver land, not llama.cpp land.
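
The CPU-reference half of step 4 needs no driver at all, so it can be sketched now. The function names `ref_matmul_i8` and `count_mismatches` are hypothetical, and the sketch assumes row-major operands and an INT32 output; whether the NPU hands back raw INT32 accumulators or a requantized result is not settled here, so the comparison may need a tolerance or a rescale step in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference C = A @ B on the CPU.
 * A is m x k, B is k x n, C is m x n, all row-major.
 * INT8 inputs are widened to INT32 before multiplying; with |a|,|b| <= 128
 * the accumulator cannot overflow for k up to roughly 131000. */
void ref_matmul_i8(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t m, size_t k, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++) {
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Exact element-wise comparison; returns the number of differing entries.
 * ASSUMPTION: the device result is directly comparable as INT32. */
size_t count_mismatches(const int32_t *got, const int32_t *want, size_t count) {
    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        if (got[i] != want[i]) bad++;
    }
    return bad;
}
```

For the 64x64 probe, `npu-probe` would fill `a` and `b` with a fixed pattern, run `ref_matmul_i8` with `m = k = n = 64`, and report success only if `count_mismatches` against the buffer read back from the device returns 0.
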