Rosenblatt: project scaffold for RK3588 NPU on mainline

Codename: Frank Rosenblatt, after the Mark I Perceptron (1958), the first hardware neural network. This project lights up the RK3588 NPU on mainline Linux so the OSS world finally owns the silicon side of inference on that chip.

Phase-1 scope: a small LLM running on a CPU + NPU mix on boltzmann (Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend that offloads INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON. Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.

Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/, fleet/boltzmann.yaml.

Next: Phase-1 substrate audit. Fill the TBDs in docs/npu-mainline-status.md with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on the kernel that boltzmann is running.
2026-05-19 11:57:48 +00:00
commit 24adc74812
8 changed files with 578 additions and 0 deletions
@@ -0,0 +1,32 @@
+# llm-runtime
+
+llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend.
+
+Code lands here starting at Phase 5 (Plan); Phase 1 is too early for it.
+
+Until then, this directory is reserved for:
+- design notes (`docs/architecture.md` from project root is authoritative)
+- the eventual `ggml-rknpu/` backend source
+- patch series for upstream submission if quality reaches that bar
+
+## Approach
+
+Two paths to consider in Phase 5:
+
+1. **Fork llama.cpp, add backend in tree.** Easier to keep in sync;
+   harder to upstream because llama.cpp may not want a Rockchip-specific
+   backend that depends on a still-WIP mainline driver.
+2. **Out-of-tree backend, built as a dynamically loadable ggml backend
+   (llama.cpp configured with `-DGGML_BACKEND_DL=ON`).** Cleaner
+   separation; tracks llama.cpp upstream without our diff being in the
+   way. Recommended unless we need to patch core llama.cpp logic.
+
+Decision deferred to Phase 5.
+
+## Model
+
+Phase-1 target: `qwen2.5-1.5b-instruct-q4_k_m.gguf`. Source:
+hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF, or quantized locally with
+`llama-quantize`.
+
+Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B.
@@ -0,0 +1,33 @@
+# npu-probe
+
+Smallest-possible userspace binary that:
+1. Opens the NPU device (path TBD per Phase-1 audit)
+2. Allocates two INT8 input tensors (64×64) + one output (64×64)
+3. Submits a matmul via the uAPI in use (Tomeu Vizoso's DRM-accel
+   ioctls, or our own shim around vendor MMIO if the mainline accel
+   driver isn't ready)
+4. Waits for completion (DMA fence or polled completion register)
+5. Reads back the output
+6. Compares to a CPU INT8 matmul reference; reports pass/fail
+
+**Phase-1 deliverable.** Until this works, nothing else in this repo
+can be exercised against real silicon.
+
+## Build
+
+_(filled when Phase-1 audit picks the uAPI shape — `meson` or `cmake`,
+no autotools)_
+
+## Run
+
+```
+./npu-probe                      # default 64×64 INT8 matmul
+./npu-probe --shape 128,128,128  # M,N,K override
+./npu-probe --device /dev/accel/accel0    # override device path
+./npu-probe --golden golden_64x64.bin     # provide expected output for diff
+```
+
+## Why C, not Python
+
+npu-probe talks to the driver directly through ioctl, dmabuf, and
+mmap. A Python wrapper layer would obscure the exact syscall sequence
+we need to understand. Once npu-probe works, a Python binding for
+benchmark scripts is fine.
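
Step 6 of the npu-probe list (compare against a CPU INT8 matmul reference) could look roughly like the sketch below. It is not code from this commit. The function names `ref_matmul_i8` and `count_mismatches` are hypothetical, the row-major layout and the reading of `--shape M,N,K` as A = M×K, B = K×N are assumptions, and so is the int32 output type, since the NPU's actual output format is still TBD pending the Phase-1 audit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference C = A x B computed on the CPU.
 * A is M x K and B is K x N, both row-major int8 (assumed layout).
 * C is M x N row-major int32 (assumed output type, not confirmed by
 * the source). For very large K the int32 accumulator could overflow;
 * the 64x64 default is far below that limit. */
void ref_matmul_i8(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t m, size_t n, size_t k)
{
    assert(a && b && c);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Count elements where the device output differs from the reference.
 * Returns 0 on pass. If first_bad is non-NULL and at least one element
 * differs, *first_bad receives the index of the first mismatch;
 * otherwise it is left untouched. */
size_t count_mismatches(const int32_t *got, const int32_t *want,
                        size_t count, size_t *first_bad)
{
    assert(got && want);
    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        if (got[i] != want[i]) {
            if (bad == 0 && first_bad)
                *first_bad = i;
            bad++;
        }
    }
    return bad;
}
```

With the default 64×64 shape the largest possible accumulator magnitude is 64 × 128 × 128 = 1,048,576, so int32 has ample headroom. The check is exact equality because integer matmul has no rounding to tolerate. If the NPU turns out to requantize its output to INT8, the reference would need the same scaling applied before comparing.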