24adc74812
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network. This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.
Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.
Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.
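The CPU/NPU split described above can be sketched in a few lines. This is a pure-Python illustration only, not the backend: it assumes symmetric per-tensor INT8 quantization (a simplification, not the Q4_K block format), and the function names quantize_int8, int8_gemm and matmul_via_int8 are hypothetical. Only the integer multiply-accumulate loop stands in for the work that would go to the NPU; quantization and rescaling stay on the CPU side.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (assumed scheme): returns (int8 values, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range [-127, 127].
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_gemm(a_q, b_q, m, k, n):
    """Integer-only matmul with a wide accumulator: the offload candidate.

    a_q is row-major m x k, b_q is row-major k x n, result is row-major m x n.
    """
    out = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0  # Python ints stand in for an int32 accumulator
            for p in range(k):
                acc += a_q[i * k + p] * b_q[p * n + j]
            out[i * n + j] = acc
    return out


def matmul_via_int8(a, b, m, k, n):
    """Float matmul routed through the INT8 path."""
    a_q, a_scale = quantize_int8(a)      # CPU side
    b_q, b_scale = quantize_int8(b)      # CPU side
    acc = int8_gemm(a_q, b_q, m, k, n)   # would run on the NPU
    return [v * a_scale * b_scale for v in acc]  # CPU side: rescale to float
```

The result differs from a float matmul by the quantization error, which is the accuracy cost any real INT8 offload also has to budget for.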
Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.
Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
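A first step of that audit could be a small probe like the sketch below. It relies on the DRM accel subsystem exposing device nodes as /dev/accel/accelN; whether the kernel running on boltzmann exposes the NPU this way is exactly what the audit has to establish. The helper names list_accel_nodes and audit_summary are hypothetical.

```python
import glob
import os
import platform


def list_accel_nodes(dev_root="/dev"):
    """Return sorted accel device nodes under <dev_root>/accel, e.g. /dev/accel/accel0."""
    pattern = os.path.join(dev_root, "accel", "accel*")
    return sorted(glob.glob(pattern))


def audit_summary(dev_root="/dev"):
    """Collect the facts the status doc needs: running kernel and visible accel nodes."""
    nodes = list_accel_nodes(dev_root)
    return {
        "kernel_release": platform.release(),
        "accel_nodes": nodes,
        "accel_present": bool(nodes),
    }


if __name__ == "__main__":
    for key, value in audit_summary().items():
        print(f"{key}: {value}")
```

An empty accel_nodes list only says that no accel device is visible; it does not distinguish a missing driver from a missing device-tree node.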
llm-runtime
llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend.
Code lands here starting at Phase 5 (Plan); Phase 1 is too early for it.
Until then, this directory holds:
- design notes (docs/architecture.md from project root is authoritative)
- the eventual ggml-rknpu/ backend source
- patch series for upstream submission if quality reaches that bar
Approach
Two paths to consider in Phase 5:
- Fork llama.cpp, add backend in tree. Easier to keep in sync; harder to upstream because llama.cpp may not want a Rockchip-specific backend that depends on a still-WIP mainline driver.
- Out-of-tree backend, load via llama.cpp's plugin API (-DGGML_BACKEND_DL=ON). Cleaner separation; tracks llama.cpp upstream without our diff being in the way. Recommended unless we need to patch core llama.cpp logic.
Decision deferred to Phase 5.
Model
Phase-1 target: qwen2.5-1.5b-instruct-q4_k_m.gguf. Source:
hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF or built locally with
llama-quantize.
Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B.
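Whether a stretch model fits can be estimated before downloading anything. The sketch below is back-of-envelope arithmetic under stated assumptions: the parameter counts are the nominal ones from the model names, 4.85 bits per weight is an assumed approximate average for Q4_K_M (read the real figure off the GGUF file size), and the budget and overhead values are placeholders to be replaced with measured numbers. The function names are hypothetical.

```python
ASSUMED_Q4_K_M_BPW = 4.85  # assumption: approximate average bits per weight


def estimate_weight_bytes(n_params, bits_per_weight=ASSUMED_Q4_K_M_BPW):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def fits(n_params, bits_per_weight, budget_bytes, overhead_bytes=0):
    """True if weights plus overhead (KV cache, buffers) fit in the given budget."""
    return estimate_weight_bytes(n_params, bits_per_weight) + overhead_bytes <= budget_bytes


if __name__ == "__main__":
    budget = 4 * 1024**3  # placeholder budget, not a measured value
    for name, n_params in [("qwen2.5-1.5B", 1.5e9), ("qwen2.5-3B", 3.0e9)]:
        size_gib = estimate_weight_bytes(n_params) / 1024**3
        print(f"{name}: ~{size_gib:.2f} GiB, fits={fits(n_params, ASSUMED_Q4_K_M_BPW, budget)}")
```

This covers system memory only; NPU SRAM limits depend on how the backend tiles each matmul and have to be checked separately.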