24adc74812
Codename: Frank Rosenblatt, after the Mark I Perceptron (1958), the
first hardware neural network. This project lights up the RK3588 NPU
on mainline Linux so the OSS world finally owns the silicon side of
inference on that chip.
Phase-1 scope: a small LLM running on a CPU + NPU mix on boltzmann
(Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend that
offloads INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC
array and leaves dequant / RoPE / softmax / sampling / embedding on
A76 NEON.
Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.
Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.
Next: Phase-1 substrate audit. Fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the kernel that boltzmann is running.
npu-probe
Smallest-possible userspace binary that:
- Opens the NPU device (path TBD per Phase-1 audit)
- Allocates two INT8 input tensors (64×64) + one output (64×64)
- Submits a matmul via whichever uAPI is in use (Tomeu's accel ioctl, or our own shim around vendor MMIO if the mainline accel driver isn't ready)
- Waits for completion (DMA fence or polled completion register)
- Reads back the output
- Compares to a CPU INT8 matmul reference; reports pass/fail
Phase-1 deliverable. Until this works, nothing else in this repo can be exercised against real silicon.
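The CPU reference and the pass/fail comparison from the list above do not depend on the uAPI choice, so they can be sketched now. This is a minimal sketch under assumptions the text does not settle: row-major layout, INT32 accumulators for the output, and exact (bit-for-bit) comparison. The function names ref_matmul_s8 and count_mismatches are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference C = A x B on the CPU.
 * A is M x K, B is K x N, C is M x N, all row-major.
 * Assumption: inputs are INT8 and products accumulate into INT32. */
void ref_matmul_s8(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t m, size_t n, size_t k)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Returns the number of elements where the NPU output differs from
 * the reference; 0 means pass. */
size_t count_mismatches(const int32_t *got, const int32_t *want, size_t count)
{
    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        if (got[i] != want[i])
            bad++;
    }
    return bad;
}
```

If the NPU turns out to write a requantized INT8 output rather than raw accumulators, the comparison would need the same requantization step applied to the reference first.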
Build
(To be filled in once the Phase-1 audit picks the uAPI shape. Meson or
CMake, no autotools.)
Run
./npu-probe # default 64×64 INT8 matmul
./npu-probe --shape 128,128,128 # M,N,K override
./npu-probe --device /dev/accel/accel0 # override device path
./npu-probe --golden golden_64x64.bin # provide expected output for diff
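One way the M,N,K argument to --shape might be parsed is sketched below. parse_shape is a hypothetical helper, not existing code, and the rule that every dimension must be a positive decimal integer is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Parses "M,N,K" into dims[0..2].
 * Returns 0 on success, -1 on malformed input.
 * Assumption: each dimension is a positive decimal integer with no
 * sign, whitespace, or trailing characters. */
int parse_shape(const char *arg, size_t dims[3])
{
    assert(dims != NULL);
    if (arg == NULL)
        return -1;

    const char *p = arg;
    for (int i = 0; i < 3; i++) {
        /* strtoul accepts signs and leading spaces, so reject them here. */
        if (!isdigit((unsigned char)*p))
            return -1;

        char *end;
        errno = 0;
        unsigned long v = strtoul(p, &end, 10);
        if (errno != 0 || v == 0)
            return -1;

        /* The first two fields end in a comma, the last in the terminator. */
        if (*end != (i < 2 ? ',' : '\0'))
            return -1;

        dims[i] = (size_t)v;
        p = end + 1;
    }
    return 0;
}
```

With this sketch, "128,128,128" yields dims = {128, 128, 128}, while inputs such as "64,32" or "64,-1,8" are rejected.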
Why C, not Python
Direct ioctl + dmabuf + mmap. A Python wrapper layer would obscure the exact syscall sequence we need to understand. Once npu-probe works, a Python binding for benchmark scripts is fine.
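As an illustration of how little sits between the C code and the kernel, the first step of that sequence, opening the device node, might look like the sketch below. npu_open is a hypothetical helper. The ioctl and mmap steps are left out on purpose because the uAPI is still TBD, and the device path is whatever the Phase-1 audit settles on.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for O_CLOEXEC under strict -std=c11 */
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Opens the NPU device node read-write.
 * Returns a file descriptor >= 0 on success, or -errno on failure so
 * the caller can tell a missing node (-ENOENT) from a permission
 * problem (-EACCES). */
int npu_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) {
        int err = errno; /* save before fprintf can clobber it */
        fprintf(stderr, "npu-probe: open %s: %s\n", path, strerror(err));
        return -err;
    }
    return fd;
}
```

Every later step (buffer allocation, submit, wait) would take the returned descriptor, so the whole probe stays a flat, traceable list of syscalls under strace.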