rosenblatt/userspace/npu-probe/README.md
marfrit 24adc74812 Rosenblatt: project scaffold for RK3588 NPU on mainline
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network.  This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.

Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+).  Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.

Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.

Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.

Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
2026-05-19 11:57:48 +00:00

# npu-probe

Smallest-possible userspace binary that:

1. Opens the NPU device (path TBD per Phase-1 audit)
2. Allocates two INT8 input tensors (64×64) + one output (64×64)
3. Submits a matmul through whichever uAPI Phase 1 settles on (Tomeu's
   accel ioctl, or our own shim around the vendor MMIO if mainline accel
   isn't ready)
4. Waits for completion (DMA fence or polled completion register)
5. Reads back the output
6. Compares it against a CPU INT8 matmul reference and reports pass/fail

**Phase-1 deliverable.** Until this works, nothing else in this repo
can be exercised against real silicon.
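
Step 6 is the one piece that can be written and tested today without touching the NPU. A minimal sketch of the CPU reference follows. It assumes row-major layout (A is M×K, B is K×N) and an INT32 accumulator for the output; neither is fixed by this README. If the NPU emits requantized INT8 instead, the reference needs the same requantization before comparing. The names `ref_matmul_i8` and `count_mismatches` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* CPU reference for C = A * B with INT8 inputs and INT32 accumulation.
 * Assumed layout (not specified by the README): row-major, A is M x K,
 * B is K x N, C is M x N. */
void ref_matmul_i8(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Exact compare of the NPU output against the reference. Prints the
 * first few mismatches to stderr and returns how many elements differ
 * (0 means pass). */
size_t count_mismatches(const int32_t *got, const int32_t *want,
                        size_t m, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < m * n; i++) {
        if (got[i] != want[i]) {
            if (bad < 8)
                fprintf(stderr, "mismatch at (%zu,%zu): got %d, want %d\n",
                        i / n, i % n, (int)got[i], (int)want[i]);
            bad++;
        }
    }
    return bad;
}
```

With INT32 accumulation the default 64×64×64 case cannot overflow: the worst-case sum is 64 × 128 × 128 = 1,048,576.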
## Build
_(To be filled in once the Phase-1 audit picks the uAPI shape. Build
system: `meson` or `cmake`, no autotools.)_
## Run
```sh
./npu-probe                                # default 64×64 INT8 matmul
./npu-probe --shape 128,128,128            # M,N,K override
./npu-probe --device /dev/accel/accel0     # override device path
./npu-probe --golden golden_64x64.bin      # provide expected output for diff
```
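
The flags above map naturally onto a small option struct. Below is a sketch using `getopt_long`. Several pieces are assumptions: the struct, the helper names `parse_shape` and `parse_args`, the strict `M,N,K` parsing, and the default device path `/dev/accel/accel0`. The README only shows that path as an override example; the real default is still TBD per the Phase-1 audit.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical option struct for npu-probe. */
struct probe_opts {
    unsigned long m, n, k;   /* C[M x N] = A[M x K] * B[K x N] */
    const char *device;      /* NPU device node */
    const char *golden;      /* expected-output file, NULL if unset */
};

/* Parse "M,N,K" into three non-zero integers. Returns 0 on success. */
int parse_shape(const char *s, unsigned long *m, unsigned long *n,
                unsigned long *k)
{
    char tail;
    /* The trailing %c rejects garbage such as "64,64,64x". */
    if (sscanf(s, "%lu,%lu,%lu%c", m, n, k, &tail) != 3)
        return -1;
    return (*m && *n && *k) ? 0 : -1;
}

/* Returns 0 on success, -1 on a usage error. */
int parse_args(int argc, char **argv, struct probe_opts *o)
{
    static const struct option longopts[] = {
        { "shape",  required_argument, NULL, 's' },
        { "device", required_argument, NULL, 'd' },
        { "golden", required_argument, NULL, 'g' },
        { NULL, 0, NULL, 0 },
    };

    /* Defaults: the README's 64x64 INT8 case; the device path is assumed. */
    o->m = o->n = o->k = 64;
    o->device = "/dev/accel/accel0";
    o->golden = NULL;

    int c;
    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        switch (c) {
        case 's':
            if (parse_shape(optarg, &o->m, &o->n, &o->k) != 0) {
                fprintf(stderr, "bad --shape '%s', want M,N,K\n", optarg);
                return -1;
            }
            break;
        case 'd':
            o->device = optarg;
            break;
        case 'g':
            o->golden = optarg;
            break;
        default:
            return -1;   /* getopt_long has already printed the error */
        }
    }
    return optind == argc ? 0 : -1;   /* reject stray positional args */
}
```

`main` would call `parse_args(argc, argv, &opts)` first and hand `opts.device` to `open(2)`. `getopt_long` is a GNU extension, but both glibc and musl provide it.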
## Why C, not Python
Direct ioctl + dmabuf + mmap. A Python wrapper layer would obscure the
exact syscall sequence we need to understand. Once npu-probe works,
a Python binding for benchmark scripts is fine.
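
For illustration, here is a sketch of the CPU side of one tensor buffer: allocate a dmabuf, `mmap` it, and bracket CPU writes with cache syncs. It assumes the buffer comes from the generic system DMA-BUF heap (`/dev/dma_heap/system`) and that the NPU driver can import such buffers. Both points are open Phase-1 questions. If the accel driver allocates its own buffer objects, only the allocation step changes. `DMA_HEAP_IOCTL_ALLOC`, `mmap` on a dmabuf fd and `DMA_BUF_IOCTL_SYNC` are existing upstream kernel uAPI. The helper names `dmabuf_alloc`, `dmabuf_cpu_access` and `fill_input` are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-buf.h>
#include <linux/dma-heap.h>

/* Allocate `len` bytes from a DMA-BUF heap. Returns a dmabuf fd or -1. */
int dmabuf_alloc(const char *heap_path, size_t len)
{
    int heap = open(heap_path, O_RDWR | O_CLOEXEC);
    if (heap < 0)
        return -1;

    struct dma_heap_allocation_data req = {
        .len = len,
        .fd_flags = O_RDWR | O_CLOEXEC,   /* flags for the returned fd */
        .heap_flags = 0,
    };
    int ret = ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &req);
    close(heap);
    return ret < 0 ? -1 : (int)req.fd;
}

/* Bracket CPU access so CPU caches stay coherent with device DMA:
 * call with begin=1 before touching the mapping, begin=0 when done. */
int dmabuf_cpu_access(int buf_fd, int begin, int write)
{
    struct dma_buf_sync sync = {
        .flags = (begin ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END) |
                 (write ? DMA_BUF_SYNC_WRITE : DMA_BUF_SYNC_READ),
    };
    return ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync) < 0 ? -1 : 0;
}

/* Copy an INT8 input tensor into a dmabuf from the CPU. */
int fill_input(int buf_fd, const int8_t *src, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, buf_fd, 0);
    if (p == MAP_FAILED)
        return -1;

    int ret = dmabuf_cpu_access(buf_fd, 1, 1);
    if (ret == 0) {
        memcpy(p, src, len);
        ret = dmabuf_cpu_access(buf_fd, 0, 1);
    }
    munmap(p, len);
    return ret;
}
```

Readback (step 5) mirrors `fill_input` with `DMA_BUF_SYNC_READ` and a copy out of the mapping.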