Rosenblatt: project scaffold for RK3588 NPU on mainline
Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first
hardware neural network. This project lights up the RK3588 NPU on
mainline Linux so the OSS world finally owns the silicon-side of
inference on that chip.
Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5
ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading
INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while
leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON.
Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF.
Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling
punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for
DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/,
fleet/boltzmann.yaml.
Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md
with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on
the boltzmann-running kernel.
@@ -0,0 +1,107 @@
# Rosenblatt

**Codename:** Frank Rosenblatt built the Mark I Perceptron in 1958, one of the first
hardware neural networks (400 photocells, analog weights held in motor-driven potentiometers).
This project lights up the RK3588 NPU on mainline Linux, so the OSS world
finally owns the silicon side of inference on that chip.

**Scope (Phase 1):** a small LLM running on a CPU + NPU mix on `boltzmann` (Rock 5
ITX+, RK3588, 32 GB DDR4). Backend: `llama.cpp` with a new `rknpu` device that
offloads the heavy GEMM (matmul in attention + FFN) to the NPU's INT8 path
while leaving dequant / RoPE / softmax / sampling / embedding lookup on the
A76 NEON cores.

**Target model (Phase 1):**
`qwen2.5-1.5B-instruct` Q4_K_M GGUF. Fits in the NPU's accessible memory
budget, has chat tuning, public license. Stretch: `qwen2.5-3B`,
`gemma3-2B`.

**Out of scope (Phase 1, capture separately if pursued):**
- Vision helper (object detection / OCR / face-blur): different op mix,
  re-scope after Phase-1 numbers
- RKNPU vendor SDK adoption: we want mainline-clean, not vendor-blob
- Other Rockchip NPUs (RK3576 has the same NPU IP block and should port for
  free once the RK3588 path lands, but defer until Phase 1 closes)

**Non-goal: parity with the rknn-llm vendor stack on day 1.** The vendor has
hand-tuned tensor layouts + quantization; we'll be slower at first. The goal
is *credible*, defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the
NPU actually doing the bulk of the GEMM work. The number itself isn't the
point; the open path to it is.

---

## Phases (9 + 1 loop)

| # | Phase | Deliverable |
|---|---|---|
| 1 | **Substrate** | Audit mainline NPU driver state (Tomeu Vizoso's rknpu / DRM-accel series); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. |
| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. |
| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's rknpu uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code; extract the spec. |
| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. |
| 5 | Plan | rknpu backend interface design: where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. |
| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. |
| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against a CPU INT8 reference; within a stated tolerance of the Q4_K dequant + fp32 matmul path. |
| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. |
| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. |
| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). |

Per `feedback_dev_process.md`: rewind to Phase 1 on blocker, Phase 4 on
direction change, Phase 0 on scope change.

---

## Repo layout

```
rosenblatt/
├── README.md                     this file
├── TODO.md                       rolling punch-list
├── docs/
│   ├── npu-mainline-status.md    Phase-1 audit
│   ├── architecture.md           CPU+NPU split, ggml backend shape
│   └── phases.md                 per-phase log (analog to ~/src/bin/phases/)
├── kernel/                       mainline-bound patches (DT bindings, rknpu driver tweaks)
├── userspace/
│   ├── npu-probe/                smallest-possible "open device + run trivial matmul" sanity
│   └── llm-runtime/              llama.cpp fork with rknpu backend
├── fleet/
│   └── boltzmann.yaml            host manifest (kernel + NPU driver pin, baseline measurement)
└── benchmarks/                   reproducible bench scripts + recorded results (JSON + plots)
```

---

## Host

Primary: **boltzmann** (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs).
- Already runs mainline ~v7.0 with most peripheral drivers working.
- Has the Quark UEFI / Neutron kernel stack; the NPU is the next missing peripheral.
- Other RK3588 hosts (`ampere` = CoolPi GenBook) come later for port-validation.

Why not `ampere`: laptop, intermittent power, in-use for other campaigns.
Boltzmann is always-on with 32 GB headroom, the right substrate for kernel
hacking with serial-console fallback (when [Quark](https://git.reauktion.de/marfrit/quark) exposes one).

---

## Codename rationale

Rosenblatt's Mark I was custom analog hardware doing fixed-function
matmul-adjacent work (weighted-sum + threshold), with weights tunable per slot via
mechanical control. The RK3588 NPU is fixed-function INT8 matmul/conv hardware
with weights loaded per inference. Same shape, 68 years later, with the same
"how do we drive this thing from a general-purpose computer?" problem. The
1958 paper's answer was: build a control panel. The 2026 answer is: a DRM
accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA.
We're writing the second half.

---

## Status

| Phase | State | Date |
|---|---|---|
| 0: bootstrap | done | 2026-05-19 |
| 1: substrate audit | open | |
| 2..10 | pending | |
@@ -0,0 +1,75 @@
# TODO: Rosenblatt

Rolling punch-list. Older items at bottom (move done → DONE.md when noisy).

---

## Phase 1: substrate audit

- [ ] On boltzmann: `uname -r` → record in `fleet/boltzmann.yaml:kernel.running_version`
- [ ] `find / -path '*accel*' -name '*.ko*' 2>/dev/null`: check whether the accel framework is built
  as a module (`*.ko*` also matches compressed modules; a built-in driver will not show up here)
- [ ] `ls /dev/accel/ /dev/dri/`: what's exposed?
- [ ] `lsmod | grep -iE 'rknpu|accel'`: what's loaded?
- [ ] `dmesg | grep -iE 'rknpu|npu|accel'` since boot: driver bringup log
- [ ] Tomeu's rknpu series: find on lore.kernel.org/dri-devel, capture latest
  patch-set version + state (merged / in-review / dropped) → fill table in
  `docs/npu-mainline-status.md`
- [ ] Check `drivers/accel/` in current torvalds tree: list in-tree
  accelerators, confirm rknpu's mainline state
- [ ] Check DT bindings: `Documentation/devicetree/bindings/npu/rockchip,*.yaml`
- [ ] Inspect `arch/arm64/boot/dts/rockchip/rk3588.dtsi` for `npu` node
- [ ] If a userspace shim exists (rkneural?), capture repo URL + try
  hello-world against the running kernel
- [ ] Spec-extract from BSP vendor `rockchip-npu` source: register map,
  DMA descriptor format, irq handling. No code lift; spec only.

Phase exit criteria: `docs/npu-mainline-status.md` table fully populated;
clear answer to "do we drive via accel uAPI or write our own MMIO driver."
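
The "what's exposed?" check can also be done from C, which is where `npu-probe` will need it anyway. The following is a minimal sketch, not the project's implementation: `count_nodes` is a hypothetical helper name, and it only relies on the standard DRM naming convention (`accelN` under `/dev/accel`, `renderD*` under `/dev/dri`).

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count entries in `dir` whose names start with `prefix`, printing each one.
 * Returns the number of matches, or -1 if the directory cannot be opened
 * (for example because the accel framework is not present at all). */
int count_nodes(const char *dir, const char *prefix)
{
    DIR *d = opendir(dir);
    if (!d)
        return -1;

    int n = 0;
    size_t plen = strlen(prefix);
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, prefix, plen) == 0) {
            printf("%s/%s\n", dir, e->d_name);
            n++;
        }
    }
    closedir(d);
    return n;
}
```

Calling `count_nodes("/dev/accel", "accel")` and `count_nodes("/dev/dri", "renderD")` gives a quick yes/no on whether any candidate device node exists before attempting driver-specific ioctls.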

---

## Phase 2: formulate

- [ ] List llama.cpp ops by wallclock %, profiling qwen-1.5B Q4_K_M on CPU
  (use llama.cpp's built-in perf-timer or perf record)
- [ ] Pick the exact INT8 matmul tile size the NPU prefers (read from BSP source)
- [ ] Spec out the smallest backend interface: which ops we MUST handle,
  which the framework falls back to CPU
- [ ] Write `docs/op-coverage.md`
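
Once a tile size is known, the cost of padding a given matmul to that tile can be estimated up front. This is a sketch under assumptions: `struct tile_shape`, `round_up`, and `tile_count` are hypothetical names, and the tile dimensions are placeholders until the BSP source has been read.

```c
#include <stddef.h>

/* Hypothetical tile geometry; the real values come out of the Phase-2 work. */
struct tile_shape { size_t m, n, k; };

/* Round `x` up to the next multiple of `t` (t must be > 0). */
size_t round_up(size_t x, size_t t)
{
    return (x + t - 1) / t * t;
}

/* Number of tile-sized jobs needed to cover an M x N x K matmul after
 * padding each dimension up to a whole number of tiles. */
size_t tile_count(size_t M, size_t N, size_t K, struct tile_shape t)
{
    return (round_up(M, t.m) / t.m)
         * (round_up(N, t.n) / t.n)
         * (round_up(K, t.k) / t.k);
}
```

Comparing `tile_count` for the model's actual matmul shapes against the unpadded work shows how much of the NPU's time would go to padding for a candidate tile size.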

---

## Phase 3: analyze

- [ ] RKNPU2 SDK: trace through `librknnrt.so` user-API → kernel ioctl shapes
  (objdump + strings only; no reverse-engineering of the vendor blob beyond
  the syscall surface)
- [ ] Tomeu's accel uAPI: read driver source, understand:
  - submit-job ioctl shape
  - dmabuf import path
  - fence-wait mechanism
  - error reporting
- [ ] BSP vendor `rockchip-npu` source: register layout, DMA descriptor
  struct, irq handling sequence

---

## Phase 4: baseline

- [ ] Build vanilla llama.cpp on boltzmann (upstream `master`)
- [ ] Pull qwen2.5-1.5b-instruct Q4_K_M GGUF
- [ ] `llama-bench -m qwen2.5-1.5b-instruct-q4_k_m.gguf -p 512 -n 128 -o json` × 3 runs
- [ ] Capture JSON to `benchmarks/$(date +%F)_boltzmann_qwen1.5b_cpu_baseline.json`
- [ ] Record into `fleet/boltzmann.yaml:baseline_measurement`

---

## Cross-phase / standing items

- [ ] Mirror Tomeu's WIP branch into a local clone for kernel hacking
- [ ] Set up serial console on boltzmann for kernel-panic recovery (Quark
  umbrella; check current state)
- [ ] Add `project_rosenblatt.md` to claude-memory once Phase 1 closes (so
  future sessions don't re-discover the campaign)
- [ ] Decide repo home: marfrit/rosenblatt on git.reauktion.de (probably yes,
  after Phase-1 substrate is captured and the README isn't embarrassing)
@@ -0,0 +1,127 @@
# Architecture: CPU + NPU mix for llama.cpp on RK3588

## The split

llama.cpp's compute graph is built around ggml ops. We don't replace
llama.cpp's whole engine; we register a new **device backend** (in
ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload
the ops that are worth the round-trip cost.

### Goes to NPU (heavy, dense, INT8-friendly)

- `MUL_MAT` (matrix-matrix multiply): the workhorse, dominates wall
  time. Attention `Q @ K^T` and `attn @ V`, as well as FFN `up_proj`,
  `down_proj`, `gate_proj`, all have this shape.
- `MUL_MAT_ID` (MoE-style mixture matmul): for when we eventually try a
  mixture-of-experts model. Post-Phase-1 scope.

### Stays on CPU (small, op-specific, or per-token)

- Embedding lookup (`GET_ROWS`): random-access gather, NPU has no
  fast path
- `RMS_NORM` / `LAYER_NORM`: per-token reduction + element-wise
- `ROPE`: small, per-head, lots of trig
- `SOFT_MAX`: small, per-head
- Activations (`SILU`, `GELU`): element-wise, cheap on NEON
- `SCALE`, `ADD`, `MUL` (element-wise): cheap on NEON
- Sampling, KV cache update, tokenization: entirely host
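
The split above boils down to one predicate the backend answers per graph node. The sketch below shows the shape of that decision without depending on ggml headers: the enums, `struct sketch_node`, `SKETCH_K_ALIGN`, and `rknpu_sketch_supports_op` are all stand-ins invented for illustration, and the alignment value is a placeholder rather than a known hardware constraint.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for ggml's op and type enums so the sketch compiles on its own. */
enum sketch_op   { OP_MUL_MAT, OP_MUL_MAT_ID, OP_GET_ROWS, OP_RMS_NORM,
                   OP_ROPE, OP_SOFT_MAX, OP_ADD };
enum sketch_type { TYPE_I8, TYPE_F16, TYPE_F32 };

struct sketch_node {
    enum sketch_op   op;
    enum sketch_type weight_type;
    int64_t m, n, k;              /* matmul dimensions */
};

/* Hypothetical alignment the NPU requires on K; placeholder until Phase 2. */
#define SKETCH_K_ALIGN 32

/* True only for matmuls the NPU path is willing to take; anything else is
 * left for the framework to route to the CPU backend. */
bool rknpu_sketch_supports_op(const struct sketch_node *node)
{
    if (node->op != OP_MUL_MAT)
        return false;
    if (node->weight_type != TYPE_I8 && node->weight_type != TYPE_F16)
        return false;
    return node->k % SKETCH_K_ALIGN == 0;
}
```

Keeping the predicate conservative (reject unless clearly supported) means an unexpected op or shape degrades to CPU speed instead of producing wrong results.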

### KV cache: open question

Two options:
1. **CPU-resident:** lives in normal Linux memory; NPU pulls
   activations from CPU and pushes results back per layer.
2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers
   without round-trips. Cheaper per-layer but constrains model size
   to NPU-accessible memory.

Phase 5 (Plan) picks based on Phase-3 (Analyze) findings on the DMA
cost.

---

## Memory mapping

ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement:
- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes
- `free_buffer(buffer)` → release dmabuf
- `set_tensor` / `get_tensor` → CPU → NPU and NPU → CPU copies, respectively
- `cpy_tensor` → device-internal copy

The dmabuf approach lets us share buffers between CPU producer (e.g.
embedding lookup output) and NPU consumer (matmul) without an extra
copy: `mmap` on CPU side, DMA-import on NPU side.

If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it
doesn't, we go through `/dev/dri/renderD*` with a thin shim.

---

## Quantization strategy

Q4_K_M is the usual GGUF quantization distributed for ~2B models. It is a
4-bit weight quantization with per-group scale + min, no per-channel
scale. The NPU expects INT8 (or INT16) tensors with per-channel scale
factors.

Two paths:
1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before
   the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass.
2. **Dequantize once at load time:** unpack the entire weight tensor
   to INT8 + per-channel scales at model-load. Permanent ~1.7x memory
   cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight),
   but no per-layer CPU work.

Phase-1 choice: (2). It is straightforward, makes the NPU path the only
thing happening at inference time, and is easier to profile. The memory cost
on 1.5B is ~1.5 GB INT8 weights vs ~900 MB Q4_K_M; boltzmann has
32 GB, so this isn't the constraint.

Phase-2+: revisit (1) if we go for larger models or if INT8 turns out
to be quality-loss-meaningful on the small ones.
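
For the load-time path, the re-quantization step after dequantizing a weight row to fp32 could look like the sketch below. It assumes symmetric per-channel quantization (one scale per output row, no zero point); whether the NPU wants symmetric or asymmetric scales is still to be confirmed in Phase 3. `quantize_row_i8` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Quantize one output channel (a weight row of length k) to symmetric INT8.
 * Returns the scale such that w[i] is approximately scale * q[i].
 * An all-zero row yields zeros and a scale of 1.0. */
float quantize_row_i8(const float *w, int8_t *q, size_t k)
{
    float amax = 0.0f;
    for (size_t i = 0; i < k; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > amax)
            amax = a;
    }
    if (amax == 0.0f) {
        for (size_t i = 0; i < k; i++)
            q[i] = 0;
        return 1.0f;
    }

    float scale = amax / 127.0f;
    for (size_t i = 0; i < k; i++) {
        float x = w[i] / scale;
        /* round half away from zero, then clamp to the symmetric range */
        int v = (int)(x + (x >= 0.0f ? 0.5f : -0.5f));
        if (v > 127)  v = 127;
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
    return scale;
}
```

Running this once per row at model load yields the INT8 tensor plus a scale vector; the quality impact relative to Q4_K_M is exactly what the "revisit (1)" note above would need to measure.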

---

## Backend interface: concrete

Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add:

```
ggml-rknpu.h        public API: ggml_backend_rknpu_init() etc.
ggml-rknpu.c        backend implementation: device registration, op
                    dispatch table, memory management
ggml-rknpu-ops.c    per-op kernels: matmul tiled to NPU's preferred
                    shape, INT8 quant pre-pass
```

In `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree initially; if
upstream-acceptable, send for review after Phase 8.

---

## Failure handling

llama.cpp's backend abstraction already supports falling back to CPU
on a per-op basis via `ggml_backend_dev_supports_op()`. We declare
`MUL_MAT` supported for INT8 / FP16 inputs with the right shape
constraints, and let the framework route everything else to CPU.

If the NPU driver returns an error mid-inference (timeout, DMA
fence wait fail, etc.), the strategy is **abort the inference, log,
return error to caller**. We don't try to silently fall back to CPU
mid-stream because the state would be corrupted (NPU may have
partially written to the dmabuf).

---

## Phase-1 milestone

A `npu-probe` userspace binary that:
1. Opens the NPU device (whatever the mainline path is, likely
   `/dev/accel/accelN` or `/dev/dri/renderD*`)
2. Allocates two small INT8 input tensors + one output (e.g. 64x64)
3. Submits a matmul via the uAPI
4. Waits, reads back, compares to a CPU reference

This proves the substrate is alive before we touch llama.cpp. If it
doesn't work, we're back in kernel-driver land, not llama.cpp land.
@@ -0,0 +1,125 @@
# RK3588 NPU on mainline: status audit (Phase 1)

> **Phase 1 deliverable.** This file is the canonical snapshot of where
> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of
> the audit date. Re-audit on each kernel-version bump or when a
> blocker hits.

**Audit date:** _(pending; first Phase-1 task)_
**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_

---

## Hardware quick-spec

- **NPU:** Rockchip "RKNPU2": three independent NPU cores, each ~2 TOPS
  INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with
  reduced throughput). Per-core local memory (~2 MB SRAM each).
- **Bus:** AXI interconnect, dedicated DMA engine, integrated power
  domain (`pd_npu`).
- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0; local
  copy referenced in `reference_rk3588_trm` memory.
- **Vendor docs source-of-truth:**
  https://github.com/airockchip/rknn-toolkit2 (SDK + ops list +
  quantization scheme).

---

## Upstream work in flight

To audit in Phase 1:

1. **Tomeu Vizoso's rknpu / DRM-accel series**
   - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in
     `linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff
     date per `reference_linux_rockchip_ml` memory.
   - State to capture: which patches merged, which are review-pending,
     which dropped on the floor.
2. **`drivers/accel/` mainline state**
   - Check `drivers/accel/` in current torvalds tree: list of in-tree
     accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is
     there yet.
3. **DT bindings**
   - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml`:
     does it exist in mainline?
4. **Userspace bridge**
   - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI.
     Capture repo URL + current state.

Fill the table below during Phase-1 audit:

| Component | Mainline state | URL | Notes |
|---|---|---|---|
| `drivers/accel/rknpu/` | _TBD_ | | |
| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | |
| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588.dtsi` | _TBD_ | | |
| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | |
| userspace uAPI consumer | _TBD_ | | |
| Tomeu's WIP branch | _TBD_ | | |

---

## Vendor stack (for spec extraction only; don't lift code)

- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides:
  - `librknnrt.so` runtime (closed, vendor binary)
  - `rknn-convert` ONNX → `.rknn` transcoder
  - the companion `rknn-llm` stack for LLM-specific quant + scheduling
- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible
  (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match`:
  read the source, don't link the binary.

Operational rule: **mainline-clean from day 1.** No vendor blob runtime,
no vendor-kernel-only module. If we need to copy a register-layout table
from BSP source to get started, that's fine; copying a `.so` is not.

---

## Hardware capability bounds

What the NPU CAN do (Phase-1-relevant):
- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical)
- INT8 tensors with per-channel quantization scales
- 3D conv (mostly unused in LLM workloads)
- LayerNorm + activation fusion in some configs

What the NPU CANNOT do that an LLM needs (so it stays on CPU):
- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but
  with throughput penalty)
- Sampling (multinomial, top-k, top-p): pure CPU
- Sparse attention: no sparse-tile support in the NPU op set
- Embedding lookup (random-access pattern; gather not in NPU op set)
- RoPE, softmax, RMSNorm: these run on CPU because the NPU's
  fixed-function pipeline doesn't have these as first-class ops
  without going through a generic matmul shoehorn

Phase-2 question: which of those CAN move to NPU later, with op-fusion or
custom kernels.

---

## Throughput sanity check (theoretical only, not measured)

qwen2.5-1.5B Q4_K_M:
- ~1.5e9 weights, attention + FFN dominated by matmul
- Per token: roughly 1.5e9 MACs, i.e. 2 × 1.5e9 = 3 GOP (forward pass, one
  multiply-accumulate per weight, counted as 2 ops)
- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token if compute-bound
- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC
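
Written out, with $P$ the parameter count and $R$ the peak INT8 rate, and counting each multiply-accumulate as two operations, the compute-bound estimate is:

```latex
t_{\text{compute}} \approx \frac{2P}{R}
  = \frac{2 \times 1.5\times10^{9}\ \text{op}}{6\times10^{12}\ \text{op/s}}
  = 0.5\ \text{ms/token}
\quad\Longrightarrow\quad
\frac{1}{t_{\text{compute}}} \approx 2000\ \text{tok/s}
```

A matching memory-bound estimate has the same form, $t_{\text{mem}} \approx W / B$, where $W$ is the number of weight bytes streamed per token (about 1.5 GB at INT8) and $B$ is the effective DRAM bandwidth seen by the NPU. $B$ has not been measured yet, so no number is given here; single-token decode is typically limited by this term rather than by $t_{\text{compute}}$.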

Realistic upper bound on first cut: probably 5-15 tok/s. Memory
bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls
the actual CPU-only number; Phase-8 measures how close we get with NPU
in the loop. **The goal is "credible CPU+NPU mix that's faster than
pure CPU," not "saturate the 6 TOPS rating."**

---

## What this audit unblocks

After Phase 1 completes (the table above is filled), Phase 2 can pick:
- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough
  along) or write directly against the MMIO regs (if not)
- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP)
- which userspace shim we use (Tomeu's, write our own, or fork
  llama.cpp's CUDA-style backend pattern for ggml)
@@ -0,0 +1,55 @@
# rosenblatt fleet manifest: boltzmann (Rock 5 ITX+, RK3588)
#
# Phase-1 audit host. Always-on, 32 GB DDR4, NVMe rootfs. NPU silicon
# present + accessible via Rockchip-BSP vendor module today; mainline
# path TBD (see docs/npu-mainline-status.md).

host: boltzmann
arch: arm64
soc: rockchip/rk3588
board: rock-5-itx-plus
distro: archlinuxarm  # ALARM aarch64; boltzmann is the umbrella RK3588 host
role: primary-development  # not yet primary-target (laptop targets land later)

hardware:
  cpu: 4×Cortex-A76 (2.4 GHz) + 4×Cortex-A55 (1.8 GHz)
  ram: 32 GB DDR4-2666
  storage: NVMe (rootfs) + microSD (recovery)
  npu:
    cores: 3
    tops_int8_per_core: 2  # ~2 TOPS INT8 per core, 6 TOPS aggregate (theoretical peak)
    local_sram_per_core_mib: 2
    power_domain: pd_npu

# Phase-1 audit fills these (pending boltzmann inspection)
kernel:
  running_version: TBD  # uname -r snapshot at audit time
  source: TBD           # mainline torvalds / mmind-rockchip / custom
  npu_driver: TBD       # vendor rockchip-npu / mainline rknpu / none

userspace:
  rknn_vendor_runtime_installed: false  # commitment: stay mainline-clean
  llama_cpp_installed: TBD              # via marfrit-packages or built-from-source

baseline_measurement:
  pending: true
  target: |
    llama.cpp pure-CPU tok/s on qwen2.5-1.5b-instruct-q4_k_m.gguf,
    3 runs, median wallclock. Use llama-bench from llama.cpp/build/bin.
  ground_truth_file: benchmarks/2026-XX-XX_boltzmann_qwen1.5b_cpu_baseline.json

bringup_sequence:
  1: substrate audit (docs/npu-mainline-status.md table filled)
  2: npu-probe runs successfully (open device → 64×64 INT8 matmul → bit-match CPU ref)
  3: llama.cpp pure-CPU baseline captured
  4: rknpu ggml backend skeleton compiles
  5: first llama.cpp matmul offload working on a single layer
  6: full forward pass via NPU for one decode step
  7: tok/s vs baseline measured

backup_host: ampere  # CoolPi GenBook, port-validation target. Phase-2+ scope.

reverse_dependencies:
  - Quark (boltzmann UEFI) - must stay bootable across kernel-rev experiments
  - Neutron (boltzmann kernel build) - provides the kernel we tweak for rknpu
  - Volta (boltzmann umbrella) - Rosenblatt is the third Volta-child after Quark + Neutron
@@ -0,0 +1,24 @@
# kernel/

Mainline-bound kernel patches: DT bindings, rknpu driver tweaks,
power-domain wiring for boltzmann's `rk3588-rock-5-itx-plus.dts`.

Empty until Phase 1 audit identifies what's actually missing in mainline.

If Tomeu Vizoso's rknpu series is far enough along to use as-is, this
directory may stay nearly empty; we'd just carry a small DT-overlay
patch for boltzmann's board file.

If we end up writing a driver from scratch (worst case), the structure
will mirror an upstream submission layout:
```
0001-dt-bindings-npu-add-rockchip-rk3588-npu.patch
0002-arm64-dts-rockchip-rk3588-add-npu-node.patch
0003-arm64-dts-rockchip-rock-5-itx-plus-enable-npu.patch
0004-accel-rknpu-add-rockchip-rk3588-driver.patch
...
```

Tracking-wise these go through `marfrit/kernel-agent` as scope
`patches/driver/accel/rknpu/`, with `fleet/boltzmann-rosenblatt.yaml`
as the consuming manifest.
@@ -0,0 +1,32 @@
# llm-runtime

llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend.

Code lands here starting at Phase 5 (Plan); Phase 1 is too early.

Until then, this directory holds:
- design notes (`docs/architecture.md` from project root is authoritative)
- the eventual `ggml-rknpu/` backend source
- patch series for upstream submission if quality reaches that bar

## Approach

Two paths to consider in Phase 5:

1. **Fork llama.cpp, add backend in tree.** Easier to keep in sync;
   harder to upstream because llama.cpp may not want a Rockchip-specific
   backend that depends on a still-WIP mainline driver.
2. **Out-of-tree backend, load via llama.cpp's plugin API
   (`-DGGML_BACKEND_DL=ON`).** Cleaner separation; tracks llama.cpp
   upstream without our diff being in the way. Recommended unless we
   need to patch core llama.cpp logic.

Decision deferred to Phase 5.

## Model

Phase-1 target: `qwen2.5-1.5b-instruct-q4_k_m.gguf`. Source:
hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF or built locally with
llama-quantize.

Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B.
@@ -0,0 +1,33 @@
# npu-probe

Smallest-possible userspace binary that:
1. Opens the NPU device (path TBD per Phase-1 audit)
2. Allocates two INT8 input tensors (64×64) + one output (64×64)
3. Submits a matmul via the uAPI in use (Tomeu's accel ioctl OR our own
   shim around vendor MMIO if accel-mainline isn't ready)
4. Waits for completion (DMA fence or polled completion register)
5. Reads back the output
6. Compares to a CPU INT8 matmul reference; reports pass/fail
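
Step 6 is the one part that can be written before the uAPI is known. Below is a sketch of the CPU reference and the comparison, assuming row-major layout and an INT32 accumulator/output; the NPU's actual output type and layout are still open, and `matmul_i8_ref` / `count_mismatches` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* C[M x N] = A[M x K] * B[K x N], row-major, INT8 inputs, INT32 accumulation. */
void matmul_i8_ref(const int8_t *a, const int8_t *b, int32_t *c,
                   size_t M, size_t N, size_t K)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; k++)
                acc += (int32_t)a[i * K + k] * (int32_t)b[k * N + j];
            c[i * N + j] = acc;
        }
    }
}

/* Number of elements where the device output differs from the reference;
 * 0 means a bit-exact match. */
size_t count_mismatches(const int32_t *got, const int32_t *want, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (got[i] != want[i])
            bad++;
    return bad;
}
```

For the default 64×64 case the largest possible accumulator magnitude is 64 × 128 × 128 = 1,048,576, comfortably inside INT32, so overflow is not a concern for the probe.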

**Phase-1 deliverable.** Until this works, nothing else in this repo
can be exercised against real silicon.

## Build

_(filled in when the Phase-1 audit picks the uAPI shape; `meson` or `cmake`,
no autotools)_

## Run

```
./npu-probe                                 # default 64×64 INT8 matmul
./npu-probe --shape 128,128,128             # M,N,K override
./npu-probe --device /dev/accel/accel0      # override device path
./npu-probe --golden golden_64x64.bin       # provide expected output for diff
```
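
The flags above map directly onto `getopt_long`. A minimal parsing sketch follows; `struct probe_opts` and `parse_probe_opts` are hypothetical names, and the defaults simply mirror the 64×64 case listed above.

```c
#include <getopt.h>
#include <stdio.h>

struct probe_opts {
    unsigned m, n, k;
    const char *device;   /* NULL means "auto-detect" */
    const char *golden;   /* NULL means "compare against the CPU reference" */
};

/* Parse argv into *o. Returns 0 on success, -1 on a malformed or
 * unknown option. */
int parse_probe_opts(int argc, char **argv, struct probe_opts *o)
{
    static const struct option longopts[] = {
        { "shape",  required_argument, NULL, 's' },
        { "device", required_argument, NULL, 'd' },
        { "golden", required_argument, NULL, 'g' },
        { NULL, 0, NULL, 0 },
    };
    o->m = o->n = o->k = 64;
    o->device = NULL;
    o->golden = NULL;

    optind = 1;   /* restart scanning so the function can be called again */
    int c;
    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        switch (c) {
        case 's':
            if (sscanf(optarg, "%u,%u,%u", &o->m, &o->n, &o->k) != 3)
                return -1;
            break;
        case 'd':
            o->device = optarg;
            break;
        case 'g':
            o->golden = optarg;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

Keeping parsing separate from the device code lets the option handling be tested on any machine, with or without an NPU.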

## Why C, not Python

Direct ioctl + dmabuf + mmap. A Python wrapper layer would obscure the
exact syscall sequence we need to understand. Once npu-probe works,
a Python binding for benchmark scripts is fine.