c9a3f5c600
Phase-1 audit closes with a substantively different picture than the original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under codename `rocket` (NOT `rknpu`). All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`, compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but ship `status = "disabled"`; board enable is the Phase-2 unblock, not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket op-set additions. Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local `rockchip,rk3568-iommu-v1` discriminator patches before the first NPU job, to avoid DMA-window faults.

Files:

- docs/npu-mainline-status.md: full audit table with upstream pointers (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to flip the three rknn-core nodes to "okay" (+ matching mmu nodes), carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing upstream-watch items.
# Rosenblatt

**Codename:** Frank Rosenblatt built the Mark I Perceptron in 1958: the first
hardware neural network (400 photocells, stepper-motor-tunable analog weights).
This project lights up the RK3588 NPU on mainline Linux, so the OSS world
finally owns the silicon side of inference on that chip.

**Scope (Phase 1):** a small LLM running on a CPU + NPU mix on `boltzmann` (Rock 5
ITX+, RK3588, 32 GB DDR4). Backend: `llama.cpp` with a new `rknpu` device that
offloads the heavy GEMM (matmul in attention + FFN) to the NPU's INT8 path
while leaving dequant / RoPE / softmax / sampling / embedding lookup on the
A76 NEON cores.
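
The CPU/NPU split described above can be sketched in plain Python. This is a minimal illustration, assuming symmetric per-tensor INT8 quantization; the helper names (`quantize`, `int8_gemm`) and the sample matrices are hypothetical, and the real quantization scheme is something Phase 3 still has to extract. Only the integer `matmul` call stands in for the part the NPU would run.

```python
def quantize(values, bits=8):
    """Map floats to signed integers with one shared scale (symmetric, per tensor)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8
    peak = max(abs(v) for row in values for v in row)
    scale = peak / qmax if peak else 1.0  # all-zero input: any scale works
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in values]
    return q, scale


def matmul(a, b):
    """Plain row-by-column matmul; works for ints and floats."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def int8_gemm(a, b):
    """Quantize on the CPU, multiply in integers, dequantize on the CPU."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = matmul(qa, qb)                  # integer accumulation (the offloaded step)
    return [[v * sa * sb for v in row] for row in acc]


# Illustrative values only, not taken from any model.
activations = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
weights = [[0.2, -0.4], [0.1, 0.3], [-0.6, 0.05]]

reference = matmul(activations, weights)
approx = int8_gemm(activations, weights)
max_err = max(abs(r - q) for rr, qr in zip(reference, approx) for r, q in zip(rr, qr))
print(f"max abs error vs float reference: {max_err:.4f}")
```

The quantized result only approximates the float reference, so any comparison against a CPU path needs an explicit tolerance or an integer reference to compare against.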

**Target model (Phase 1):**
`qwen2.5-1.5B-instruct` Q4_K_M GGUF. It fits in the NPU's accessible memory
budget, is chat-tuned, and has a public license. Stretch: `qwen2.5-3B`,
`gemma3-2B`.

**Out of scope (Phase 1, capture separately if pursued):**
- Vision helper (object detection / OCR / face-blur): different op mix,
  re-scope after Phase-1 numbers
- RKNPU vendor SDK adoption: we want mainline-clean, not vendor-blob
- Other Rockchip NPUs (RK3576 has the same NPU IP block and should port for
  free once the RK3588 path lands, but defer until Phase-1 closes)

**Not a goal: parity with the rknn-llm vendor stack on day 1.** The vendor has
hand-tuned tensor layouts + quantization; we'll be slower at first. The goal
is *credible*, defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the
NPU actually doing the bulk of the GEMM work. The number itself isn't the
point; the open path to it is.
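
The *credible* bar and the three-run median planned for the Phase-4 baseline reduce to a small calculation. The sketch below assumes each run is recorded as a (tokens generated, wall-clock seconds) pair; the function names and the sample numbers are placeholders for illustration, not measurements from boltzmann.

```python
import statistics

CREDIBLE_TOK_PER_S = 1.0  # threshold from the "credible" definition


def tokens_per_second(tokens_generated, wall_seconds):
    """Throughput of a single run."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return tokens_generated / wall_seconds


def summarize(runs):
    """runs: list of (tokens_generated, wall_seconds) tuples, one per run."""
    rates = [tokens_per_second(t, s) for t, s in runs]
    median = statistics.median(rates)
    return {"rates": rates, "median": median, "credible": median >= CREDIBLE_TOK_PER_S}


# Placeholder runs, NOT real results.
example = summarize([(128, 100.0), (128, 80.0), (128, 160.0)])
print(example)
```

Using the median rather than the mean keeps one thermally throttled or cache-cold run from dominating the reported figure.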

---

## Phases (9 + 1 loop)

| # | Phase | Deliverable |
|---|---|---|
| 1 | **Substrate** | Audit mainline NPU driver state (Tomeu Vizoso's DRM-accel series, merged as `rocket`, not `rknpu`); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. |
| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. |
| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's `rocket` uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code; extract the spec. |
| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. |
| 5 | Plan | rknpu backend interface design: where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. |
| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. |
| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against CPU reference (Q4_K dequant + fp32 matmul). |
| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. |
| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. |
| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). |
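
Phase 2's choice of matmul shape interacts with the conv-centric op coverage found in the Phase-1 audit: a matmul can be expressed as a 1×1 convolution. The sketch below shows that equivalence in plain Python. The tensor layout (channels last, the M rows treated as an M×1 image) is an assumption made for readability and says nothing about the layout the hardware or `rocket` actually expects.

```python
def matmul(a, b):
    """Reference: (M x K) @ (K x N) -> (M x N)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def conv1x1(feature_map, kernels):
    """1x1 convolution, stride 1, no padding.

    feature_map: [H][W][C_in]   activations, channels last
    kernels:     [C_out][C_in]  one weight vector per output channel
    returns:     [H][W][C_out]
    """
    return [
        [[sum(px[c] * k[c] for c in range(len(k))) for k in kernels] for px in row]
        for row in feature_map
    ]


def matmul_via_conv1x1(a, b):
    """Rows of `a` become pixels with K channels; columns of `b` become N kernels."""
    feature_map = [[row] for row in a]          # H=M, W=1, C_in=K
    kernels = [list(col) for col in zip(*b)]    # C_out=N, each of length K
    out = conv1x1(feature_map, kernels)         # [M][1][N]
    return [pixel_row[0] for pixel_row in out]  # drop the W axis -> [M][N]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
assert matmul_via_conv1x1(a, b) == matmul(a, b)
print(matmul_via_conv1x1(a, b))
```

Whether this mapping is efficient on the NPU depends on how well K and N line up with its channel granularity, which is the question Phase 2 is meant to answer.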

Per `feedback_dev_process.md`: rewind to Phase 1 on blocker, Phase 4 on
direction change, Phase 0 on scope change.

---

## Repo layout

```
rosenblatt/
├── README.md                    this file
├── TODO.md                      rolling punch-list
├── docs/
│   ├── npu-mainline-status.md   Phase-1 audit
│   ├── architecture.md          CPU+NPU split, ggml backend shape
│   └── phases.md                per-phase log (analogous to ~/src/bin/phases/)
├── kernel/                      mainline-bound patches (DT bindings, rocket driver tweaks)
├── userspace/
│   ├── npu-probe/               smallest-possible "open device + run trivial matmul" sanity
│   └── llm-runtime/             llama.cpp fork with rknpu backend
├── fleet/
│   └── boltzmann.yaml           host manifest (kernel + NPU driver pin, baseline measurement)
└── benchmarks/                  reproducible bench scripts + recorded results (JSON + plots)
```

---

## Host

Primary: **boltzmann** (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs).
- Already runs mainline ~v7.0 with most peripheral drivers working.
- Has the Quark UEFI / Neutron kernel stack; the NPU is the next missing peripheral.
- Other RK3588 hosts (`ampere` = CoolPi GenBook) come later for port-validation.

Why not `ampere`: it is a laptop with intermittent power, and it is in use for other campaigns.
Boltzmann is always-on with 32 GB headroom, the right substrate for kernel
hacking with serial-console fallback (when [Quark](https://git.reauktion.de/marfrit/quark) exposes one).

---

## Codename rationale

Rosenblatt's Mark I was custom analog hardware doing fixed-function
matmul-adjacent work (weighted-sum + threshold), with weights tunable per slot via
mechanical control. The RK3588 NPU is fixed-function INT8 matmul/conv hardware
with weights loaded per inference. Same shape, 68 years later, with the same
"how do we drive this thing from a general-purpose computer?" problem. The
1958 paper's answer was: build a control panel. The 2026 answer is: a DRM
accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA.
We're writing the second half.

---

## Status

| Phase | State | Date |
|---|---|---|
| 0: bootstrap | done | 2026-05-19 |
| 1: substrate audit | done | 2026-05-19 |
| 2: formulate | open | |
| 3..10 | pending | |

Phase-1 closeout: `docs/phases.md` + `docs/npu-mainline-status.md`.
Headline: the mainline driver name is **`rocket`** (not `rknpu`); it
already ships in boltzmann's kernel as a built-but-not-loaded module. The Phase-2
unblock is small (DT enable + IOMMU v1.0 mitigation + modprobe),
not a driver port.
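
Once the DT enable and `modprobe` steps are done, a quick sanity check is whether the `rocket` module is loaded and a `/dev/accel/*` node exists. The sketch below is one way to check that from userspace. It assumes the usual `/dev/accel/accelN` node naming and the `/proc/modules` format; the function names are hypothetical and this is not the project's `npu-probe` tool.

```python
from pathlib import Path


def accel_nodes(dev_root="/dev"):
    """Return sorted accel device nodes (e.g. /dev/accel/accel0), or [] if none exist."""
    accel_dir = Path(dev_root) / "accel"
    if not accel_dir.is_dir():
        return []
    return sorted(str(p) for p in accel_dir.glob("accel*"))


def module_loaded(name, proc_modules="/proc/modules"):
    """True if `name` is the first field of some line in /proc/modules."""
    try:
        lines = Path(proc_modules).read_text().splitlines()
    except OSError:
        return False  # file missing or unreadable: treat as not loaded
    return any(line.split()[0] == name for line in lines if line.strip())


def phase2_ready(dev_root="/dev", proc_modules="/proc/modules"):
    """Summarize whether the driver is loaded and exposes at least one accel node."""
    nodes = accel_nodes(dev_root)
    loaded = module_loaded("rocket", proc_modules)
    return {"rocket_loaded": loaded, "accel_nodes": nodes, "ready": loaded and bool(nodes)}


print(phase2_ready())
```

The paths are parameters so the check can be pointed at a fixture directory in tests instead of the live system. It reports presence only and does not verify that the IOMMU mitigation is in place.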