marfrit c9a3f5c600 Rosenblatt Phase-1 closeout: rocket-driver substrate inventory

Phase-1 audit closes with a substantively different picture than the
original scaffold's TBDs:

- Tomeu Vizoso's RK3588 NPU work merged in Linux 6.18 (Nov 2025) under
  codename `rocket` (NOT `rknpu`).  All references updated.
- Boltzmann's `linux-rk3588-marfrit-A1` (7.0.0-rc3-ARCH+) already ships
  `drivers/accel/rocket/rocket.ko` as a built-but-not-loaded module.
- DT bindings + per-core nodes (`npu@fdab/c/d_0000`,
  compatible `rockchip,rk3588-rknn-core`) in mainline since 6.18 but
  ship `status = "disabled"` — board enable is the Phase-2 unblock,
  not a driver port.
- Mesa 25.3 ships Rocket Gallium + Teflon TFLite delegate as the
  authoritative userspace reference for the uAPI shape.
- Op coverage today is conv-centric (MobileNet-class); transformer
  matmul needs the conv-1×1 shoehorn (RKNPU2 BSP precedent) or rocket
  op-set additions.  Surfaced as Phase-2-load-bearing risk.
- IOMMU v1.0 hazard: 32 GB host needs `mem=4G` or local
  `rockchip,rk3568-iommu-v1` discriminator patches before the first
  NPU job, to avoid DMA-window faults.
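
The conv-1×1 shoehorn mentioned above rests on a plain identity: a matmul `X @ W` equals a 1×1 convolution in which the columns of `X` are treated as input channels and each column of `W` is one filter. The sketch below checks that identity in pure Python. It says nothing about how the RKNPU2 BSP or rocket actually lay out tensors; the `(channels, height, width)` layout and all function names are assumptions made for illustration.

```python
# Sketch: express Y = X @ W as a 1x1 convolution (layout is an assumption).
# X: (M, K) activations, W: (K, N) weights.

def matmul(x, w):
    """Reference matmul with plain nested loops."""
    k, n = len(w), len(w[0])
    return [[sum(row[c] * w[c][j] for c in range(k)) for j in range(n)]
            for row in x]

def conv1x1(fmap, filters):
    """fmap[c][h][v] has C channels; filters[o][c]; returns out[o][h][v]."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(filters[o][c] * fmap[c][h][v] for c in range(channels))
              for v in range(width)]
             for h in range(height)]
            for o in range(len(filters))]

def matmul_via_conv1x1(x, w):
    m, k, n = len(x), len(w), len(w[0])
    # Rows of X become spatial positions (H = M, W = 1); columns become channels.
    fmap = [[[x[i][c]] for i in range(m)] for c in range(k)]
    # Each column of W becomes one 1x1 filter over the K channels.
    filters = [[w[c][j] for c in range(k)] for j in range(n)]
    out = conv1x1(fmap, filters)  # indexed as out[j][i][0]
    return [[out[j][i][0] for j in range(n)] for i in range(m)]

X = [[1, 2, 3], [4, 5, 6]]
W = [[1, 0], [0, 1], [2, 2]]
assert matmul_via_conv1x1(X, W) == matmul(X, W)
```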

Files:
- docs/npu-mainline-status.md: full audit table with upstream pointers
  (kernel.org / Mesa docs / dri-devel patch URLs / Tomeu's "we are in
  mainline" blog post).
- docs/phases.md: per-phase log entry for Phase-1 closeout.
- docs/op-coverage.md: matmul-vs-conv-vs-rocket-op-set framing.
- fleet/boltzmann.yaml: audited kernel + npu_driver + dt_npu_nodes
  state.
- kernel/dt-overlays/rk3588-rosenblatt-npu-enable.dtso: overlay to
  flip the three rknn-core nodes to "okay" (+ matching mmu nodes),
  carries the IOMMU-mitigation warning inline.
- kernel/README.md: kernel-agent scope wiring + anticipated local
  carry patches.
- README.md: phase-status table + "rknpu → rocket" rename note.
- TODO.md: Phase-2 unblock concrete steps + standing
  upstream-watch items.

2026-05-19 12:41:31 +00:00

5.6 KiB


Rosenblatt

Codename: Frank Rosenblatt built the Mark I Perceptron in 1958 — the first hardware neural network (400 photocells, stepper-motor-tunable analog weights). This project lights up the RK3588 NPU on mainline Linux, so the OSS world finally owns the silicon-side of inference on that chip.

Scope (Phase 1): a small LLM running on a CPU + NPU mix on boltzmann (Rock 5 ITX+, RK3588, 32 GB DDR4). Backend: llama.cpp with a new rknpu device that offloads the heavy GEMM (the matmuls in attention and FFN) to the NPU's INT8 path, while dequant, RoPE, softmax, sampling, and embedding lookup stay on the A76 NEON cores.
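
A minimal sketch of the split described above: matmul goes to the NPU when it is available, everything else stays on the CPU, and the CPU is always the fallback. The op names are loosely modelled on ggml op kinds but are illustrative strings, not the real enum, and `pick_device` / `plan` are hypothetical helpers, not part of llama.cpp.

```python
# Illustrative op names only; the real backend would key on ggml's op enum.
NPU_OPS = {"MUL_MAT"}                                    # heavy GEMM
CPU_OPS = {"DEQUANT", "ROPE", "SOFT_MAX", "SAMPLE", "GET_ROWS"}

def pick_device(op, npu_available, supported_on_npu=lambda op: True):
    """Route an op to 'npu' only if it is a GEMM the NPU can take."""
    if op in NPU_OPS and npu_available and supported_on_npu(op):
        return "npu"
    return "cpu"  # fallback path: anything else runs on the A76 cores

def plan(graph, npu_available):
    """Return (op, device) pairs for a linear list of ops."""
    return [(op, pick_device(op, npu_available)) for op in graph]

if __name__ == "__main__":
    layer = ["GET_ROWS", "MUL_MAT", "ROPE", "SOFT_MAX", "MUL_MAT"]
    for op, device in plan(layer, npu_available=True):
        print(f"{op:10s} -> {device}")
```

With `npu_available=False` the same plan degrades to pure CPU, which is the behaviour a missing `/dev/accel` node should produce.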

Target model (Phase 1): qwen2.5-1.5B-instruct Q4_K_M GGUF. It fits in the NPU's accessible memory budget, is chat-tuned, and has a public license. Stretch: qwen2.5-3B, gemma3-2B.
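
A back-of-envelope sketch of the "fits in the budget" claim. Everything numeric here is an assumption: 1.5e9 is the nominal parameter count from the model name, 4.5 bits/weight is the plain Q4_K block rate (256 weights in 144 bytes), and Q4_K_M keeps some tensors at higher precision, so the real file is larger. The overhead fraction and the budget passed in are placeholders, not measured values.

```python
def weight_bytes(n_params, bits_per_weight):
    """Size of the quantized weights alone, in bytes."""
    return n_params * bits_per_weight / 8

def fits(n_params, bits_per_weight, budget_bytes, overhead_fraction=0.25):
    """Check weights plus a placeholder overhead (KV cache, scratch) against a budget.

    overhead_fraction is a guess for illustration, not a measurement.
    """
    need = weight_bytes(n_params, bits_per_weight) * (1 + overhead_fraction)
    return need <= budget_bytes, need

if __name__ == "__main__":
    GIB = 2 ** 30
    ok, need = fits(1_500_000_000, 4.5, budget_bytes=4 * GIB)  # 4 GiB is a placeholder
    print(f"estimated need: {need / GIB:.2f} GiB, fits: {ok}")
```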

Out of scope (Phase 1, capture separately if pursued):

- Vision helper (object detection / OCR / face-blur): different op mix, re-scope after Phase-1 numbers
- RKNPU vendor SDK adoption: we want mainline-clean, not vendor-blob
- Other Rockchip NPUs (RK3576 has the same NPU IP block and should port for free once the RK3588 path lands, but defer until Phase-1 closes)

Non-goal: parity with the rknn-llm vendor stack on day 1. The vendor has hand-tuned tensor layouts and quantization; we'll be slower at first. The goal is a credible result, defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the NPU actually doing the bulk of the GEMM work. The number itself isn't the point; the open path to it is.
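
The credibility bar can be written down as a check, here combined with the "three runs, median" rule from the Phase-4 row below. This is a sketch: reading "bulk of the GEMM work" as more than half is an assumption, and the sample runs are made-up numbers, not measurements from boltzmann.

```python
from statistics import median

GOAL_TOK_PER_S = 1.0  # from the goal statement: >= 1 tok/s sustained

def tokens_per_second(n_tokens, wall_seconds):
    return n_tokens / wall_seconds

def summarize(runs):
    """runs: list of (n_tokens, wall_seconds); returns the median tok/s."""
    return median(tokens_per_second(n, s) for n, s in runs)

def meets_goal(runs, npu_gemm_share, min_share=0.5):
    """min_share=0.5 encodes 'bulk' as 'more than half' (an assumption)."""
    return summarize(runs) >= GOAL_TOK_PER_S and npu_gemm_share > min_share

# Made-up sample: 128 generated tokens per run, three wall-clock times.
SAMPLE_RUNS = [(128, 100.0), (128, 80.0), (128, 160.0)]

if __name__ == "__main__":
    print(f"median tok/s: {summarize(SAMPLE_RUNS):.2f}")
    print("goal met:", meets_goal(SAMPLE_RUNS, npu_gemm_share=0.8))
```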

Phases (9 + 1 loop)

| # | Phase | Deliverable |
|---|---|---|
| 1 | Substrate | Audit mainline NPU driver state (Tomeu Vizoso's rocket DRM-accel series); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. |
| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. |
| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's rocket uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code; extract the spec. |
| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. |
| 5 | Plan | rknpu backend interface design: where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. |
| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. |
| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against CPU reference (Q4_K dequant + fp32 matmul). |
| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. |
| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. |
| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). |
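
For the Phase-7 comparison, a minimal sketch of an INT8 GEMM checked against a float reference. It assumes symmetric per-tensor quantization, which is not necessarily the scheme the NPU or Q4_K uses. Note that an INT8 result generally differs from a float reference by quantization error, so this sketch reports the maximum absolute error; a bit-exact check would apply to the integer accumulators against a CPU implementation of the same integer arithmetic.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor quantization to [-127, 127] (assumed scheme)."""
    amax = max(abs(v) for row in matrix for v in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def matmul(a, b):
    """Triple-loop matmul; with integer inputs, Python ints act as a wide accumulator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def int8_gemm(x, w):
    """Quantize both operands, multiply in integers, rescale to float."""
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(w)
    acc = matmul(xq, wq)  # integer accumulators
    return [[v * sx * sw for v in row] for row in acc]

def max_abs_error(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

X = [[0.5, -1.0], [0.25, 0.75]]
W = [[1.0, -0.5], [0.5, 0.25]]

if __name__ == "__main__":
    err = max_abs_error(int8_gemm(X, W), matmul(X, W))
    print(f"max abs error vs float reference: {err:.5f}")
```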

Per feedback_dev_process.md: rewind to Phase 1 on blocker, Phase 4 on direction change, Phase 0 on scope change.
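
The rewind rule reads naturally as a small lookup. A sketch, assuming a rewind never moves the campaign forward (the `min` below) and that phases otherwise advance one at a time up to 10; the event names and `next_phase` are hypothetical.

```python
LAST_PHASE = 10
REWIND_TO = {"blocker": 1, "direction_change": 4, "scope_change": 0}

def next_phase(current, event=None):
    """Advance by one phase, or rewind according to the rule above."""
    if event is None:
        return min(current + 1, LAST_PHASE)
    # Assumption: a rewind target later than the current phase is a no-op.
    return min(current, REWIND_TO[event])

if __name__ == "__main__":
    print(next_phase(7, "blocker"))           # back to Phase 1
    print(next_phase(7, "direction_change"))  # back to Phase 4
    print(next_phase(2, "direction_change"))  # stays at Phase 2
```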

Repo layout

rosenblatt/
├── README.md              this file
├── TODO.md                rolling punch-list
├── docs/
│   ├── npu-mainline-status.md    Phase-1 audit
│   ├── architecture.md           CPU+NPU split, ggml backend shape
│   └── phases.md                 per-phase log (analog to ~/src/bin/phases/)
├── kernel/                       mainline-bound patches (DT bindings, rocket driver tweaks)
├── userspace/
│   ├── npu-probe/                smallest-possible "open device + run trivial matmul" sanity
│   └── llm-runtime/              llama.cpp fork with rknpu backend
├── fleet/
│   └── boltzmann.yaml            host manifest (kernel + NPU driver pin, baseline measurement)
└── benchmarks/                   reproducible bench scripts + recorded results (JSON + plots)

Host

Primary: boltzmann (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs).

- Already runs mainline ~v7.0 with most peripheral drivers working.
- Has the Quark UEFI / Neutron kernel stack; the NPU is the next missing peripheral.
- Other RK3588 hosts (ampere = CoolPi GenBook) come later for port-validation.

Why not ampere: it is a laptop with intermittent power and is in use for other campaigns. Boltzmann is always-on with 32 GB headroom, which makes it the right substrate for kernel hacking with a serial-console fallback (when Quark exposes one).

Codename rationale

Rosenblatt's Mark I was custom analog hardware doing fixed-function matmul-adjacent work (weighted-sum + threshold), with weights tunable per slot via mechanical control. The RK3588 NPU is fixed-function INT8 matmul/conv hardware with weights loaded per inference. Same shape, 68 years later, with the same "how do we drive this thing from a general-purpose computer?" problem. The 1958 paper's answer was: build a control panel. The 2026 answer is: a DRM accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA. We're writing the second half.

Status

| Phase | State | Date |
|---|---|---|
| 0 (bootstrap) | done | 2026-05-19 |
| 1 (substrate audit) | done | 2026-05-19 |
| 2 (formulate) | open | |
| 3..10 | pending | |

Phase-1 closeout: docs/phases.md + docs/npu-mainline-status.md. Headline: mainline driver name is rocket (not rknpu); it's already shipped in boltzmann's kernel as a built module. Phase-2 unblock is small (DT enable + IOMMU v1.0 mitigation + modprobe), not a driver port.
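
A minimal probe sketch for checking where the Phase-2 unblock stands on a host. It assumes the usual Linux conventions: accel devices appear as `/dev/accel/accel*`, and loaded modules are listed by name in the first column of `/proc/modules`. The status strings and the `report` helper are illustrative, not part of `userspace/npu-probe/`.

```python
import glob
import os

def accel_nodes(dev_dir="/dev/accel"):
    """Return accel device nodes, or an empty list if none exist."""
    return sorted(glob.glob(os.path.join(dev_dir, "accel*")))

def module_loaded(name, proc_modules="/proc/modules"):
    """True if `name` is the first field of any line in /proc/modules."""
    try:
        with open(proc_modules) as fh:
            return any(line.split()[0] == name for line in fh if line.strip())
    except OSError:
        return False

def report(dev_dir="/dev/accel", proc_modules="/proc/modules"):
    nodes = accel_nodes(dev_dir)
    if nodes:
        return "accel node present: " + ", ".join(nodes)
    if module_loaded("rocket", proc_modules):
        return "rocket loaded but no accel node (check DT status of the NPU cores)"
    return "rocket not loaded (module built but needs modprobe after DT enable)"

if __name__ == "__main__":
    print(report())
```

On a host without the DT enable, this is expected to land in the last branch, which matches the "built-but-not-loaded" state from the audit.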
