# Rosenblatt

**Codename:** Frank Rosenblatt built the Mark I Perceptron in 1958, the first hardware neural network (400 photocells, stepper-motor-tunable analog weights). This project lights up the RK3588 NPU on mainline Linux, so the OSS world finally owns the silicon side of inference on that chip.

**Scope (Phase 1):** small LLM running on a CPU + NPU mix on `boltzmann` (Rock 5 ITX+, RK3588, 32 GB DDR4). Backend: `llama.cpp` with a new `rknpu` device that offloads the heavy GEMM (matmul in attention + FFN) to the NPU's INT8 path while leaving dequant / RoPE / softmax / sampling / embedding lookup on the A76 NEON cores.

**Target model (Phase 1):** `qwen2.5-1.5B-instruct` Q4_K_M GGUF. Fits in the NPU's accessible memory budget, has chat tuning and a public license. Stretch: `qwen2.5-3B`, `gemma3-2B`.

**Out of scope (Phase 1, capture separately if pursued):**

- Vision helper (object detection / OCR / face-blur): different op mix, re-scope after Phase-1 numbers
- RKNPU vendor SDK adoption: we want mainline-clean, not vendor-blob
- Other Rockchip NPUs (RK3576 has the same NPU IP block and should port for free once the RK3588 path lands, but defer until Phase-1 closes)

**Non-goal: parity with the rknn-llm vendor stack on day 1.** The vendor has hand-tuned tensor layouts + quantization; we'll be slower at first. The goal is *credible*, defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the NPU actually doing the bulk of the GEMM work. The number itself isn't the point; the open path to it is.

---

## Phases (9 + 1 loop)

| # | Phase | Deliverable |
|---|---|---|
| 1 | **Substrate** | Audit mainline NPU driver state (Tomeu Vizoso's rknpu / DRM-accel series); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. |
| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. |
| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's rknpu uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code; extract the spec. |
| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. |
| 5 | Plan | rknpu backend interface design: where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. |
| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. |
| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against CPU reference (Q4_K dequant + fp32 matmul). |
| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. |
| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. |
| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). |

Per `feedback_dev_process.md`: rewind to Phase 1 on blocker, Phase 4 on direction change, Phase 0 on scope change.

---

## Repo layout

```
rosenblatt/
├── README.md                    this file
├── TODO.md                      rolling punch-list
├── docs/
│   ├── npu-mainline-status.md   Phase-1 audit
│   ├── architecture.md          CPU+NPU split, ggml backend shape
│   └── phases.md                per-phase log (analog to ~/src/bin/phases/)
├── kernel/                      mainline-bound patches (DT bindings, rknpu driver tweaks)
├── userspace/
│   ├── npu-probe/               smallest-possible "open device + run trivial matmul" sanity
│   └── llm-runtime/             llama.cpp fork with rknpu backend
├── fleet/
│   └── boltzmann.yaml           host manifest (kernel + NPU driver pin, baseline measurement)
└── benchmarks/                  reproducible bench scripts + recorded results (JSON + plots)
```

---

## Host

Primary: **boltzmann** (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs).

- Already runs mainline ~v7.0 with most peripheral drivers working.
- Has the Quark UEFI / Neutron kernel stack; the NPU is the next missing peripheral.
- Other RK3588 hosts (`ampere` = CoolPi GenBook) come later for port-validation.

Why not `ampere`: laptop, intermittent power, in use for other campaigns. Boltzmann is always-on with 32 GB headroom, the right substrate for kernel hacking with serial-console fallback (when [Quark](https://git.reauktion.de/marfrit/quark) exposes one).

---

## Codename rationale

Rosenblatt's Mark I was custom analog hardware doing fixed-function matmul-adjacent work (weighted-sum + threshold), with weights tunable per slot via mechanical control. The RK3588 NPU is fixed-function INT8 matmul/conv hardware with weights loaded per inference. Same shape, 68 years later, with the same "how do we drive this thing from a general-purpose computer?" problem.

The 1958 paper's answer was: build a control panel. The 2026 answer is: a DRM accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA. We're writing the second half.

---

## Status

| Phase | State | Date |
|---|---|---|
| 0: bootstrap | done | 2026-05-19 |
| 1: substrate audit | done | 2026-05-19 |
| 2: formulate | open | |
| 3..10 | pending | |

Phase-1 closeout: `docs/phases.md` + `docs/npu-mainline-status.md`. Headline: the mainline driver is named **`rocket`** (not `rknpu`), and it already ships in boltzmann's kernel as a built module. The Phase-2 unblock is small (DT enable + IOMMU v1.0 mitigation + modprobe), not a driver port.
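Since the Phase-2 unblock ends with a modprobe, a quick check that the `rocket` module is loaded and that an accel node has appeared is a natural first sanity step. The sketch below assumes the usual `/proc/modules` format (module name in the first column) and the DRM-accel node naming `/dev/accel/accel<N>`. The function names and the report layout are hypothetical; this is not the project's `npu-probe` tool.

```python
import glob
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Return module names from the text of /proc/modules (first column)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def probe(module: str = "rocket",
          modules_path: str = "/proc/modules",
          accel_glob: str = "/dev/accel/accel*") -> dict:
    """Report whether `module` is loaded and which accel nodes exist.

    Assumption: the driver is built as a loadable module. A driver built
    into the kernel image would not be listed in /proc/modules.
    """
    try:
        text = Path(modules_path).read_text()
    except OSError:
        text = ""  # file missing or unreadable: treat as nothing loaded
    return {
        "module": module,
        "loaded": module in loaded_modules(text),
        "accel_nodes": sorted(glob.glob(accel_glob)),
    }


if __name__ == "__main__":
    print(probe())
```

An empty `accel_nodes` list with `loaded` true would suggest the module is present but did not bind to a device, which is consistent with the DT enable step still being outstanding.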