commit 24adc74812d1712fe33e53074186ae22635f00a9 Author: Markus Fritsche Date: Tue May 19 11:57:48 2026 +0000 Rosenblatt: project scaffold for RK3588 NPU on mainline Codename: Frank Rosenblatt — Mark I Perceptron 1958, the first hardware neural network. This project lights up the RK3588 NPU on mainline Linux so the OSS world finally owns the silicon-side of inference on that chip. Phase-1 scope: small LLM running CPU + NPU mix on boltzmann (Rock 5 ITX+). Backend: llama.cpp with a new rknpu ggml backend offloading INT8 GEMM (attention + FFN matmuls) to the NPU's tile-MAC array while leaving dequant / RoPE / softmax / sampling / embedding on A76 NEON. Target model: qwen2.5-1.5B-instruct Q4_K_M GGUF. Scaffold layout: README.md (frame + 9+1-phase plan), TODO.md (rolling punch-list), docs/{npu-mainline-status,architecture}.md, kernel/ for DT bindings + driver tweaks, userspace/{npu-probe,llm-runtime}/, fleet/boltzmann.yaml. Next: Phase-1 substrate audit — fill the TBDs in docs/npu-mainline-status.md with the actual state of Tomeu Vizoso's rknpu / DRM-accel work on the boltzmann-running kernel. diff --git a/README.md b/README.md new file mode 100644 index 0000000..730176c --- /dev/null +++ b/README.md @@ -0,0 +1,107 @@ +# Rosenblatt + +**Codename:** Frank Rosenblatt built the Mark I Perceptron in 1958 — the first +hardware neural network (400 photocells, stepper-motor-tunable analog weights). +This project lights up the RK3588 NPU on mainline Linux, so the OSS world +finally owns the silicon-side of inference on that chip. + +**Scope (Phase 1):** small LLM running CPU + NPU mix on `boltzmann` (Rock 5 +ITX+, RK3588, 32 GB DDR4). Backend: `llama.cpp` with a new `rknpu` device that +offloads the heavy GEMM (matmul in attention + FFN) to the NPU's INT8 path +while leaving dequant / RoPE / softmax / sampling / embedding lookup on the +A76 NEON cores. + +**Target model (Phase 1):** +`qwen2.5-1.5B-instruct` Q4_K_M GGUF. 
Fits in NPU's accessible memory +budget, has chat tuning, public license. Stretch: `qwen2.5-3B`, +`gemma3-2B`. + +**Out of scope (Phase 1, capture separately if pursued):** +- Vision helper (object detection / OCR / face-blur) — different op mix, + re-scope after Phase-1 numbers +- RKNPU vendor SDK adoption — we want mainline-clean, not vendor-blob +- Other Rockchip NPUs (RK3576 has the same NPU IP block — should port for + free once the RK3588 path lands, but defer until Phase-1 closes) + +**Not goal: parity with rknn-llm vendor stack on day 1.** Vendor has +hand-tuned tensor layouts + quantization; we'll be slower at first. Goal +is *credible* — defined as ≥1 tok/s sustained on qwen-1.5B Q4 with the +NPU actually doing the bulk of the GEMM work. The number itself isn't the +point; the open path to it is. + +--- + +## Phases (9 + 1 loop) + +| # | Phase | Deliverable | +|---|---|---| +| 1 | **Substrate** | Audit mainline NPU driver state (Tomeu Vizoso's rknpu / DRM-accel series); `/dev/accel/*` probe on boltzmann; running kernel + module inventory. `docs/npu-mainline-status.md` snapshot. | +| 2 | Formulate | Pick the exact matmul shape that fits the NPU's tile-MAC array. Identify the smallest-possible op-set llama.cpp can offload. | +| 3 | Analyze | Read the RKNPU2 SDK + Tomeu's rknpu uAPI to learn: register layout, DMA tensor format, INT8 quant scheme. Don't lift code — extract the spec. | +| 4 | Baseline | llama.cpp pure-CPU tok/s on boltzmann for qwen-1.5B Q4_K_M. Three runs, median. Reproducible bench script in `benchmarks/`. | +| 5 | Plan | rknpu backend interface design — where it plugs into ggml's compute graph; memory mapping strategy (dmabuf vs userptr); fallback path. | +| 6 | Review | Janet (ARM/DRM specialist agent) reviews the NPU register-write + DMA fence strategy. Cold-eyes pass. | +| 7 | Implement | rknpu ggml backend skeleton + first INT8 matmul. Bit-exact against CPU reference (Q4_K dequant + fp32 matmul). 
| +| 8 | Verify | Compare tok/s vs Phase-4 baseline. Profile: % time in NPU vs % in CPU vs % stalled on DMA. | +| 9 | Closing | Writeup at `dokuwiki.reauktion.de/doku.php?id=rosenblatt`. Benchmarks rendered. Send-to-upstream cover letter draft if quality is there. | +| 10 | Memory | `project_rosenblatt.md` in claude-memory: what worked, what to avoid for the next NPU campaign (RK3576 port). | + +Per `feedback_dev_process.md`: rewind to Phase 1 on blocker, Phase 4 on +direction change, Phase 0 on scope change. + +--- + +## Repo layout + +``` +rosenblatt/ +├── README.md this file +├── TODO.md rolling punch-list +├── docs/ +│ ├── npu-mainline-status.md Phase-1 audit +│ ├── architecture.md CPU+NPU split, ggml backend shape +│ └── phases.md per-phase log (analog to ~/src/bin/phases/) +├── kernel/ mainline-bound patches (DT bindings, rknpu driver tweaks) +├── userspace/ +│ ├── npu-probe/ smallest-possible "open device + run trivial matmul" sanity +│ └── llm-runtime/ llama.cpp fork with rknpu backend +├── fleet/ +│ └── boltzmann.yaml host manifest (kernel + NPU driver pin, baseline measurement) +└── benchmarks/ reproducible bench scripts + recorded results (JSON + plots) +``` + +--- + +## Host + +Primary: **boltzmann** (Rock 5 ITX+, RK3588, 32 GB DDR4-2666, NVMe rootfs). +- Already runs mainline ~v7.0 with most peripheral drivers working. +- Has the Quark UEFI / Neutron kernel stack — NPU is the next missing peripheral. +- Other RK3588 hosts (`ampere` = CoolPi GenBook) come later for port-validation. + +Why not `ampere`: laptop, intermittent power, in-use for other campaigns. +Boltzmann is always-on with 32 GB headroom — right substrate for kernel +hacking with serial-console fallback (when [Quark](https://git.reauktion.de/marfrit/quark) exposes one). + +--- + +## Codename rationale + +Rosenblatt's Mark I was custom analog hardware doing fixed-function matmul- +adjacent work (weighted-sum + threshold), with weights tunable per slot via +mechanical control. 
The RK3588 NPU is fixed-function INT8 matmul/conv hardware +with weights loaded per inference. Same shape, 67 years later, with the same +"how do we drive this thing from a general-purpose computer?" problem. The +1958 paper's answer was: build a control panel. The 2026 answer is: a DRM +accelerator driver + a userspace runtime that maps tensor ops to MMIO + DMA. +We're writing the second half. + +--- + +## Status + +| Phase | State | Date | +|---|---|---| +| 0 — bootstrap | done | 2026-05-19 | +| 1 — substrate audit | open | | +| 2..10 | pending | | diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..6d31b71 --- /dev/null +++ b/TODO.md @@ -0,0 +1,75 @@ +# TODO — Rosenblatt + +Rolling punch-list. Older items at bottom (move done → DONE.md when noisy). + +--- + +## Phase 1 — substrate audit + +- [ ] On boltzmann: `uname -r` → record in `fleet/boltzmann.yaml:kernel.running_version` +- [ ] `find / -path '*accel*' -name '*.ko' 2>/dev/null` — check if accel framework is built +- [ ] `ls /dev/accel/ /dev/dri/` — what's exposed? +- [ ] `lsmod | grep -iE 'rknpu|accel'` — what's loaded? +- [ ] `dmesg | grep -iE 'rknpu|npu|accel'` since boot — driver bringup log +- [ ] Tomeu's rknpu series — find on lore.kernel.org/dri-devel, capture latest + patch-set version + state (merged / in-review / dropped) → fill table in + `docs/npu-mainline-status.md` +- [ ] Check `drivers/accel/` in current torvalds tree — list in-tree + accelerators, confirm rknpu's mainline state +- [ ] Check DT bindings: `Documentation/devicetree/bindings/npu/rockchip,*.yaml` +- [ ] Inspect `arch/arm64/boot/dts/rockchip/rk3588.dtsi` for `npu` node +- [ ] If a userspace shim exists (rkneural?), capture repo URL + try + hello-world against the running kernel +- [ ] Spec-extract from BSP vendor `rockchip-npu` source — register map, + DMA descriptor format, irq handling. No code lift; spec only. 
+ +Phase exit criteria: `docs/npu-mainline-status.md` table fully populated; +clear answer to "do we drive via accel uAPI or write our own MMIO driver." + +--- + +## Phase 2 — formulate + +- [ ] List llama.cpp ops by wallclock %, profiling qwen-1.5B Q4_K_M on CPU + (use llama.cpp's built-in perf-timer or perf record) +- [ ] Pick the exact INT8 matmul tile size the NPU prefers (read from BSP source) +- [ ] Spec out the smallest backend interface: which ops we MUST handle, + which the framework falls back to CPU +- [ ] Write `docs/op-coverage.md` + +--- + +## Phase 3 — analyze + +- [ ] RKNPU2 SDK: trace through `librknnrt.so` user-API → kernel ioctl shapes + (objdump + strings, no actual reverse-engineering of vendor blob — just + the syscall surface) +- [ ] Tomeu's accel uAPI: read driver source, understand: + - submit-job ioctl shape + - dmabuf import path + - fence-wait mechanism + - error reporting +- [ ] BSP vendor `rockchip-npu` source: register layout, DMA descriptor + struct, irq handling sequence + +--- + +## Phase 4 — baseline + +- [ ] Build vanilla llama.cpp on boltzmann (mainline branch) +- [ ] Pull qwen2.5-1.5b-instruct Q4_K_M GGUF +- [ ] `llama-bench -m qwen2.5-1.5b -p 512 -n 128` × 3 runs +- [ ] Capture JSON to `benchmarks/$(date +%F)_boltzmann_qwen1.5b_cpu_baseline.json` +- [ ] Record into `fleet/boltzmann.yaml:baseline_measurement` + +--- + +## Cross-phase / standing items + +- [ ] Mirror Tomeu's WIP branch into a local clone for kernel hacking +- [ ] Set up serial console on boltzmann for kernel-panic recovery (Quark + umbrella; check current state) +- [ ] Add `project_rosenblatt.md` to claude-memory once Phase 1 closes (so + future sessions don't re-discover the campaign) +- [ ] Decide repo home: marfrit/rosenblatt on git.reauktion.de (probably yes, + after Phase-1 substrate is captured and the README isn't embarrassing) diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..d392328 --- /dev/null +++ 
b/docs/architecture.md @@ -0,0 +1,127 @@ +# Architecture — CPU + NPU mix for llama.cpp on RK3588 + +## The split + +llama.cpp's compute graph is built around ggml ops. We don't replace +llama.cpp's whole engine — we register a new **device backend** (in +ggml's `ggml-backend` abstraction) named `rknpu` and selectively offload +the ops that are worth the round-trip cost. + +### Goes to NPU (heavy, dense, INT8-friendly) + +- `MUL_MAT` (matrix-matrix multiply) — the workhorse, dominates wall + time. Both attention `Q @ K^T`, `attn @ V`, and FFN `up_proj`, + `down_proj`, `gate_proj` are this shape. +- `MUL_MAT_ID` (MoE-style mixture matmul) — when we eventually try a + mixture-of-experts model. Phase-1+ scope. + +### Stays on CPU (small, op-specific, or per-token) + +- Embedding lookup (`GET_ROWS`) — random-access gather, NPU has no + fast path +- `RMS_NORM` / `LAYER_NORM` — per-token reduction + element-wise +- `ROPE` — small, per-head, lots of trig +- `SOFT_MAX` — small, per-head +- Activations (`SILU`, `GELU`) — element-wise, cheap on NEON +- `SCALE`, `ADD`, `MUL` (element-wise) — cheap on NEON +- Sampling, KV cache update, tokenization — entirely host + +### KV cache: open question + +Two options: +1. **CPU-resident:** lives in normal Linux memory; NPU pulls + activations from CPU and pushes results back per layer. +2. **NPU-resident:** allocated in dmabuf, NPU reads K/V across layers + without round-trips. Cheaper per-layer but constrains model size + to NPU-accessible memory. + +Phase 5 (Plan) picks based on Phase-3 (Analyze) findings on the DMA +cost. + +--- + +## Memory mapping + +ggml's `ggml_backend_buffer_t` abstracts the buffer pool. We implement: +- `alloc_buffer(size)` → allocate a dmabuf of `size` bytes +- `free_buffer(buffer)` → release dmabuf +- `set_tensor` / `get_tensor` → CPU → NPU memcpy +- `cpy_tensor` → device-internal copy + +The dmabuf approach lets us share buffers between CPU producer (e.g. 
+embedding lookup output) and NPU consumer (matmul) without an extra +copy — `mmap` on CPU side, DMA-import on NPU side. + +If Tomeu's accel uAPI uses dmabuf natively, we follow that. If it +doesn't, we go through `/dev/dri/renderD*` with a thin shim. + +--- + +## Quantization strategy + +llama.cpp ships Q4_K_M as the default for ~2B models. Q4_K_M is a +4-bit weight quantization with per-group scale + min, no per-channel +scale. The NPU expects INT8 (or INT16) tensors with per-channel scale +factors. + +Two paths: +1. **Dequantize on CPU per-layer:** unpack Q4_K_M → INT8 right before + the matmul; ship INT8 to NPU. Adds a per-layer CPU pre-pass. +2. **Dequantize once at load time:** unpack the entire weight tensor + to INT8 + per-channel scales at model-load. Permanent ~1.6x memory + cost (Q4_K_M is ~5 bits/weight effective; INT8 is 8 bits/weight), + but no per-layer CPU work. + +Phase-1 choice: (2) — straightforward, makes the NPU path the only +thing happening at inference time, easier to profile. The memory cost +on 1.5B is ~1.5 GB INT8 weights vs ~900 MB Q4_K_M — boltzmann has +32 GB, this isn't the constraint. + +Phase-2+: revisit (1) if we go for larger models or if INT8 turns out +to be quality-loss-meaningful on the small ones. + +--- + +## Backend interface — concrete + +Mirroring ggml's existing `ggml-cuda.cu` / `ggml-metal.m` shape, we add: + +``` +ggml-rknpu.h — public API: ggml_backend_rknpu_init() etc. +ggml-rknpu.c — backend implementation: device registration, op + dispatch table, memory management +ggml-rknpu-ops.c — per-op kernels: matmul tiled to NPU's preferred + shape, INT8 quant pre-pass +``` + +In `llama.cpp/ggml/src/ggml-rknpu/`. Out-of-tree initially; if +upstream-acceptable, send for review after Phase 8. + +--- + +## Failure handling + +llama.cpp's backend abstraction already supports falling back to CPU +on a per-op basis — `ggml_backend_dev_supports_op()`.
We declare +`MUL_MAT` supported for INT8 / FP16 inputs with the right shape +constraints, and let the framework route everything else to CPU. + +If the NPU driver returns an error mid-inference (timeout, DMA +fence wait fail, etc.), the strategy is **abort the inference, log, +return error to caller**. We don't try to silently fall back to CPU +mid-stream because the state would be corrupted (NPU may have +partially written to the dmabuf). + +--- + +## Phase-1 milestone + +A `npu-probe` userspace binary that: +1. Opens the NPU device (whatever the mainline path is — likely + `/dev/accel/accelN` or `/dev/dri/renderD*`) +2. Allocates two small INT8 input tensors + one output (e.g. 64x64) +3. Submits a matmul via the uAPI +4. Waits, reads back, compares to a CPU reference + +This proves the substrate is alive before we touch llama.cpp. If it +doesn't work, we're back in kernel-driver land, not llama.cpp land. diff --git a/docs/npu-mainline-status.md b/docs/npu-mainline-status.md new file mode 100644 index 0000000..750a313 --- /dev/null +++ b/docs/npu-mainline-status.md @@ -0,0 +1,125 @@ +# RK3588 NPU on mainline — status audit (Phase 1) + +> **Phase 1 deliverable.** This file is the canonical snapshot of where +> mainline kernel + DRM/accel + userspace stand for the RK3588 NPU as of +> the audit date. Re-audit on each kernel-version bump or when a +> blocker hits. + +**Audit date:** _(pending — first Phase-1 task)_ +**Kernel under audit:** _(boltzmann's running kernel, capture with `uname -r`)_ + +--- + +## Hardware quick-spec + +- **NPU:** Rockchip "RKNPU2" — three independent NPU cores, each ~2 TOPS + INT8 (combined 6 TOPS). Also supports INT16, FP16, and BF16 (with + reduced throughput). Per-core local memory (~2 MB SRAM each). +- **Bus:** AXI interconnect, dedicated DMA engine, integrated power + domain (`pd_npu`). +- **TRM section:** Chapter 23 ("NPU") in the RK3588 TRM v1.0 — local + copy referenced in `reference_rk3588_trm` memory. 
+- **Vendor docs source-of-truth:** + https://github.com/airockchip/rknn-toolkit2 (SDK + ops list + + quantization scheme). + +--- + +## Upstream work in flight + +To audit on Phase 1: + +1. **Tomeu Vizoso's rknpu / DRM-accel series** + - Branch: search lore.kernel.org for "rknpu" + "Tomeu Vizoso" in + `linux-rockchip` / `dri-devel`. Check `linux-rockchip` ml cutoff + date per `reference_linux_rockchip_ml` memory. + - State to capture: which patches merged, which are review-pending, + which dropped on the floor. +2. **`drivers/accel/` mainline state** + - Check `drivers/accel/` in current torvalds tree — list of in-tree + accelerators (habanalabs, ivpu, qaic, etc.). Note if rknpu is + there yet. +3. **DT bindings** + - `Documentation/devicetree/bindings/npu/rockchip,rk3588-npu.yaml` + — does it exist in mainline? +4. **Userspace bridge** + - Tomeu has a userspace shim ("rkneural"?) for the DRM-accel uAPI. + Capture repo URL + current state. + +Fill the table below during Phase-1 audit: + +| Component | Mainline state | URL | Notes | +|---|---|---|---| +| `drivers/accel/rknpu/` | _TBD_ | | | +| DT bindings `rockchip,rk3588-npu.yaml` | _TBD_ | | | +| `pd_npu` power-domain in `rk3588.dtsi` | _TBD_ | | | +| `npu` node in `rk3588.dtsi` | _TBD_ | | | +| `npu` node in `rk3588-rock-5-itx-plus.dts` (boltzmann) | _TBD_ | | | +| userspace uAPI consumer | _TBD_ | | | +| Tomeu's WIP branch | _TBD_ | | | + +--- + +## Vendor stack (for spec extraction only — don't lift code) + +- **RKNPU2 SDK:** `airockchip/rknn-toolkit2`. Provides: + - `librknnrt.so` runtime (closed, vendor binary) + - `rknn-convert` ONNX → `.rknn` transcoder + - rknn-llm fork (rknn-llm) for LLM-specific quant + scheduling +- **Kernel module:** vendor `rockchip-npu` (BSP). Not mainline-compatible + (uses Rockchip's iommu/dma shim). Spec-extract per `feedback_megabitchip_semantic_match` — + read the source, don't link the binary. 
+ +Operational rule: **mainline-clean from day 1.** No vendor blob runtime, +no vendor-kernel-only module. If we need to copy a register-layout table +from BSP source to get started, that's fine; copying a `.so` is not. + +--- + +## Hardware capability bounds + +What the NPU CAN do (Phase-1-relevant): +- INT8 GEMM up to ~2 TOPS/core × 3 cores = 6 TOPS peak (theoretical) +- INT8 tensors with per-channel quantization scales +- 3D conv (mostly unused in LLM workloads) +- LayerNorm + activation fusion in some configs + +What the NPU CANNOT do that an LLM needs (so stays on CPU): +- fp32 / fp16 mixed-precision (NPU is INT8/INT16-first; some FP16 but + with throughput penalty) +- Sampling (multinomial, top-k, top-p) — pure CPU +- Sparse attention — no sparse-tile support in the NPU op set +- Embedding lookup (rare access pattern; gather not in NPU op set) +- RoPE, softmax, RMSNorm — these run on CPU because the NPU's + fixed-function pipeline doesn't have these as first-class ops + without going through a generic matmul shoehorn + +Phase-2 question: which of those CAN move to NPU later, with op-fusion or +custom kernels. + +--- + +## Throughput sanity check (theoretical only — not measured) + +qwen2.5-1.5B Q4_K_M: +- ~1.5e9 weights, attention + FFN dominated by matmul +- Per-token: roughly 2 × 1.5e9 = 3 GOP (forward pass; one MAC per weight, counted as 2 ops to match the TOPS rating) +- INT8 path: 3 GOP / 6 TOPS = 0.5 ms/token compute-bound +- → ~2000 tok/s if everything fit + zero overhead, ABSOLUTELY UNREALISTIC + +Realistic upper bound on first cut: probably 5-15 tok/s. Memory +bandwidth + DMA setup + CPU side will dominate. Phase-4 baseline pulls +the actual CPU-only number; Phase-8 measures how close we get with NPU +in the loop.
**The goal is "credible CPU+NPU mix that's faster than +pure CPU," not "saturate the 6 TOPS rating."** + +--- + +## What this audit unblocks + +After Phase 1 completes (the table above is filled), Phase 2 can pick: +- whether to drive the NPU via Tomeu's accel uAPI (if it's far enough + along) or write directly against the MMIO regs (if not) +- which kernel branch we baseline against (mainline-rc vs Tomeu's WIP) +- which userspace shim we use (Tomeu's, write our own, or fork + llama.cpp's CUDA-style backend pattern for ggml) diff --git a/fleet/boltzmann.yaml b/fleet/boltzmann.yaml new file mode 100644 index 0000000..3a33be4 --- /dev/null +++ b/fleet/boltzmann.yaml @@ -0,0 +1,55 @@ +# rosenblatt fleet manifest — boltzmann (Rock 5 ITX+, RK3588) +# +# Phase-1 audit host. Always-on, 32 GB DDR4, NVMe rootfs. NPU silicon +# present + accessible via Rockchip-BSP vendor module today; mainline +# path TBD (see docs/npu-mainline-status.md). + +host: boltzmann +arch: arm64 +soc: rockchip/rk3588 +board: rock-5-itx-plus +distro: archlinuxarm # ALARM aarch64; boltzmann is the umbrella RK3588 host +role: primary-development # not yet primary-target (laptop targets land later) + +hardware: + cpu: 4×Cortex-A76 (2.4 GHz) + 4×Cortex-A55 (1.8 GHz) + ram: 32 GB DDR4-2666 + storage: NVMe (rootfs) + microSD (recovery) + npu: + cores: 3 + tops_int8_per_core: 2 # ~2 TOPS INT8 per core, 6 TOPS aggregate (theoretical peak) + local_sram_per_core_mib: 2 + power_domain: pd_npu + +# Phase-1 audit fills these (pending boltzmann inspection) +kernel: + running_version: TBD # uname -r snapshot at audit time + source: TBD # mainline torvalds / mmind-rockchip / custom + npu_driver: TBD # vendor rockchip-npu / mainline rknpu / none + +userspace: + rknn_vendor_runtime_installed: false # commitment: stay mainline-clean + llama_cpp_installed: TBD # via marfrit-packages or built-from-source + +baseline_measurement: + pending: true + target: | + llama.cpp pure-CPU tok/s on qwen2.5-1.5b-instruct-q4_k_m.gguf, 
+ 3 runs, median wallclock. Use llama-bench from llama.cpp/build/bin. + ground_truth_file: benchmarks/2026-XX-XX_boltzmann_qwen1.5b_cpu_baseline.json + +bringup_sequence: + 1: substrate audit (docs/npu-mainline-status.md table filled) + 2: npu-probe runs successfully (open device → 64×64 INT8 matmul → bit-match CPU ref) + 3: llama.cpp pure-CPU baseline captured + 4: rknpu ggml backend skeleton compiles + 5: first llama.cpp matmul offload working on a single layer + 6: full forward pass via NPU for one decode step + 7: tok/s vs baseline measured + +backup_host: ampere # CoolPi GenBook — port-validation target. Phase-2+ scope. + +reverse_dependencies: + - Quark (boltzmann UEFI) — must stay bootable across kernel-rev experiments + - Neutron (boltzmann kernel build) — provides the kernel we tweak for rknpu + - Volta (boltzmann umbrella) — Rosenblatt is the third Volta-child after Quark + Neutron diff --git a/kernel/README.md b/kernel/README.md new file mode 100644 index 0000000..0625269 --- /dev/null +++ b/kernel/README.md @@ -0,0 +1,24 @@ +# kernel/ + +Mainline-bound kernel patches: DT bindings, rknpu driver tweaks, +power-domain wiring for boltzmann's `rk3588-rock-5-itx-plus.dts`. + +Empty until Phase 1 audit identifies what's actually missing in mainline. + +If Tomeu Vizoso's rknpu series is far enough along to use as-is, this +directory may stay nearly empty — we'd just carry a small DT-overlay +patch for boltzmann's board file. + +If we end up writing a driver from scratch (worst case), the structure +will mirror an upstream submission layout: +``` +0001-dt-bindings-npu-add-rockchip-rk3588-npu.patch +0002-arm64-dts-rockchip-rk3588-add-npu-node.patch +0003-arm64-dts-rockchip-rock-5-itx-plus-enable-npu.patch +0004-accel-rknpu-add-rockchip-rk3588-driver.patch +... +``` + +Tracking-wise these go through `marfrit/kernel-agent` as scope +`patches/driver/accel/rknpu/`, with `fleet/boltzmann-rosenblatt.yaml` +as the consuming manifest. 
diff --git a/userspace/llm-runtime/README.md b/userspace/llm-runtime/README.md new file mode 100644 index 0000000..a61c23f --- /dev/null +++ b/userspace/llm-runtime/README.md @@ -0,0 +1,32 @@ +# llm-runtime + +llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend. + +Code lands here starting at Phase 5 (Plan) — too early in Phase 1. + +Until then, this directory holds: +- design notes (`docs/architecture.md` from project root is authoritative) +- the eventual `ggml-rknpu/` backend source +- patch series for upstream submission if quality reaches that bar + +## Approach + +Two paths to consider in Phase 5: + +1. **Fork llama.cpp, add backend in tree.** Easier to keep in sync; + harder to upstream because llama.cpp may not want a Rockchip-specific + backend that depends on a still-WIP mainline driver. +2. **Out-of-tree backend, load via llama.cpp's plugin API + (`-DGGML_BACKEND_DL=ON`).** Cleaner separation; tracks llama.cpp + upstream without our diff being in the way. Recommended unless we + need to patch core llama.cpp logic. + +Decision deferred to Phase 5. + +## Model + +Phase-1 target: `qwen2.5-1.5b-instruct-q4_k_m.gguf`. Source: +hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF or built locally with +llama-quantize. + +Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B. diff --git a/userspace/npu-probe/README.md b/userspace/npu-probe/README.md new file mode 100644 index 0000000..df5e094 --- /dev/null +++ b/userspace/npu-probe/README.md @@ -0,0 +1,33 @@ +# npu-probe + +Smallest-possible userspace binary that: +1. Opens the NPU device (path TBD per Phase-1 audit) +2. Allocates two INT8 input tensors (64×64) + one output (64×64) +3. Submits a matmul via the uAPI in use (Tomeu's accel ioctl OR our own + shim around vendor MMIO if accel-mainline isn't ready) +4. Waits for completion (DMA fence or polled completion register) +5. Reads back the output +6. 
Compares to a CPU INT8 matmul reference; reports pass/fail + +**Phase-1 deliverable.** Until this works, nothing else in this repo +can be exercised against real silicon. + +## Build + +_(filled when Phase-1 audit picks the uAPI shape — `meson` or `cmake`, +no autotools)_ + +## Run + +``` +./npu-probe # default 64×64 INT8 matmul +./npu-probe --shape 128,128,128 # M,N,K override +./npu-probe --device /dev/accel/accel0 # override device path +./npu-probe --golden golden_64x64.bin # provide expected output for diff +``` + +## Why C, not Python + +Direct ioctl + dmabuf + mmap. Python wrapper layer would obscure the +exact syscall sequence we need to understand. Once npu-probe works, +a Python binding for benchmark scripts is fine.
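
## CPU reference sketch (illustrative)

The CPU reference from step 6 of the list at the top of this README could look roughly like the sketch below. It is a minimal illustration, not the final npu-probe code: the function names `ref_matmul_s8` and `count_mismatches` are hypothetical, and it assumes row-major tensors with an INT32 accumulator as output. Whether the NPU hands back raw INT32 accumulators or requantized INT8 is still open until the Phase-1 audit, so the comparison step may need a requantize pass in front of it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical CPU reference: C[m x n] = A[m x k] * B[k x n].
 * All matrices are row-major. Inputs are signed 8-bit, the result is
 * accumulated in 32 bits. For the 64x64x64 default the worst-case sum
 * is 64 * 128 * 128 = 1,048,576, far below INT32_MAX; very large k
 * would need a wider accumulator.
 */
static void ref_matmul_s8(const int8_t *a, const int8_t *b, int32_t *c,
                          size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/*
 * Bit-exact comparison: returns the number of elements that differ.
 * Zero means the device output matches the CPU reference (pass).
 */
static size_t count_mismatches(const int32_t *got, const int32_t *want,
                               size_t len)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; i++) {
        if (got[i] != want[i])
            bad++;
    }
    return bad;
}
```

In the probe, `got` would be the buffer read back from the device in step 5 and `want` the output of `ref_matmul_s8` on the same inputs; a non-zero mismatch count is the fail case.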