Path B pivot + Phase 0-3 closed with first baseline numbers

This is a from-scratch initial commit on a fresh .git. The original scaffold commit (7510b56) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00
commit dcbbc77038
22 changed files with 9030 additions and 0 deletions
@@ -0,0 +1,96 @@
+---
+name: Claude-Assisted Development Process (9(+1)-phase loop)
+description: Default workflow for any non-trivial implementation — substrate/motivation/inventory, formulate, analyze, baseline, plan, second-model review, implement, verify, closing (package+ship), memory-update; with explicit loopback edges
+type: feedback
+originSessionId: 83898ac9-e61f-4c44-8429-0154cb12d124
+---
+Markus's standardized loop for our implementation work. Apply by default whenever a task is bigger than a one-liner. Skipping phases is a deliberate choice that should be flagged, not a default.
+
+## Phase 0 — Substrate / Motivation / Inventory
+
+Pre-formulation. Lock the research question and assemble the substrate *before* Phase 1 commits to a measurable goal. Output: a `phase0_findings.md` artifact that future phases can refer back to without re-deriving.
+
+- **Research question + mechanism captured.** State the question in one sentence. Capture any operator-supplied mechanism (the "why this question, how does it work" insight) verbatim — it's the load-bearing claim Phase 1 binds against.
+- **Predecessor carry-over: state vs data.** When a campaign succeeds another, categorize what transfers. *State* (installed packages, governor settings, system tweaks, source-read file:line pointers, protocol designs, parser scripts) carries forward. *Data* (drop counts, perf percentages, threshold values, baseline floors) does not — it is reference history only. Binding cells in this campaign anchor to in-session-acquired numbers, even if the predecessor measured an identical condition.
+- **Tooling and measurement-instrument inventory.** What's installed, what would need installing, what extensions/protocols the live system actually supports. Live verification, not paper compatibility.
+- **In-session baseline anchor.** Re-run the reference rep — N=3 minimum if the baseline is load-bearing for the campaign's premise — *before* any instrument changes. **If the predecessor's reference floor doesn't replicate at N=3 in the same session, that is the campaign result.** Don't build multi-phase infrastructure on an N=1 historical floor. See `feedback_replicate_baseline_first.md`.
+- **Open questions tabled.** What's not known going into Phase 1. Phase 1 locks against the knowns; Phase 0 surfaces the unknowns explicitly so they don't slip into binding cells unverified.
+
+## Phase 1 — Goal Formulation
+Define the objective in measurable terms. State what success looks like *before* touching anything. The chosen metric is a **hypothesis** about what to measure, not an axiom — Phase 3 may invalidate it.
+
+## Phase 2 — Situation Analysis
+Document current state. Identify constraints, dependencies, known failure modes. **Reset context here** — do not carry assumptions from prior sessions; re-read CLAUDE.md, relevant memory files, run `git status`, re-verify reachability.
+
+## Phase 3 — Baseline Measurements
+Take concrete measurements *before* any changes. Paste raw output into DokuWiki at capture time — verbatim, not paraphrased. The Phase 5 artifact is the raw data, not Claude's summary.
+
+**Real data, not theatre.** Phase 3 exists to use AI capacity for absorbing wide, low-level instrumentation a human reader would skim past. Attaching strace / perf / ftrace / eBPF / custom tripwires to the process under test is real Phase 3; scraping mpv's stdout dropped-frame counter is not. Discriminator: if a human with bash and grep could produce the same baseline, it isn't Phase 3 yet — go down to the syscall / call-path / MMIO / register layer. See `feedback_phase3_no_theatre.md`.
+
+**Anti-fabrication:**
+- Every cited value traces to a visible tool invocation or verbatim paste-in. If a measurement wasn't taken, write "not measured" — never an estimate, inference, or recall from training / prior sessions / sibling-host memory.
+- Raw before derived. A derived number (FPS, p99, error rate) appears alongside the raw stream it came from, never alone.
+- Rig failure is the finding. Empty strace, dead UART, perf counter that didn't increment → that *is* the Phase 3 result. Loop back to Phase 2 to fix the rig; do not synthesize plausible-looking baseline data to keep momentum.
+
+- **If baseline reveals the Phase 1 metric was tracking the wrong thing → loop back to Phase 1** with the corrected target. (Example: "max H.264 FPS" Phase 1 metric, but baseline shows DMA-setup + sync overhead dwarfs decode → real metric is bytes-copied-per-second / EGL surface-import time, not FPS.)
+
+**Measurements describe what the system *does*, not what it *should do*.** Baseline data is evidence, not a specification. Do NOT derive API call sequences, struct layouts, or parameter values from observed behaviour (strace, perf, example output). Observable behaviour may reflect bugs, workarounds, or implementation accidents — anything you copy from it inherits those.
+
+## Phase 4 — Plan
+Formulate the approach. Identify what will and will not be touched. State expected outcome of implementation in the *same* measurable terms used in Phase 1/3.
+
+## Phase 5 — Second Model Review
+Goal, situation, measurements, plan get pasted into **DokuWiki**. Markus reviews and redacts, then initiates the handover to a fresh model instance. **Claude does not curate the artifact going to the reviewer** — that would re-introduce the blind-spot accumulation the review is meant to escape. Do not summarize when handing over; paste the actual artifacts.
+
+## Phase 6 — Implementation
+Execute the plan. Scope strictly to what was planned — resist feature creep, refactor-creep, "while I'm here" cleanups, and over-eager scope expansion. If a plan revision is needed mid-implementation, surface it explicitly and re-enter Phase 4.
+
+**Contract before code.** Before writing or modifying any call site:
+- Read the API contract — kernel docs, header comments, and upstream source for every call touched.
+- State the contract explicitly before implementing against it (in the plan, the commit message, or a comment — somewhere reviewable).
+- If the contract cannot be found: stop and surface the gap. Don't infer it from baseline behaviour or sibling code.
+
+**Copying from baseline measurements is not implementation. It is transcription of potentially broken behaviour.** A deliverable that matches baseline bytes but violates the API contract is not a deliverable — it is a deferred bug.
+
+### What "state the contract explicitly" looks like
+
+Worked example: `0012-h264-omit-scaling-matrix-frame-based.patch` in `~/src/ohm_gl_fix/phase6/step1/`. The commit message opens with the contract before any code:
+
+> VAAPI signals "explicit scaling lists are present in the bitstream" implicitly: the consumer (ffmpeg-vaapi, mpv, etc.) sends a `VAIQMatrixBufferH264` alongside `RenderPicture` iff `sps_scaling_matrix_present_flag || pps_scaling_matrix_present_flag`. When the bitstream uses default (flat) scaling, no IQMatrixBuffer arrives […]
+>
+> Earlier draft of this patch unconditionally omitted SCALING_MATRIX in FRAME_BASED. That's **corpus-correct** (bbb has no explicit scaling lists) but the **wrong predicate**: the kernel-side gating is by "matrix-supplied vs. not," not by decode mode. […]
+>
+> Contract verification (audit_0008_decode_params_2026-05-01.md + hantro_h264.c::assemble_scaling_list): the kernel uses the supplied matrix when SCALING_MATRIX is in the control batch and falls back to spec-defined defaults when absent. Mode-independent.
+
+What this gets right:
+- **Contract first**: per-control rules cited from kernel doc (`ext-ctrls-codec-stateless.rst:752`), kernel driver (`hantro_h264.c::assemble_scaling_list`), and sibling implementation (gst-plugins-bad commit 9e3e775) — *before* any patch hunks.
+- **Corpus-correct ≠ spec-correct, called out by name**: the rejected predicate ("omit SCALING_MATRIX in FRAME_BASED") *did* match the BBB baseline. It still got rejected, because the contract said the gate is "matrix-supplied vs. not," not "decode mode." This is exactly the Phase 3-derived-implementation trap.
+- **Then** the diff implements one branch per contract clause: SPS/PPS/DECODE_PARAMS always, SCALING_MATRIX iff `matrix_set`, SLICE_PARAMS iff SLICE_BASED, PRED_WEIGHTS iff SLICE_BASED + `V4L2_H264_CTRL_PRED_WEIGHTS_REQUIRED`.
+
+Mirror format anywhere reviewable: PR description, commit message body, plan section, or a header comment block. The shape is "contract clauses with citations → code that maps 1:1 to those clauses."
+
+## Phase 7 — Verification Measurements
+Repeat measurements from Phase 3. Compare explicitly against baseline.
+- **If the delta does not match Phase 4's prediction → loop back to Phase 4** (re-plan). Do not declare success when the numbers say otherwise; an unexplained delta is a finding, not a footnote.
+
+## Phase 8 — Closing (Package & Ship)
+Ship the deliverable to its consumption point. Working code that lives only in a checkout is half a deliverable — the next session has to re-discover it, the fleet doesn't get the fix, and the loop's value evaporates.
+
+- **Kernel patch → kernel-agent package.** Route through the kernel-agent flow (`fleet/<host>.yaml` + scope-tagged patches) so the kernel package gets properly built, signed, and published. Don't leave loose `.patch` files in a working tree. See `project_kernel_agent.md` for the manifest shape; `linux-ampere-fourier` and `linux-fresnel-fourier` are the canonical examples.
+- **Program / library change → marfrit-packages.** Add or update a PKGBUILD (Arch/ALARM) or debian/ tree (deb), push to `git.reauktion.de/marfrit/marfrit-packages`, and let `.gitea/workflows/build.yml` produce + sign + publish to `packages.reauktion.de`. See `project_marfrit_packages.md`. Local-only fixes go upstream as PR-quality diffs into the same overlay.
+- **Skipping is a deliberate choice.** If the change is one-shot scratch work (debugging tripwire, throw-away script), say so explicitly in the closing note. The default is: it gets packaged.
+- **Re-verify on the deploy host with the packaged artifact.** A clean Phase 7 result from a hand-rolled dev build (e.g. `meson -Dbuildtype=release && ninja`) is **not** the same as the `.pkg.tar.zst` / `.deb` that the deploy host installs. Distro packaging flags (Arch makepkg's `-O2 + FORTIFY + stack-protector-strong + stack-clash-protection` vs meson's `-O3 -DNDEBUG`, debhelper's hardening defaults, lto toggles) vectorise / unroll loops differently and routinely unmask latent UB the dev build folded away. Pull the published package down via the package manager and re-run the Phase 7 success criterion against it before closing — until that PASSes, the loop is not done. See `feedback_package_build_flags_unmask_bugs.md` for the iter39 incident that codified this.
+
+## Phase 9 — Memory Update
+Loop terminates here. Distill the lesson into a memory entry — what was the mistake the loop caught, what's the rule that would shorten the next cycle. Do not let the lesson rot in chat history.
+
+---
+
+## Loopback edges (summary)
+- Phase 3 → Phase 1 (metric was wrong)
+- Phase 7 → Phase 4 (plan didn't deliver predicted delta)
+- Any phase → Phase 0 (substrate was wrong: predecessor baseline didn't replicate, mechanism doesn't engage on this stack, or the data inverts the premise → re-anchor or honest close)
+- Phase 9 closes the loop
+
+## Why this exists
+Several recurring failures in prior work codify into individual rules — observer-first, simulate-before-flash, three-strikes-then-verify, "trust eyes not vibes," scope-strictly-to-plan, no-fake-dry-run. Those are all symptoms; this loop is the structural fix. Use it as the spine and let those rules show up as rejection patterns inside the appropriate phases.