Issues 001+002: defer LPF wd=16 + LPF vertical variants

Per user direction at cycle-4 close: file wd=16 (trend prediction validation) and vertical variants (column-stride TMU behaviour unknown) as local issues for future cycles. Progress instead to CDEF (AV1) for codec breadth.

docs/issues/001: wd=16 prediction validation. Per cycle 4 lesson 4, trend says wd=16 likely flips M4 negative. Quick incremental cycle when revisited.

docs/issues/002: vertical variants. Different memory access pattern (column-strided vs row-strided). The load-bearing unknown is whether the cycle 2 +6.9% mixed gain survives the TMU coalescing shift. If positive, deployment recipe gains symmetry; if negative, must split by orientation.

Both issues have acceptance criteria + expected outcomes documented.

Cycle 5 next: CDEF (AV1), a codec-breadth expansion.

No Gitea repo exists for daedalus-fourier yet (project is local-only). If a tracker is wanted, create the repo and migrate these .md files. For now they live in-tree as part of the project history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:09:51 +00:00
parent 85feba4087
commit 20e3d004ae
2 changed files with 153 additions and 0 deletions
@@ -0,0 +1,71 @@
+# Issue 001 — VP9 LPF wd=16 cycle (prediction validation)
+
+**Status**: open, not blocking
+**Type**: kernel-cycle (cycle 5 candidate)
+**Predicted verdict**: RED (M4 likely negative, per cycle 4 lesson 4)
+**Priority**: low (incremental; trend prediction)
+**Filed**: 2026-05-18
+
+## Background
+
+Cycle 4 (LPF wd=8) closed PASS with an M4 delta of +4.1 %, down from
+the +6.9 % of cycle 2 (wd=4). The downward trend prompted the Phase 9
+lesson: "wd=16 would probably show further R degradation; M4 may flip
+negative based on the trend line." See
+`docs/k4_lpf8_phase4_7.md §"Phase 9 lessons"`.
+
+This issue tracks the experiment to validate (or invalidate) that
+prediction.
+
+## What to do
+
+Cycle 5 LPF wd=16, mirroring cycle 4's compact structure:
+
+1. **Phase 3**: build `tests/bench_neon_lpf16.c` modelled on
+   `bench_neon_lpf8.c`. NEON symbol: `ff_vp9_loop_filter_h_16_16_neon`
+   (already in vendored `vp9lpf_neon.S`). Capture M3.
+2. **Phase 4-7**: write `src/v3d_lpf_h_16_16.comp` extending the
+   wd=8 kernel with the wd=16 outer-flat path (`flat8out` test, 14
+   writes per row when both flat8out and flat8in pass). New
+   contract: `dst_stride_u8 ≥ 14` (vs cycle 4's ≥ 6) because the
+   flat8out path writes at `base-7..base+6` (14 contiguous bytes).
+3. **Phase 5 review**: mandatory — wd=16 is not as incremental as
+   wd=8 (much larger conditional logic, new contract bound).
+4. **Phase 7**: measure M2, R; if M4 negative as predicted, document
+   trend confirmation and close kernel as "CPU-only" in deployment
+   recipe.
+
+## Expected outcome (per prediction)
+
+| Quantity | Predicted |
+|---|---|
+| M1 bit-exact | 100 % (same pattern as cycles 2/4) |
+| M3 NEON | ~55 Medge/s (slightly faster than wd=8) |
+| M2 QPU isolation | ~12-15 Medge/s |
+| R isolation | 0.22-0.27 (ORANGE, downward) |
+| M4 mixed vs NEON-4 | -2 % to +1 % (borderline; likely negative) |
+| 30fps margin | still 5×+ (user-facing PASS regardless) |
+
+## Acceptance criteria (issue closed when)
+
+- Cycle 5 phases 1-7 complete, committed
+- `docs/k5_lpf16_phase*.md` produced
+- Phase 7 verdict documented, deployment recipe updated either way
+- Phase 9 lesson 4 trend prediction validated or refuted
+
+## Why deferred (not done in current session)
+
+The session goal was "continue until user intervention necessary."
+The user directed that this be filed as an issue and that work
+progress to cycle 5 (CDEF) instead. The trend prediction is
+interesting, but the project's deployment recipe is already locked
+through cycle 4; a wd=16 result would update at most one row of the
+recipe table.
+
+## Related
+
+- `docs/k4_lpf8_phase4_7.md §"Phase 9 lessons"` lesson 4 (the
+  prediction this validates)
+- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`
+  (NEON ref already vendored — symbol `ff_vp9_loop_filter_h_16_16_neon`)
+- `docs/k2_deblock_phase4.md` (cycle 2 template)
+- `docs/k4_lpf8_phase4_7.md` (cycle 4 template, the most direct
+  reference)
@@ -0,0 +1,82 @@
+# Issue 002 — VP9 LPF vertical variants (v_4_8 / v_8_8)
+
+**Status**: open, not blocking
+**Type**: kernel-cycle (cycle 5/6 candidate)
+**Predicted verdict**: similar to horizontal cousins (k2/k4 = YELLOW PASS)
+**Priority**: low (different memory pattern; completeness)
+**Filed**: 2026-05-18
+
+## Background
+
+Cycles 2 and 4 implemented the **horizontal-direction** LPF inner
+filters (`h_4_8`, `h_8_8`). The corresponding **vertical-direction**
+filters (`v_4_8`, `v_8_8`) have the same arithmetic but a different
+memory access pattern: each edge position's 8-pixel neighbourhood is
+read down a column (one pixel per row, `stride` bytes apart) rather
+than as 8 contiguous pixels along a row.
+
+Concretely from `vp9dsp_template.c`:
+- `h_*_*_neon`: stridea=stride, strideb=1 (advance rows, neighborhood in cols)
+- `v_*_*_neon`: stridea=1, strideb=stride (advance cols, neighborhood in rows)
+
+The vertical variant tests whether the QPU's "8 lanes per row,
+contiguous read" assumption (cycles 2/4, wd=4/wd=8) generalises to
+the strided memory pattern. The TMU's coalescing behaviour may
+differ significantly when 8 lanes load from 8 different rows of the
+same column (each load likely landing in a different cache line)
+rather than from 8 adjacent columns of the same row (sequential).
+
+## What to do
+
+Cycle 5 or 6 (after CDEF), one cycle per variant:
+
+1. **v_4_8**: vertical 4-tap inner, 8-pixel edge (the filter runs
+   vertically, spanning rows above/below a horizontal edge).
+2. Optional **v_8_8** — vertical 8-tap inner.
+
+Each cycle has the same shape as cycles 2/4, except:
+- C reference: same `loop_filter` function, instantiated via
+  `lf_8_fn(v, 4, 1, stride)` (note: `stridea` and `strideb` are
+  swapped relative to the h variant).
+- NEON: `ff_vp9_loop_filter_v_4_8_neon` (in vendored `vp9lpf_neon.S`).
+- QPU geometry: same 32-edges/WG, but the per-edge memory access
+  shape changes: lanes now span 8 rows (`stride` bytes apart) of one
+  column.
+
+## Key question to answer
+
+**Does the QPU's mixed-mode +6.9 % win (cycle 2 wd=4 horizontal)
+hold for the vertical variant?** The TMU latency / cache behaviour
+on column-strided reads is the main unknown. If positive: deployment
+recipe gains v variants symmetrically. If negative: deployment
+recipe needs to split by orientation (h on QPU, v on CPU).
+
+## Expected outcome
+
+| Quantity | Predicted |
+|---|---|
+| M1 bit-exact | 100 % |
+| M3 NEON | similar to h (NEON handles both orientations well) |
+| M2 QPU isolation | possibly LOWER than h variant (TMU column reads less coalesced) |
+| R isolation | 0.30-0.45 (ORANGE) |
+| M4 mixed | UNKNOWN — this is the load-bearing experiment |
+
+## Acceptance criteria
+
+- v_4_8 cycle phases 1-7 complete, with an M4 measurement
+- Decision: "v variants → QPU same as h" OR "v variants → CPU only"
+- Deployment recipe updated
+- Optional: v_8_8 follow-on cycle if v_4_8 was positive
+
+## Why deferred
+
+- Out of cycle 4's compressed scope (cycle 4 was a focused
+  wd=4 → wd=8 extension)
+- User-stated cycle 5 direction was CDEF (AV1 coverage), not VP9
+  variant completeness
+
+## Related
+
+- `docs/k2_deblock_phase4.md §"3. Workgroup geometry"` discusses
+  the 32-edges-per-WG mapping that needs revisiting for v variant
+- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S` —
+  NEON refs already vendored for both v_4_8 and v_8_8
+- `phase0.md §2` device profile — TMU read patterns relevant for
+  the column-strided question