diff --git a/docs/issues/001-lpf-wd-16-prediction-validation.md b/docs/issues/001-lpf-wd-16-prediction-validation.md
new file mode 100644
index 0000000..61c4b2b
--- /dev/null
+++ b/docs/issues/001-lpf-wd-16-prediction-validation.md
@@ -0,0 +1,71 @@
+# Issue 001 — VP9 LPF wd=16 cycle (prediction validation)
+
+**Status**: open, not blocking
+**Type**: kernel-cycle (cycle 5 candidate)
+**Predicted verdict**: RED (M4 likely negative, per cycle 4 lesson 4)
+**Priority**: low (incremental; trend prediction)
+**Filed**: 2026-05-18
+
+## Background
+
+Cycle 4 (LPF wd=8) closed PASS with M4 delta +4.1 % vs cycle 2 wd=4's
++6.9 %. The downward trend prompted Phase 9 lesson: "wd=16 would
+probably show further R degradation; M4 may flip negative based on
+the trend line." See `docs/k4_lpf8_phase4_7.md §"Phase 9 lessons"`.
+
+This issue tracks the experiment to validate (or invalidate) that
+prediction.
+
+## What to do
+
+Cycle 5 LPF wd=16, mirroring cycle 4's compact structure:
+
+1. **Phase 3**: build `tests/bench_neon_lpf16.c` modelled on
+   `bench_neon_lpf8.c`. NEON symbol: `ff_vp9_loop_filter_h_16_16_neon`
+   (already in vendored `vp9lpf_neon.S`). Capture M3.
+2. **Phase 4-7**: write `src/v3d_lpf_h_16_16.comp` extending the
+   wd=8 kernel with the wd=16 outer-flat path (`flat8out` test, 14
+   writes per row when both flat8out and flat8in pass). New
+   contract: `dst_stride_u8 ≥ 14` (vs cycle 4's ≥ 6) because the
+   flat8out path writes at `base-7..base+6` (14 contiguous bytes).
+3. **Phase 5 review**: mandatory — wd=16 is not as incremental as
+   wd=8 (much larger conditional logic, new contract bound).
+4. **Phase 7**: measure M2, R; if M4 negative as predicted, document
+   trend confirmation and close kernel as "CPU-only" in deployment
+   recipe.
+
+## Expected outcome (per prediction)
+
+| Quantity | Predicted |
+|---|---|
+| M1 bit-exact | 100 % (same pattern as cycles 2/4) |
+| M3 NEON | ~55 Medge/s (slightly faster than wd=8) |
+| M2 QPU isolation | ~12-15 Medge/s |
+| R isolation | 0.22-0.27 (ORANGE, downward) |
+| M4 mixed vs NEON-4 | -2 % to +1 % (borderline; likely negative) |
+| 30fps margin | still 5×+ (user-facing PASS regardless) |
+
+## Acceptance criteria (issue closed when)
+
+- Cycle 5 phases 1-7 complete, committed
+- `docs/k5_lpf16_phase*.md` produced
+- Phase 7 verdict documented, deployment recipe updated either way
+- Phase 9 lesson 4 trend prediction validated or refuted
+
+## Why deferred (not done in current session)
+
+The session goal was "continue until user intervention necessary."
+User directed: file as issue, progress to cycle 5 CDEF instead.
+The trend prediction is interesting but the project's deployment
+recipe is already locked through cycle 4; cycle 5 wd=16 result
+would update at most one row of the recipe table.
+
+## Related
+
+- `docs/k4_lpf8_phase4_7.md §"Phase 9 lessons"` lesson 4 (the
+  prediction this validates)
+- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`
+  (NEON ref already vendored — symbol `ff_vp9_loop_filter_h_16_16_neon`)
+- `docs/k2_deblock_phase4.md` (cycle 2 template)
+- `docs/k4_lpf8_phase4_7.md` (cycle 4 template, the most direct
+  reference)
diff --git a/docs/issues/002-lpf-vertical-variants.md b/docs/issues/002-lpf-vertical-variants.md
new file mode 100644
index 0000000..5ea8701
--- /dev/null
+++ b/docs/issues/002-lpf-vertical-variants.md
@@ -0,0 +1,82 @@
+# Issue 002 — VP9 LPF vertical variants (v_4_8 / v_8_8)
+
+**Status**: open, not blocking
+**Type**: kernel-cycle (cycle 5/6 candidate)
+**Predicted verdict**: similar to horizontal cousins (k2/k4 = YELLOW PASS)
+**Priority**: low (different memory pattern; completeness)
+**Filed**: 2026-05-18
+
+## Background
+
+Cycles 2 and 4 implemented the **horizontal-direction** LPF inner
+filters (`h_4_8`, `h_8_8`). The corresponding **vertical-direction**
+filters (`v_4_8`, `v_8_8`) have the same arithmetic but a different
+memory access pattern: column-strided reads of 8 pixels (one per row)
+vs row-strided reads of 8 pixels (one per column).
+
+Concretely from `vp9dsp_template.c`:
+- `h_*_*_neon`: stridea=stride, strideb=1 (advance rows, neighborhood in cols)
+- `v_*_*_neon`: stridea=1, strideb=stride (advance cols, neighborhood in rows)
+
+The vertical variant tests whether the QPU's "8 lanes per row,
+contiguous read" assumption (cycles 2/4 wd=4/wd=8) generalises to
+the strided memory pattern. The TMU's coalescing behaviour may
+differ significantly when 8 lanes need to load from 8 different
+rows of the same column (cache-line-miss-y) vs 8 different cols of
+the same row (sequential).
+
+## What to do
+
+Cycle 5 or 6 (after CDEF), one cycle per variant:
+
+1. **v_4_8** — vertical 4-tap inner, 8-pixel edge (horizontal edge,
+   filter spans rows above/below).
+2. Optional **v_8_8** — vertical 8-tap inner.
+
+Each cycle: same shape as cycle 2/4 but
+- C reference: same `loop_filter` function, instantiated via
+  `lf_8_fn(v, 4, 1, stride)` (note: stridea + strideb swapped).
+- NEON: `ff_vp9_loop_filter_v_4_8_neon` (in vendored `vp9lpf_neon.S`).
+- QPU geometry: same 32-edges/WG, but per-edge memory access shape
+  changes — lanes now span 8 rows (strided by stride) of one column.
+
+## Key question to answer
+
+**Does the QPU's mixed-mode +6.9 % win (cycle 2 wd=4 horizontal)
+hold for the vertical variant?** The TMU latency / cache behaviour
+on column-strided reads is the main unknown. If positive: deployment
+recipe gains v variants symmetrically. If negative: deployment
+recipe needs to split by orientation (h on QPU, v on CPU).
+
+## Expected outcome
+
+| Quantity | Predicted |
+|---|---|
+| M1 bit-exact | 100 % |
+| M3 NEON | similar to h (NEON handles both orientations well) |
+| M2 QPU isolation | possibly LOWER than h variant (TMU column reads less coalesced) |
+| R isolation | 0.30-0.45 (ORANGE) |
+| M4 mixed | UNKNOWN — this is the load-bearing experiment |
+
+## Acceptance criteria
+
+- v_4_8 cycle phases 1-7 complete with M4 measurement
+- Decision: "v variants → QPU same as h" OR "v variants → CPU only"
+- Deployment recipe updated
+- Optional: v_8_8 follow-on cycle if v_4_8 was positive
+
+## Why deferred
+
+- Out of cycle 4's compressed scope (cycle 4 was a focused
+  wd=4 → wd=8 extension)
+- User-stated cycle 5 direction was CDEF (AV1 coverage), not VP9
+  variant completeness
+
+## Related
+
+- `docs/k2_deblock_phase4.md §"3. Workgroup geometry"` discusses
+  the 32-edges-per-WG mapping that needs revisiting for v variant
+- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S` —
+  NEON refs already vendored for both v_4_8 and v_8_8
+- `phase0.md §2` device profile — TMU read patterns relevant for
+  the column-strided question