Per user direction at cycle-4 close: file wd=16 (trend prediction validation) and vertical variants (column-stride TMU behaviour unknown) as local issues for future cycles. Progress instead to CDEF (AV1) for codec breadth.

docs/issues/001: wd=16 prediction validation. Per cycle 4 lesson 4, the trend says wd=16 likely flips M4 negative. Quick incremental cycle when revisited.

docs/issues/002: vertical variants. Different memory access pattern (column-strided vs row-strided). The load-bearing unknown is whether the cycle 2 +6.9% mixed gain survives the TMU coalescing shift. If positive, the deployment recipe gains symmetry; if negative, it must split by orientation.

Both issues have acceptance criteria and expected outcomes documented.

Cycle 5 next: CDEF (AV1), a codec-breadth expansion.

No Gitea repo exists for daedalus-fourier yet (the project is local-only). If a tracker is wanted, create the repo and migrate these .md files. For now they live in-tree as part of the project history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue 001 — VP9 LPF wd=16 cycle (prediction validation)
- Status: open, not blocking
- Type: kernel-cycle (cycle 5 candidate)
- Predicted verdict: RED (M4 likely negative, per cycle 4 lesson 4)
- Priority: low (incremental; trend prediction)
- Filed: 2026-05-18
Background
Cycle 4 (LPF wd=8) closed PASS with an M4 delta of +4.1 %, down from
cycle 2 wd=4's +6.9 %. The downward trend prompted the Phase 9 lesson:
"wd=16 would probably show further R degradation; M4 may flip negative
based on the trend line." See docs/k4_lpf8_phase4_7.md §"Phase 9 lessons".
This issue tracks the experiment to validate (or invalidate) that prediction.
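To make the "trend line" concrete, here is a minimal sketch that extrapolates the two measured M4 deltas to wd=16. Both models are assumptions: the source does not say whether the trend is linear in wd or per doubling of wd, and two points cannot tell the two apart. All type and function names are hypothetical.

```c
#include <assert.h>

/* One measured point: filter width and M4 delta in percent vs NEON-4. */
typedef struct {
    int wd;
    double m4_delta_pct;
} lpf_point;

/* Values quoted in the Background section above. */
static const lpf_point K2_WD4 = { 4, 6.9 };  /* cycle 2 */
static const lpf_point K4_WD8 = { 8, 4.1 };  /* cycle 4 */

/* Integer log2 for a power-of-two width (4 -> 2, 8 -> 3, 16 -> 4). */
int log2_pow2(int wd)
{
    assert(wd > 0 && (wd & (wd - 1)) == 0);
    int n = 0;
    while (wd > 1) {
        wd >>= 1;
        n++;
    }
    return n;
}

/* Assumption A: the delta changes linearly with wd itself. */
double extrapolate_linear_wd(lpf_point a, lpf_point b, int wd)
{
    assert(a.wd != b.wd);
    double slope = (b.m4_delta_pct - a.m4_delta_pct) / (double)(b.wd - a.wd);
    return b.m4_delta_pct + slope * (double)(wd - b.wd);
}

/* Assumption B: the delta drops by a fixed amount per doubling of wd. */
double extrapolate_per_doubling(lpf_point a, lpf_point b, int wd)
{
    int steps_ab = log2_pow2(b.wd) - log2_pow2(a.wd);
    assert(steps_ab != 0);
    double slope = (b.m4_delta_pct - a.m4_delta_pct) / (double)steps_ab;
    return b.m4_delta_pct + slope * (double)(log2_pow2(wd) - log2_pow2(b.wd));
}
```

With these two points, `extrapolate_linear_wd(K2_WD4, K4_WD8, 16)` gives about -1.5 % and `extrapolate_per_doubling(K2_WD4, K4_WD8, 16)` gives about +1.3 %. The sign of the prediction therefore depends on the assumed model, which is one reason the measurement is worth running.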
What to do
Cycle 5 LPF wd=16, mirroring cycle 4's compact structure:
- Phase 3: build `tests/bench_neon_lpf16.c` modelled on `bench_neon_lpf8.c`. NEON symbol: `ff_vp9_loop_filter_h_16_16_neon` (already in vendored `vp9lpf_neon.S`). Capture M3.
- Phase 4-7: write `src/v3d_lpf_h_16_16.comp` extending the wd=8 kernel with the wd=16 outer-flat path (`flat8out` test, 14 writes per row when both flat8out and flat8in pass). New contract: `dst_stride_u8 ≥ 14` (vs cycle 4's ≥ 6) because the flat8out path writes at `base-7..base+6` (14 contiguous bytes).
- Phase 5 review: mandatory. wd=16 is not as incremental as wd=8 (much larger conditional logic, new contract bound).
- Phase 7: measure M2, R; if M4 is negative as predicted, document trend confirmation and close the kernel as "CPU-only" in the deployment recipe.
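The new stride contract from the Phase 4-7 item can be sketched as a host-side check. This assumes the bound exists so that one row's 14-byte write window (`base-7..base+6`) cannot overlap the next row's window. The function names, the constant names, and the idea of validating on the host are hypothetical and not taken from the project.

```c
#include <stdbool.h>
#include <stddef.h>

/* Write window of the flat8out path relative to the edge position `base`:
 * bytes base-7 .. base+6, i.e. 14 contiguous bytes per row. */
enum {
    LPF16_WRITE_BEFORE = 7,
    LPF16_WRITE_AFTER  = 6,
    LPF16_WRITE_SPAN   = LPF16_WRITE_BEFORE + 1 + LPF16_WRITE_AFTER
};

/* Contract from the issue text: dst_stride_u8 >= 14 (cycle 4 needed >= 6). */
bool lpf16_stride_ok(size_t dst_stride_u8)
{
    return dst_stride_u8 >= LPF16_WRITE_SPAN;
}

/* Returns true when every write for `rows` rows stays inside a buffer of
 * `buf_len` bytes. `base` is the byte offset of the edge in the first row.
 * Overflow of the offset arithmetic is not checked in this sketch. */
bool lpf16_writes_in_bounds(size_t buf_len, size_t base,
                            size_t dst_stride_u8, size_t rows)
{
    if (rows == 0)
        return true;
    if (!lpf16_stride_ok(dst_stride_u8))
        return false;
    if (base < LPF16_WRITE_BEFORE)
        return false;  /* first row would write before the buffer start */
    size_t last_row_base = base + (rows - 1) * dst_stride_u8;
    return last_row_base + LPF16_WRITE_AFTER < buf_len;
}
```

For example, 16 rows at stride 32 with `base = 7` touch offsets 0 through 493, so a 512-byte buffer passes while a stride of 13 is rejected outright.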
Expected outcome (per prediction)
| Quantity | Predicted |
|---|---|
| M1 bit-exact | 100 % (same pattern as cycles 2/4) |
| M3 NEON | ~55 Medge/s (slightly faster than wd=8) |
| M2 QPU isolation | ~12-15 Medge/s |
| R isolation | 0.22-0.27 (ORANGE, downward) |
| M4 mixed vs NEON-4 | -2 % to +1 % (borderline; likely negative) |
| 30fps margin | still 5×+ (user-facing PASS regardless) |
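The predicted rows are mutually consistent if R is read as M2 divided by M3: 12/55 ≈ 0.22 and 15/55 ≈ 0.27. The sketch below encodes that reading plus the Phase 7 decision rule. The definitions are assumptions: R = M2/M3 is inferred from the table rather than stated, the M4 delta is assumed to be the relative throughput change of the mixed run against NEON-4, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed definition: R = QPU-isolation throughput / NEON throughput,
 * both in Medge/s. */
double isolation_ratio(double m2_qpu_medge_s, double m3_neon_medge_s)
{
    assert(m3_neon_medge_s > 0.0);
    return m2_qpu_medge_s / m3_neon_medge_s;
}

/* Assumed definition: M4 delta in percent, mixed run relative to NEON-4. */
double m4_delta_pct(double mixed_medge_s, double neon4_medge_s)
{
    assert(neon4_medge_s > 0.0);
    return 100.0 * (mixed_medge_s - neon4_medge_s) / neon4_medge_s;
}

/* Phase 7 rule from the issue text: a negative M4 delta closes the kernel
 * as "CPU-only" in the deployment recipe. */
bool close_as_cpu_only(double delta_pct)
{
    return delta_pct < 0.0;
}
```

Under these assumptions, a mixed run at 98 Medge/s against a NEON-4 baseline of 100 Medge/s gives a delta of -2 %, which would close the kernel as CPU-only. Those two throughput figures are illustrative only and are not measurements.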
Acceptance criteria (issue closed when)
- Cycle 5 phases 1-7 complete, committed
- `docs/k5_lpf16_phase*.md` produced
- Phase 7 verdict documented, deployment recipe updated either way
- Phase 9 lesson 4 trend prediction validated or refuted
Why deferred (not done in current session)
The session goal was "continue until user intervention necessary." User directed: file as issue, progress to cycle 5 CDEF instead. The trend prediction is interesting but the project's deployment recipe is already locked through cycle 4; cycle 5 wd=16 result would update at most one row of the recipe table.
Related
- `docs/k4_lpf8_phase4_7.md` §"Phase 9 lessons" lesson 4 (the prediction this validates)
- `external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S` (NEON ref already vendored; symbol `ff_vp9_loop_filter_h_16_16_neon`)
- `docs/k2_deblock_phase4.md` (cycle 2 template)
- `docs/k4_lpf8_phase4_7.md` (cycle 4 template, the most direct reference)