daedalus-fourier/docs/k4_lpf8_phase1_3.md
marfrit 85feba4087 Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8
h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9
inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one
combined go (cycle compressed since incremental from cycle 2).

Phase 5 review explicitly skipped (incremental ~30-line shader
delta from cycle 2 + same geometry + cycle-2 RED-pattern checks
still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md
"Skipping phases is a deliberate choice that should be flagged."

Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536)
first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills,
27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling.

Performance:

  M3'''' NEON (single-core)  52.382 Medge/s
  M2'''' QPU isolation       17.847 Medge/s
  R''''                      0.341  → ORANGE band
  30fps floor margin         9.2x (isolation), 20.3x (mixed)

M4'''' concurrent matrix:

  NEON 4-core                37.823 Medge/s  <- baseline
  QPU only                   14.867 Medge/s
  MIXED NEON-3 + QPU         39.389 Medge/s  <- +4.1% PASS

Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside
cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete.

Cross-cycle LPF comparison:

  |                      | wd=4 (k2) | wd=8 (k4) |
  | M3 NEON (Medge/s)    | 48.3      | 52.4      |
  | M2 QPU iso (Medge/s) | 19.6      | 17.8      |
  | R iso                | 0.41      | 0.34      |
  | M4 delta             | +6.9%     | +4.1%     |
  | 30fps margin, mixed  | 7.2x      | 20.3x     |
  | Verdict              | GO QPU    | GO QPU    |

NEW finding (Phase 9 lesson): NEON gets faster per edge as filter
width grows (20.7 ns at wd=4 → 19.1 ns at wd=8) while the QPU gets
slower (19.6 → 17.8 Medge/s), so R drops from 0.41 to 0.34 and the
mixed gain from +6.9% to +4.1%. Extrapolating that trend, wd=16
would probably flip negative.

Deployment recipe with cycle 4:

  IDCT 8x8 (k1)     -> QPU   (R=0.92, +7% mixed)
  LPF wd=4 (k2)     -> QPU   (R=0.41, +7% mixed)
  LPF wd=8 (k4)     -> QPU   (R=0.34, +4% mixed)
  MC 8h (k3)        -> CPU   (R=0.067, -19% mixed)
  Entropy           -> CPU   (structural)

VP9 inner-edge LPF coverage complete. Project continues to higgs
deployment plumbing or further kernels per user direction.

New artifacts:
- src/v3d_lpf_h_8_8.comp                 — GLSL shader
- tests/vp9_lpf8_ref.c                   — standalone C ref
- tests/bench_neon_lpf8.c                — M1+M3 bench
- tests/bench_v3d_lpf8.c                 — M1+M2 bench
- tests/bench_concurrent_lpf8.c          — M4 pthread bench
- docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:56:25 +00:00

---
cycle: 4
phases: 1-3 (combined doc — straight extension of cycle 2)
status: phase 3 in progress
date_opened: 2026-05-18
parent_cycle: k3_mc_phase7.md
target_kernel: VP9 loop filter wd=8 inner-edge horizontal (h_8_8)
---
# Cycle 4, Phases 1-3 — LPF wd=8
Compact combined doc — cycle 4 is a *width extension* of cycle 2
(same kernel family, same shape, same NEON file).
## Phase 1 — goal
**Kernel**: VP9 loop filter, 8-tap inner-edge variant (wd=8), horizontal
direction, 8-pixel edge. libavcodec symbol `ff_vp9_loop_filter_h_8_8_neon`
(already in vendored `vp9lpf_neon.S`).
**Why this kernel**: completes VP9 LPF coverage alongside cycle 2's
wd=4. The wd=8 path adds the `flat8in` test (6 abs comparisons) and a
6-pixel "flat region" write path — meaningfully more conditional
branches than wd=4 within the same kernel family.
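
To make the extra work concrete, here is a minimal scalar sketch of the wd=8 flat path for one row of 8-bit pixels. It follows the general shape of the VP9 filter (six `abs` comparisons against a flatness threshold of 1, then six smoothed outputs) and is not copied from the vendored `vp9dsp_template.c`. The names `lpf8_flat8in` and `lpf8_flat_row` are hypothetical, and the E/I/H mask test that precedes this step is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flatness test for one row: six abs comparisons against threshold 1
 * (8-bit depth assumed). px points at q0; the taps are px[-4..3]. */
static int lpf8_flat8in(const uint8_t *px)
{
    assert(px != NULL);
    const int p0 = px[-1], q0 = px[0];
    return abs(px[-4] - p0) <= 1 && abs(px[-3] - p0) <= 1 &&
           abs(px[-2] - p0) <= 1 && abs(px[1] - q0) <= 1 &&
           abs(px[2] - q0) <= 1 && abs(px[3] - q0) <= 1;
}

/* Flat-region write path: six outputs (p2..q2); p3 and q3 stay untouched.
 * Every output is a rounded average whose weights sum to 8. */
static void lpf8_flat_row(uint8_t *px)
{
    assert(px != NULL);
    const int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    const int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];

    px[-3] = (uint8_t)((3 * p3 + 2 * p2 + p1 + p0 + q0 + 4) >> 3);
    px[-2] = (uint8_t)((2 * p3 + p2 + 2 * p1 + p0 + q0 + q1 + 4) >> 3);
    px[-1] = (uint8_t)((p3 + p2 + p1 + 2 * p0 + q0 + q1 + q2 + 4) >> 3);
    px[0]  = (uint8_t)((p2 + p1 + p0 + 2 * q0 + q1 + q2 + q3 + 4) >> 3);
    px[1]  = (uint8_t)((p1 + p0 + q0 + 2 * q1 + q2 + 2 * q3 + 4) >> 3);
    px[2]  = (uint8_t)((p0 + q0 + q1 + 2 * q2 + 3 * q3 + 4) >> 3);
}
```

On the QPU each lane would evaluate `lpf8_flat8in` for its own row, so lanes that disagree on the result are the source of the divergence cost discussed in this doc.
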
**Measurable success** (cycle-4 numbering, `''''` superscript):
| ID | Measurement | Gate |
|---|---|---|
| M1'''' | Bit-exact vs C reference | 100.0000 % |
| M2'''' | QPU throughput Medge/s | recorded |
| M3'''' | NEON `ff_vp9_loop_filter_h_8_8_neon` Medge/s | recorded |
| M4'''' | Mixed NEON-3 + QPU vs pure NEON-4 (Medge/s) | recorded if YELLOW |

Same R bands and 30fps-floor calibration as cycles 2/3.

**Predicted R''''**: 0.3-0.5. Cycle 2 LPF wd=4 hit R=0.41; wd=8 adds
~20 % more conditional logic (the flat8in test) plus extra writes
when flat8in passes, so R is likely modestly worse than wd=4. The
6-write flat8in path may dominate under SIMD divergence.
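
The ratio and gate arithmetic can be sketched as below. This assumes R is QPU isolation throughput divided by single-core NEON throughput (M2/M3) and that the M4 gate compares mixed NEON-3 + QPU against the pure NEON-4 baseline. The function names are hypothetical, and the band thresholds are left out because this doc does not restate them.

```c
#include <assert.h>

/* R: QPU isolation throughput relative to single-core NEON (assumed M2/M3). */
static double r_ratio(double m2_qpu_medge, double m3_neon_medge)
{
    assert(m3_neon_medge > 0.0);
    return m2_qpu_medge / m3_neon_medge;
}

/* M4 delta in percent: mixed NEON-3 + QPU versus the pure NEON-4 baseline. */
static double m4_delta_pct(double mixed_medge, double neon4_medge)
{
    assert(neon4_medge > 0.0);
    return (mixed_medge / neon4_medge - 1.0) * 100.0;
}
```

With the closure figures recorded for this cycle (M2'''' 17.847, M3'''' 52.382, mixed 39.389, NEON-4 37.823 Medge/s) these return roughly 0.341 and +4.1 %.
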
## Phase 2 — situation
C reference: `external/ffmpeg-snapshot/libavcodec/vp9dsp_template.c`,
the same `loop_filter()` function (lines 1780-1898) used in cycle 2
but invoked with wd=8 via the `lf_8_fn(h, 8, stride, 1)` macro
instantiation. The wd=8 path activates the `if (wd >= 8 && flat8in)`
branch.
NEON reference: already vendored at
`external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S`,
symbol `ff_vp9_loop_filter_h_8_8_neon`. Same calling convention
as wd=4: `(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)`.
No new vendored sources needed.
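
A throughput harness around that calling convention might look like the sketch below. The function-pointer type mirrors the signature quoted above; `lpf_stub`, `medge_per_s` and `bench_lpf` are hypothetical names, the E/I/H values are arbitrary placeholders, and `clock()` is used only to keep the sketch portable. The repo's own benches may time differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Same shape as the signature quoted for ff_vp9_loop_filter_h_8_8_neon. */
typedef void (*lpf_h_fn)(uint8_t *dst, ptrdiff_t stride, int E, int I, int H);

/* Hypothetical stand-in so the sketch links without the vendored .S file. */
static void lpf_stub(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
{
    (void)stride; (void)E; (void)I; (void)H;
    dst[0] ^= 1; /* touch memory so the call has a visible effect */
}

/* Convert an edge count and elapsed seconds to Medge/s. */
static double medge_per_s(uint64_t edges, double seconds)
{
    return seconds > 0.0 ? (double)edges / seconds / 1e6 : 0.0;
}

/* Time `iters` calls on one 8x8 block; each call filters one 8-pixel edge. */
static double bench_lpf(lpf_h_fn fn, uint64_t iters)
{
    uint8_t block[8 * 8] = {0};
    assert(fn != NULL && iters > 0);

    const clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++)
        fn(block + 4, 8, 20, 10, 2); /* dst at column 4: taps span -4..3 */
    const clock_t t1 = clock();

    return medge_per_s(iters, (double)(t1 - t0) / CLOCKS_PER_SEC);
}
```

Swapping `lpf_stub` for the real NEON symbol (declared `extern` and linked against the vendored assembly) would give an M3-style figure. The result is only meaningful when `iters` is large enough to dwarf the timer resolution.
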
**Workload model per edge (worst case, flat8in passes):**
- Writes: 8 rows × 6 pixels = 48 per edge (2 of the 8 pixels in each row stay untouched; wd=4 writes 16-32)
- Reads: 8 rows × 8 pixels = 64 per edge (same as wd=4)
- Compares: ~12 abs+compares per row × 8 rows = ~96 per edge (vs wd=4's ~50)

Memory traffic is similar to cycle 2 (~80-110 bytes per edge).
Compute is moderately higher (more conditional branches, plus more
writes when flat8in fires).
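
The per-edge counts above reduce to a few multiplications, sketched here so the model can be re-run for other widths. The struct and function names are hypothetical, and the compare count is the same rough estimate used in the list, not a measured value.

```c
#include <assert.h>

/* Per-edge operation counts for the worst case (flat8in passes on every row). */
struct lpf_workload {
    int reads;    /* pixel loads per edge */
    int writes;   /* pixel stores per edge */
    int compares; /* abs+compare operations per edge (approximate) */
};

static struct lpf_workload lpf_workload_model(int rows, int taps_read,
                                              int taps_written,
                                              int compares_per_row)
{
    assert(rows > 0 && taps_written <= taps_read);
    struct lpf_workload w = {
        .reads    = rows * taps_read,
        .writes   = rows * taps_written,
        .compares = rows * compares_per_row,
    };
    return w;
}
```

`lpf_workload_model(8, 8, 6, 12)` yields 64 reads, 48 writes and 96 compares. If traffic is counted as one byte per 8-bit pixel read or written, the worst case is 64 + 48 = 112 bytes, just above the upper end of the ~80-110 byte estimate.
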
## Phase 3 — NEON M3'''' baseline
(captured below after build + run)