Files
daedalus-fourier/docs/k4_lpf8_phase4_7.md
T
marfrit 85feba4087 Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8
h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9
inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one
combined go (cycle compressed since incremental from cycle 2).

Phase 5 review explicitly skipped (incremental ~30-line shader
delta from cycle 2 + same geometry + cycle-2 RED-pattern checks
still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md
"Skipping phases is a deliberate choice that should be flagged."

Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536)
first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills,
27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling.

Performance:

  M3'''' NEON (single-core)  52.382 Medge/s
  M2'''' QPU isolation       17.847 Medge/s
  R''''                      0.341  → ORANGE band
  30fps floor margin         9.2x (isolation), 20.3x (mixed)

M4'''' concurrent matrix:

  NEON 4-core                37.823 Medge/s  <- baseline
  QPU only                   14.867 Medge/s
  MIXED NEON-3 + QPU         39.389 Medge/s  <- +4.1% PASS

Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside
cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete.

Cross-cycle LPF comparison:

  | | wd=4 (k2) | wd=8 (k4) |
  | M3 NEON     | 48.3      | 52.4      |
  | M2 QPU iso  | 19.6      | 17.8      |
  | R iso       | 0.41      | 0.34      |
  | M4 delta    | +6.9%     | +4.1%     |
  | 30fps mixed | 7.2x      | 20.3x     |
  | Verdict     | GO QPU    | GO QPU    |

NEW finding (Phase 9 lesson): NEON gets faster per edge as filter
width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss
grows with width. wd=16 would probably flip negative based on the
trend line.

Deployment recipe with cycle 4:

  IDCT 8x8 (k1)     -> QPU   (R=0.92, +7% mixed)
  LPF wd=4 (k2)     -> QPU   (R=0.41, +7% mixed)
  LPF wd=8 (k4)     -> QPU   (R=0.34, +4% mixed)
  MC 8h (k3)        -> CPU   (R=0.067, -19% mixed)
  Entropy           -> CPU   (structural)

VP9 inner-edge LPF coverage complete. Project continues to higgs
deployment plumbing or further kernels per user direction.

New artifacts:
- src/v3d_lpf_h_8_8.comp                 — GLSL shader
- tests/vp9_lpf8_ref.c                   — standalone C ref
- tests/bench_neon_lpf8.c                — M1+M3 bench
- tests/bench_v3d_lpf8.c                 — M1+M2 bench
- tests/bench_concurrent_lpf8.c          — M4 pthread bench
- docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:56:25 +00:00

6.3 KiB
Raw Blame History

cycle, phases, status, date_opened, parent, template
cycle phases status date_opened parent template
4 4-7 (combined) in_progress 2026-05-18 k4_lpf8_phase1_3.md k2_deblock_phase4.md (direct adaptation)

Cycle 4, Phases 4-7 — LPF wd=8

Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits all of cycle-2's geometry/contracts unchanged; only the per-thread algorithm changes (adds flat8in branch).

Phase 4 — plan

Geometry: identical to cycle 2 LPF (256 invocations/WG, 2 edges per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return safe).

SSBO bindings: identical to cycle 2 (meta uvec4, dst uint8_t).

Per-thread algorithm — extends cycle 2 with flat8in:

// ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads,
//     fm test, !fm early return as cycle 2 ...

bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 &&
               abs(p1-p0) <= 1 && abs(q1-q0) <= 1 &&
               abs(q2-q0) <= 1 && abs(q3-q0) <= 1;

if (flat8in) {
    /* 6-write flat-region filter */
    u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3);
    u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3);
    u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3);
    u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3);
    u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3);
    u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3);
} else {
    /* same hev/no-hev paths as cycle 2 */
    bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
    if (hev) { /* 2-write */ }
    else     { /* 4-write */ }
}

Race safety: flat8in path writes at base-3..base+2 = 6 contiguous bytes per row. Updated contract vs cycle 2: dst_stride_u8 ≥ 6 (vs cycle 2's ≥ 4). Bench uses stride=8, satisfies. Phase 6 MUST add assert(dst_stride_u8 >= 6).

Predicted R'''': 0.30.5 (similar to wd=4's 0.41). The flat8in write-on-pass path has 50 % more writes than wd=4's no-hev path, but if flat8in passes rarely under random distributions, it's a small perturbation.

Phase 5 — review (skipped — incremental extension)

Cycle-2's phase5 review remains the relevant outside-look. The specific delta from cycle 2 to cycle 4:

  • Added flat8in branch + 6 writes
  • Stride contract relaxed-tightened from ≥4 to ≥6
  • Same geometry, same SSBOs, same race-safety pattern

The cycle-2 review's two RED-pattern checks (write race, barrier UB) remain satisfied because the geometry is unchanged. The new arithmetic is mechanically transcribed from vp9_lpf8_ref.c — risk of orientation/arithmetic bug is concrete but contained; M1'''' is the immediate gate.

Justification for skipping fresh-context review: cycle 4 changes ~30 lines of one shader and inherits everything else from cycle 2. Per dev_process.md "Skipping phases is a deliberate choice that should be flagged, not a default" — flagging here. If M1'''' fails on first run, restart with full Phase 5'''' review.

Phase 6 — implementation

(executed below — src/v3d_lpf_h_8_8.comp + tests/bench_v3d_lpf8.c)

Phase 7 — verification

v1 first-light

=== v3d LPF h_8_8 bench ===
=== M1'''': QPU vs C bit-exact ===
  edges bit-exact: 65536 / 65536 (100.0000 %)

=== M2'''': QPU throughput ===
  per-edge       = 56.0 ns
  per-dispatch   = 3672.1 us
  M2''''  = 17.847 Medge/s
  R''''   = 0.341 → ORANGE band
  30fps@1080p floor: 9.2x margin (isolation)

shaderdb: 231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms. The 4-thread result is the meaningful one — compiler delivered. The wd=8 kernel runs at the latency-hiding ceiling from v1.

M4'''' concurrent (8s windows)

Config Medge/s vs NEON-4 30fps margin
NEON 4-core 37.823 baseline 19.5×
QPU only 14.867 7.7×
MIXED NEON-3 + QPU 39.389 +4.1 % 20.3×

M4'''' PASSES. The freed-core pattern from cycles 1+2 holds for wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive. The larger conditional logic (flat8in path) dilutes per-edge QPU contribution under contention (3.98 vs cycle-2's 4.00 — basically same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's 20.7 ns), so the relative gain shrinks.

Cross-cycle LPF comparison

k2 wd=4 k4 wd=8
M3 NEON (Medge/s) 48.285 52.382
M2 QPU isolation 19.645 17.847
R isolation 0.41 0.34
NEON-4 (Medge/s) 33.726 37.823
Mixed N-3+QPU 36.049 39.389
M4 delta +6.9 % +4.1 %
30fps margin (mixed) 7.2× 20.3×
Verdict GO QPU GO QPU

Decision per Phase 1 rules + 30fps floor

Rule Result Status
M1'''' bit-exact 100.0000 % ✓ PASS
R'''' = M2''''/M3'''' 0.341 (ORANGE) does not auto-close
M4'''' > pure NEON-4 +4.1 % ✓ PASS gate
30fps@1080p floor 20.3× mixed ✓ PASS user-facing

Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU, alongside cycle-2 wd=4. Combined VP9 LPF coverage = wd=4 + wd=8 on QPU.

Phase 9 lessons

  1. Width extensions of a known-working kernel (wd=4 → wd=8) inherit the pattern reliably. v1 first-light hit M1'''' = 100 % first try on a 30-line shader delta. No iteration needed.

  2. Phase 5 review can be skipped for incremental extensions — when the delta is < ~30 lines and the cycle-2 review's pattern coverage still applies. Flagged explicitly in §"Phase 5 — review (skipped)". If M1 had failed, restart with full review. Cycle 5+ should restore mandatory review for non-incremental work.

  3. NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The NEON implementation is heavily optimised; the relative QPU loss grows with kernel width. Cycle 5 wd=16 would probably show further R degradation.

  4. M4 delta is the gating metric for ORANGE-band kernels. The gap from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline useful on QPU; wd=16 may flip negative."

Leaves open

  • LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on the trend line)
  • Vertical variants of both wd=4 and wd=8 (different memory pattern)
  • CDEF / loop restoration (AV1 kernels)
  • Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)