Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: | | wd=4 (k2) | wd=8 (k4) | | M3 NEON | 48.3 | 52.4 | | M2 QPU iso | 19.6 | 17.8 | | R iso | 0.41 | 0.34 | | M4 delta | +6.9% | +4.1% | | 30fps mixed | 7.2x | 20.3x | | Verdict | GO QPU | GO QPU | NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.3 KiB
cycle, phases, status, date_opened, parent, template
| cycle | phases | status | date_opened | parent | template |
|---|---|---|---|---|---|
| 4 | 4-7 (combined) | in_progress | 2026-05-18 | k4_lpf8_phase1_3.md | k2_deblock_phase4.md (direct adaptation) |
Cycle 4, Phases 4-7 — LPF wd=8
Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits all of cycle-2's geometry/contracts unchanged; only the per-thread algorithm changes (adds flat8in branch).
Phase 4 — plan
Geometry: identical to cycle 2 LPF (256 invocations/WG, 2 edges per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return safe).
SSBO bindings: identical to cycle 2 (meta uvec4, dst uint8_t).
Per-thread algorithm — extends cycle 2 with flat8in:
// ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads,
// fm test, !fm early return as cycle 2 ...
bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 &&
abs(p1-p0) <= 1 && abs(q1-q0) <= 1 &&
abs(q2-q0) <= 1 && abs(q3-q0) <= 1;
if (flat8in) {
/* 6-write flat-region filter */
u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3);
u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3);
u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3);
u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3);
u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3);
u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3);
} else {
/* same hev/no-hev paths as cycle 2 */
bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
if (hev) { /* 2-write */ }
else { /* 4-write */ }
}
Race safety: flat8in path writes at base-3..base+2 = 6
contiguous bytes per row. Updated contract vs cycle 2:
dst_stride_u8 ≥ 6 (vs cycle 2's ≥ 4). Bench uses stride=8,
satisfies. Phase 6 MUST add assert(dst_stride_u8 >= 6).
Predicted R'''': 0.3–0.5 (similar to wd=4's 0.41). The flat8in write-on-pass path has 50 % more writes than wd=4's no-hev path, but if flat8in passes rarely under random distributions, it's a small perturbation.
Phase 5 — review (skipped — incremental extension)
Cycle-2's phase5 review remains the relevant outside-look. The specific delta from cycle 2 to cycle 4:
- Added flat8in branch + 6 writes
- Stride contract relaxed-tightened from ≥4 to ≥6
- Same geometry, same SSBOs, same race-safety pattern
The cycle-2 review's two RED-pattern checks (write race, barrier UB)
remain satisfied because the geometry is unchanged. The new
arithmetic is mechanically transcribed from vp9_lpf8_ref.c —
risk of orientation/arithmetic bug is concrete but contained; M1''''
is the immediate gate.
Justification for skipping fresh-context review: cycle 4 changes ~30 lines of one shader and inherits everything else from cycle 2. Per dev_process.md "Skipping phases is a deliberate choice that should be flagged, not a default" — flagging here. If M1'''' fails on first run, restart with full Phase 5'''' review.
Phase 6 — implementation
(executed below — src/v3d_lpf_h_8_8.comp + tests/bench_v3d_lpf8.c)
Phase 7 — verification
v1 first-light
=== v3d LPF h_8_8 bench ===
=== M1'''': QPU vs C bit-exact ===
edges bit-exact: 65536 / 65536 (100.0000 %)
=== M2'''': QPU throughput ===
per-edge = 56.0 ns
per-dispatch = 3672.1 us
M2'''' = 17.847 Medge/s
R'''' = 0.341 → ORANGE band
30fps@1080p floor: 9.2x margin (isolation)
shaderdb: 231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms. The 4-thread result is the meaningful one — compiler delivered. The wd=8 kernel runs at the latency-hiding ceiling from v1.
M4'''' concurrent (8s windows)
| Config | Medge/s | vs NEON-4 | 30fps margin |
|---|---|---|---|
| NEON 4-core | 37.823 | baseline | 19.5× |
| QPU only | 14.867 | — | 7.7× |
| MIXED NEON-3 + QPU | 39.389 | +4.1 % | 20.3× |
M4'''' PASSES. The freed-core pattern from cycles 1+2 holds for wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive. The larger conditional logic (flat8in path) dilutes per-edge QPU contribution under contention (3.98 vs cycle-2's 4.00 — basically same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's 20.7 ns), so the relative gain shrinks.
Cross-cycle LPF comparison
| k2 wd=4 | k4 wd=8 | |
|---|---|---|
| M3 NEON (Medge/s) | 48.285 | 52.382 |
| M2 QPU isolation | 19.645 | 17.847 |
| R isolation | 0.41 | 0.34 |
| NEON-4 (Medge/s) | 33.726 | 37.823 |
| Mixed N-3+QPU | 36.049 | 39.389 |
| M4 delta | +6.9 % | +4.1 % |
| 30fps margin (mixed) | 7.2× | 20.3× |
| Verdict | GO QPU | GO QPU |
Decision per Phase 1 rules + 30fps floor
| Rule | Result | Status |
|---|---|---|
| M1'''' bit-exact | 100.0000 % | ✓ PASS |
| R'''' = M2''''/M3'''' | 0.341 (ORANGE) | does not auto-close |
| M4'''' > pure NEON-4 | +4.1 % | ✓ PASS gate |
| 30fps@1080p floor | 20.3× mixed | ✓ PASS user-facing |
Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU, alongside cycle-2 wd=4. Combined VP9 LPF coverage = wd=4 + wd=8 on QPU.
Phase 9 lessons
-
Width extensions of a known-working kernel (wd=4 → wd=8) inherit the pattern reliably. v1 first-light hit M1'''' = 100 % first try on a 30-line shader delta. No iteration needed.
-
Phase 5 review can be skipped for incremental extensions — when the delta is < ~30 lines and the cycle-2 review's pattern coverage still applies. Flagged explicitly in §"Phase 5 — review (skipped)". If M1 had failed, restart with full review. Cycle 5+ should restore mandatory review for non-incremental work.
-
NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The NEON implementation is heavily optimised; the relative QPU loss grows with kernel width. Cycle 5 wd=16 would probably show further R degradation.
-
M4 delta is the gating metric for ORANGE-band kernels. The gap from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline useful on QPU; wd=16 may flip negative."
Leaves open
- LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on the trend line)
- Vertical variants of both wd=4 and wd=8 (different memory pattern)
- CDEF / loop restoration (AV1 kernels)
- Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)