daedalus-fourier/docs/k4_lpf8_phase4_7.md

---
cycle: 4
phases: 4-7 (combined)
status: in_progress
date_opened: 2026-05-18
parent: k4_lpf8_phase1_3.md
template: k2_deblock_phase4.md (direct adaptation)
---

# Cycle 4, Phases 4-7 — LPF wd=8

Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits
all of cycle-2's geometry/contracts unchanged; only the per-thread
algorithm changes (adds flat8in branch).

## Phase 4 — plan

**Geometry**: identical to cycle 2 LPF (256 invocations/WG, 2 edges
per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return
safe).

**SSBO bindings**: identical to cycle 2 (meta uvec4, dst uint8_t).

**Per-thread algorithm** — extends cycle 2 with flat8in:
```glsl
// ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads,
//     fm test, !fm early return as cycle 2 ...

bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 &&
               abs(p1-p0) <= 1 && abs(q1-q0) <= 1 &&
               abs(q2-q0) <= 1 && abs(q3-q0) <= 1;

if (flat8in) {
    /* 6-write flat-region filter */
    u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3);
    u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3);
    u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3);
    u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3);
    u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3);
    u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3);
} else {
    /* same hev/no-hev paths as cycle 2 */
    bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
    if (hev) { /* 2-write */ }
    else     { /* 4-write */ }
}
```

**Race safety**: flat8in path writes at `base-3..base+2` = 6
contiguous bytes per row. **Updated contract** vs cycle 2:
`dst_stride_u8 ≥ 6` (vs cycle 2's `≥ 4`). Bench uses stride=8,
satisfies. Phase 6 MUST add `assert(dst_stride_u8 >= 6)`.

**Predicted R''''**: 0.3–0.5 (similar to wd=4's 0.41). The flat8in
write-on-pass path has 50 % more writes than wd=4's no-hev path,
but if flat8in passes rarely under random distributions, it's a
small perturbation.

## Phase 5 — review (skipped — incremental extension)

Cycle-2's phase5 review remains the relevant outside-look. The
specific delta from cycle 2 to cycle 4:
- Added flat8in branch + 6 writes
- Stride contract relaxed-tightened from ≥4 to ≥6
- Same geometry, same SSBOs, same race-safety pattern

The cycle-2 review's two RED-pattern checks (write race, barrier UB)
remain satisfied because the geometry is unchanged. The new
arithmetic is mechanically transcribed from `vp9_lpf8_ref.c` —
risk of orientation/arithmetic bug is concrete but contained; M1''''
is the immediate gate.

**Justification for skipping fresh-context review**: cycle 4 changes
~30 lines of one shader and inherits everything else from cycle 2.
Per dev_process.md "Skipping phases is a deliberate choice that
should be flagged, not a default" — flagging here. If M1'''' fails
on first run, restart with full Phase 5'''' review.

## Phase 6 — implementation

(executed below — `src/v3d_lpf_h_8_8.comp` + `tests/bench_v3d_lpf8.c`)

## Phase 7 — verification

### v1 first-light
```
=== v3d LPF h_8_8 bench ===
=== M1'''': QPU vs C bit-exact ===
  edges bit-exact: 65536 / 65536 (100.0000 %)

=== M2'''': QPU throughput ===
  per-edge       = 56.0 ns
  per-dispatch   = 3672.1 us
  M2''''  = 17.847 Medge/s
  R''''   = 0.341 → ORANGE band
  30fps@1080p floor: 9.2x margin (isolation)
```

shaderdb: **231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms.**
The 4-thread result is the meaningful one — compiler delivered. The
wd=8 kernel runs at the latency-hiding ceiling from v1.

### M4'''' concurrent (8s windows)

| Config | Medge/s | vs NEON-4 | 30fps margin |
|---|---|---|---|
| **NEON 4-core** | **37.823** | baseline | 19.5× |
| QPU only | 14.867 | — | 7.7× |
| **MIXED NEON-3 + QPU** | **39.389** | **+4.1 %** | 20.3× |

**M4'''' PASSES**. The freed-core pattern from cycles 1+2 holds for
wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive.
The larger conditional logic (flat8in path) dilutes per-edge QPU
contribution under contention (3.98 vs cycle-2's 4.00 — basically
same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because
the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's
20.7 ns), so the relative gain shrinks.

### Cross-cycle LPF comparison

| | k2 wd=4 | k4 wd=8 |
|---|---|---|
| M3 NEON (Medge/s) | 48.285 | 52.382 |
| M2 QPU isolation | 19.645 | 17.847 |
| R isolation | 0.41 | 0.34 |
| NEON-4 (Medge/s) | 33.726 | 37.823 |
| Mixed N-3+QPU | 36.049 | 39.389 |
| M4 delta | **+6.9 %** | **+4.1 %** |
| 30fps margin (mixed) | 7.2× | 20.3× |
| Verdict | GO QPU | GO QPU |

### Decision per Phase 1 rules + 30fps floor

| Rule | Result | Status |
|---|---|---|
| M1'''' bit-exact | 100.0000 % | ✓ PASS |
| R'''' = M2''''/M3'''' | 0.341 (ORANGE) | does not auto-close |
| M4'''' > pure NEON-4 | +4.1 % | ✓ PASS gate |
| 30fps@1080p floor | 20.3× mixed | ✓ PASS user-facing |

**Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU,
alongside cycle-2 wd=4.** Combined VP9 LPF coverage = wd=4 + wd=8
on QPU.

### Phase 9 lessons

1. Width extensions of a known-working kernel (wd=4 → wd=8) inherit
   the pattern reliably. v1 first-light hit M1'''' = 100 % first try
   on a 30-line shader delta. No iteration needed.

2. **Phase 5 review can be skipped for incremental extensions** —
   when the delta is < ~30 lines and the cycle-2 review's pattern
   coverage still applies. Flagged explicitly in §"Phase 5 — review
   (skipped)". If M1 had failed, restart with full review. Cycle 5+
   should restore mandatory review for non-incremental work.

3. NEON gets faster per edge as filter width grows (20.7 → 19.1 ns
   wd=4 → wd=8). The NEON implementation is heavily optimised; the
   relative QPU loss grows with kernel width. Cycle 5 wd=16 would
   probably show further R degradation.

4. M4 delta is the gating metric for ORANGE-band kernels. The gap
   from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline
   useful on QPU; wd=16 may flip negative."

### Leaves open

- LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on
  the trend line)
- Vertical variants of both wd=4 and wd=8 (different memory pattern)
- CDEF / loop restoration (AV1 kernels)
- Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)