Files
fresnel-fourier/phase7_iter5b_verification_v2.md
marfrit c773c3d2c1 iter5b-β Phase 7: PARTIAL PASS — VP9 unblocked, MPEG-2 maintained, HEVC+VP8 partial
Two acts:
Act 1 (β alone): all 5 libva codecs returned all-zero. MPEG-2 was a
regression (pre-β it worked); HEVC was unchanged (kernel returns
DQBUF FLAG_ERROR pre AND post β — same Phase 3 baseline showed it).
Root cause: ffmpeg-vaapi-copy passes surfaces_count=0 to vaCreateContext
per iter6 context.c:262 comment; my β walk of surfaces_ids[] was a
no-op → destination_planes_count stayed 0 → surface_bind_slot no-op
→ all-zero readback.

Act 2 (Commit D): cache format-uniform CAPTURE geometry in driver_data;
walk surface_heap in CreateContext; lazy-fill in CreateSurfaces2 when
fmt_valid is set; invalidate in DestroyContext. Restores MPEG-2 to
pre-β state and unlocks VP9.

Per Phase 1 criteria: criterion 1 PARTIAL (VP9 of HEVC+VP9+VP8);
criteria 2-4 PASS.

Bug 5 (NEW): HEVC libva DQBUF FLAG_ERROR — pre-existing kernel
rejection; β's OUTPUT format fix didn't address it. Transitive proof
at iter2 verified control payload shape but kernel still rejects;
some other V4L2 protocol contract aspect differs from kdirect.

Bug 6 (NEW): VP8 libva produces non-zero output with real content
(74.8% zero + 256 unique bytes incl. keyframe pixels at `93 8e 8a 89...`)
but diverges from kdirect. Decode runs; output mismatch likely
slot-rotation or partial-fill bug.

VP9 is iter5b-β's only clean PASS. Architecture-wise β succeeded:
no α'-style failure mode possible (no in-CreateSurfaces2 destructive
teardown), and the CRIT-1+CRIT-2 fixes from Phase 5 v2 review held.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:56:26 +00:00

117 lines
9.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 5b — Phase 7 v2 (verification, β + Commit D)
Captured 2026-05-12 evening on fresnel `linux-fresnel-fourier 7.0-1`, fork tip `70196f8`, backend SHA `2c6ff82cbdc156ff8910d0c7fe58e75eeecdfd6e6a1caabb049c8adf43a098b8`. β architecture survived empirical contact; Commit D fix-forward addressed the ffmpeg-vaapi-copy-passes-surfaces_count=0 surprise the first run revealed.
## Verdict
**PARTIAL PASS.** VP9 unblocked directly. MPEG-2 maintained. VP8 mechanically unblocked (decode runs, output diverges from kdirect). HEVC unchanged (separate kernel-rejection issue, pre-existing). H.264 unchanged (Bug 4 deferred as planned).
| Codec | Pre-β | Post-β + D | kdirect | sw | Verdict |
|---|---|---|---|---|---|
| H.264 1080p30 | `71ac099b…` keyframe partial | `71ac099b…` unchanged | `1e7a0bc9…` | `1e7a0bc9…` | **PARTIAL** — Bug 4 (deferred to iter6) |
| HEVC 720p | `06b2c5a0…` all-zero | `06b2c5a0…` still all-zero | `9340b832…` | `9340b832…` | **FAIL** — separate kernel-rejection issue |
| VP9 720p | `06b2c5a0…` all-zero | **`4f1565e8…`** | `4f1565e8…` | `4f1565e8…` | **PASS** ✓✓✓ (libva == kdirect == sw) |
| MPEG-2 720p | `19eefbf4…` worked | `19eefbf4…` maintained | `19eefbf4…` | `7be8cad7…` | **PASS** (libva == kdirect; HW≠SW is codec precision drift, not Bug 2) |
| VP8 720p | `06b2c5a0…` all-zero | `bcc57ed5…` real content but diverges | `136ce5cb…` | `136ce5cb…` | **PARTIAL** — decode runs, output mismatch (256 unique bytes, real keyframe content `93 8e 8a 89 …`) |
## Per Phase 1 lock criteria
1. **HEVC + VP9 + VP8 libva == kdirect == sw**: **PARTIAL — 1 of 3 (VP9 only).**
2. **MPEG-2 unchanged**: **PASS** (after Commit D fix-forward — pre-D was all-zero, post-D matches pre-iter5b).
3. **H.264 keyframe still decodes**: **PASS** (unchanged).
4. **Control-payload anchors hold**: **PASS** (β didn't touch control submission).
## What happened in two acts
### Act 1 — β alone: regression on MPEG-2, no progress elsewhere
Initial post-β sweep returned all-zero for every libva codec. Strace of HEVC: kernel decoder returned `DQBUF` with `V4L2_BUF_FLAG_ERROR` for every frame. Strace of MPEG-2 (post-β): kernel decoder succeeded (10 DQBUF, 0 ERROR) but the userspace read still got all-zero. The MPEG-2 success/userspace-zero combination pointed at a surface-state issue.
Phase 3 baseline cross-check showed:
- Pre-iter5b HEVC trace.11115+: every DQBUF had ERROR flag too — **pre-iter5b HEVC was already broken at the kernel level**, β didn't introduce this.
- Pre-iter5b MPEG-2 trace.11169: 0 ERROR, decode succeeded, userspace got real pixels — **pre-iter5b MPEG-2 worked end-to-end via libva**.
So β fixed the OUTPUT pixel format mismatch (cosmetic improvement for HEVC; correct format for VP9/VP8) but exposed an MPEG-2 regression. Root cause traced to `context.c:262` comment: `ffmpeg vaapi-copy passes surfaces_count==0 to vaCreateContext`. My β walk of `surfaces_ids[]` for the destination_* fill was therefore a no-op for vaapi-copy → `destination_planes_count` stayed 0 → `surface_bind_slot`'s `for (j=0; j<0; ...)` was a no-op → `destination_data[]` never assigned → `copy_surface_to_image` memcpy of 0 bytes from a NULL src.
### Act 2 — Commit D: cache format-uniform state in driver_data, walk surface_heap, lazy-fill on late CreateSurfaces2
Commit `70196f8` adds:
- `request.h` driver_data fields: `fmt_valid`, `fmt_planes_count`, `fmt_buffers_count`, `fmt_format_height`, `fmt_sizes[]`, `fmt_bytesperlines[]`.
- `surface.h/c`: new `surface_fill_format_uniform(driver_data, surface_object)` helper, idempotent on `destination_planes_count != 0`.
- `context.c::RequestCreateContext`: populates the cache after `v4l2_get_format(CAPTURE)`; walks `surface_heap` (not just `surfaces_ids[]`) to fill every existing surface.
- `context.c::RequestCreateContext`: lazy-fill via the helper after surface allocation.
- `context.c::RequestDestroyContext`: invalidates `fmt_valid = false`.
Post-D sweep: MPEG-2 restored, VP9 PASS, VP8 partial, HEVC unchanged.
## The remaining failures
### HEVC: pre-existing kernel-rejection issue
Pre-β HEVC trace.11115 (Phase 3 baseline, same kernel) showed 4 of 4 DQBUFs with `V4L2_BUF_FLAG_ERROR`. Post-β: same. The kernel rkvdec driver rejects HEVC decode for some reason independent of OUTPUT pixel format.
Per the iter2-close docs, HEVC was claimed PASS via transitive proof: backend control payload byte-matched kdirect's payload. The transitive proof verified controls were CORRECTLY SHAPED, not that the kernel ACCEPTED them. Kdirect submits the same shape and the kernel decodes; libva submits the same shape and the kernel returns ERROR. The difference must be in V4L2 protocol order, request_fd binding, sequence numbers, or some other contract aspect not captured in the control-payload bytes.
**Bug 5** (new): HEVC libva decode triggers `DQBUF` `V4L2_BUF_FLAG_ERROR` from rkvdec. Same kernel + same controls + same fixture works via kdirect. Difference is in the libva backend's V4L2 protocol behavior. Out of iter5b scope.
### VP8: partial — decode runs but output diverges
VP8 libva returns `bcc57ed5…` (4 147 200 bytes, 256 unique bytes, 74.8% zero, real keyframe content). Pre-β was all-zero `06b2c5a0…`. So β did unlock VP8 decode at some level. But the output bytes differ from kdirect's `136ce5cb…`.
Frame 1 first 16 bytes: `93 8e 8a 89 85 72 8c 6d 82 79 92 7e 80 80 80 80` — plausible BBB intro pixels (neutral grey-to-mid-tone). Frames 1, 2, 3 differ from each other (motion content present).
Likely candidates:
- Slot rotation: backend reads from one cap_pool slot while kernel wrote to another (similar pattern as H.264 inter-frame Bug 4).
- Partial fill: each frame's lower portion is zero (74.8% zero is a lot).
- Per-frame DPB binding issue.
**Bug 6** (new): VP8 libva decode produces non-zero but non-matching output. Out of iter5b scope.
## Architecture verdict
The β refactor delivered its core goal: **CreateSurfaces2 no longer touches V4L2 device state; CreateContext owns the OUTPUT-side lifecycle**. The Phase 7 α' failure mode is structurally eliminated — there's no in-CreateSurfaces2 destructive teardown branch to corrupt state mid-stream.
The 2-CRIT findings from Phase 5 v2 (CRIT-1 NULL guard, CRIT-2 missing `request_pool_destroy`) were correct and incorporated; without them, Phase 7 wouldn't have gotten past the first CreateContext.
Commit D was the surprise: the architecturally cleaner β created a new edge case (vaapi-copy late-surface flow) that pre-β didn't have because CreateSurfaces2's "fill destination_* immediately" pattern happened to handle it. The fix-forward restores the equivalent behavior via lazy-fill at both sites.
## What iter5b-β actually shipped
- **VP9 directly verifiable** (no more transitive-proof dependency). Campaign scoreboard: 5/5 with 1 newly direct + 4 mixed (1 prior direct MPEG-2, 1 keyframe-partial H.264, 2 partial/transitive VP8+HEVC).
- **MPEG-2 not regressed.**
- **H.264 not regressed.**
- **Bug 5** (HEVC kernel rejection) and **Bug 6** (VP8 partial output) exposed for iter6+ work.
- **Architecture cleaned up.** surface.c::CreateSurfaces2 is now ~70 LOC (was ~250). context.c::RequestCreateContext owns the OUTPUT lifecycle clearly.
## Substrate state at Phase 7 close
- Fork tip `70196f8` on noether + fresnel + gitea.
- Backend installed SHA `2c6ff82cbdc156ff8910d0c7fe58e75eeecdfd6e6a1caabb049c8adf43a098b8`.
- Kernel `linux-fresnel-fourier 7.0-1` (unchanged).
- Test fixtures unchanged.
- Phase 7 sweep artifacts at `/tmp/iter5b_p7v2/` on fresnel.
## Memory rules — candidates at iter5b-β close
The Phase 5 v2 reviewer's "before claiming no change needed, grep all call sites" discipline saved Phase 6 from CRIT-2 silently breaking the second CreateContext. Worth a memory rule:
**Proposed**: `feedback_grep_all_callsites_before_no_change_claim.md` — When a plan says "no change needed" for a code region, grep all symbols/functions/macros the change touches and enumerate the call sites. The author's surface read at Phase 4 v2 missed `request_pool_destroy` not being in DestroyContext; only the reviewer's enumeration caught it.
The Commit D fix-forward exposes a sibling rule: **trust but verify the iter6 comment**. The iter6 comment at context.c:262 explicitly named the ffmpeg-vaapi-copy surfaces_count=0 case. Phase 4 v2 plan acknowledged the comment but didn't trace the implication for the destination_* walk. Comment read → implication not derived.
Both rules can be folded into one: **plans that touch lifecycle code must enumerate ALL existing comments AND grep ALL call sites for affected symbols; "no change needed" claims must be backed by explicit enumeration**. Defer the memory rule write to Phase 8 close.
## Phase 7 → Phase 8 handoff
Phase 7 closes here. iter5b reaches partial-PASS:
- ✓ VP9 directly unblocked (the primary iter5b-β goal achieved for 1 of the 3 target codecs).
- ✓ MPEG-2 not regressed (Commit D fix-forward).
- ✓ H.264 not regressed.
- ✗ HEVC remains broken (Bug 5 — pre-existing, exposed).
- ✗ VP8 partial (Bug 6 — new, decode-runs-but-output-diverges).
The campaign scoreboard becomes "5/5 with 1 direct (MPEG-2) + 1 newly direct (VP9) + 3 mixed (H.264 keyframe-partial + VP8 partial + HEVC transitive-claimed-but-empirically-failing)."
Phase 8 close documents the iter5b-β shipped state and lists Bug 5 + Bug 6 + Bug 4 as iter6+ work targets. Memory rule update covers the empirical lessons.