iter4 Phase 7: criterion 1+2+3 PASS, criterion 4+5 FAIL — three bug classes identified

Verification on linux-fresnel-fourier 7.0-1:

PASS:
- Criterion 1: vainfo enumerates VAProfileVP9Profile0 via auto-detect.
- Criterion 2: vaCreateConfig SUCCESS (implicit).
- Criterion 3: ffmpeg-vaapi VP9 5-frame decode exit 0 at 0.307x, no
  ioctl errors.

FAIL — three distinguishable bug classes:

Bug 1 (VP9-specific, my Clause 6 parser):
  Strace of frame-1 keyframe FRAME control vs Phase 3 anchor:
  - byte 8 (lf.flags): mine=0x01 (DELTA_ENABLED only) vs ref=0x03
    (ENABLED|UPDATE).
  - byte 16 (base_q_idx): mine=0x41 (65) vs ref=0x2e (46).
  - byte 17 (delta_q_y_dc): mine=8 vs ref=0.
  Bit-trace shows my parser is 2 bits ahead of correct position by
  the time it reaches lf_delta_enabled. Fix path: faithful port of
  FFmpeg vp9.c::decode_frame_header.

Bug 2 (substrate-wide, cap_pool readback):
  Constant RGB(0, 0x4c, 0) "0x4c gray" pattern across all codecs
  (VP9, HEVC, MPEG-2, VP8). H.264 keyframe DOES read correctly with
  real RGB(0, 0xe3, 0) content; H.264 inter frames revert to 0x4c.
  Kernel decode succeeds (Phase 3 strace + ffmpeg-v4l2request
  standalone confirm). libva readback returns cap_pool init scratch.
  Sibling of iter3 dma_resv blocker but with different signature
  (constant 0x4c instead of all-zero 0x00).

Bug 3 (hantro UAPI drift):
  MPEG-2 + VP8 produce kernel "Unable to set control(s): Invalid
  argument" errors. UAPI struct sizes/fields likely shifted between
  6.19.9 and 7.0 (sibling of Phase 3 VP9 struct-size correction
  144/1947 -> 168/2040).

Three loopback options proposed (decision pending user):
- A: VP9-only fix (Clause 6 parser); accept Bug 2/3 as substrate
     pre-existing; criterion 4 transitive-only per iter3.
- B: Full loopback covering all 3 bugs; possibly requires kernel
     patches (vb2_dma_resv RFC v2).
- C: Phase 0 reset; substrate is the primary issue; pause iter4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-10 07:20:51 +00:00
parent 42b9ec333a
commit d87c940788
2 changed files with 120 additions and 0 deletions
BIN
View File
Binary file not shown.
+120
View File
@@ -0,0 +1,120 @@
# Iteration 4 — Phase 7 (verification)
Verification of iter4 Commits Z+A+B+C against the 5 Phase 1 success criteria. Captured 2026-05-10 09:0009:10 CEST on fresnel `linux-fresnel-fourier 7.0-1`.
**Verdict**: criterion 1, 2, 3 PASS. **Criteria 4 and 5 FAIL** with two distinct root causes that combine into a Phase 7 → Phase 4 (or possibly Phase 0) loopback.
## Summary table
| Criterion | Test | Result |
|---|---|---|
| 1 | vainfo enumerates VAProfileVP9Profile0 | **PASS** (auto-detect rkvdec) |
| 2 | vaCreateConfig SUCCESS | **PASS** (implicit via 3) |
| 3 | ffmpeg-vaapi VP9 decode exit 0 | **PASS** (5 frames at 0.307x, no errors) |
| 4 | HW=SW byte-identical | **FAIL** (two issues, see below) |
| 5 | 4-codec regression | **FAIL** (substrate-wide cap_pool-readback regression) |
Fork tip: `beaa914` (iter4 Commit C). Backend SHA256: `f2ff6598...`.
## Criterion 4 — VP9 HW=SW
**Test**: `ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -vf hwdownload bbb_720p10s_vp9.webm -frames:v 5 hw_%04d.png` → SHA256 vs Phase 3 SW reference PNGs.
**Result**: All 5 HW PNGs have the SAME hash `93dd9db51385...`. Pixels are constant `RGB(0, 0x4c, 0)` across all frames. Phase 3 SW reference has varied content `RGB(0x70, 0x6d, 0x55)` etc.
**Conclusion**: HW path is producing constant-fill scratch instead of decoded video. Two contributing root causes identified:
### Bug 1 — Clause 6 uncompressed-header parser bit-misalignment (VP9-specific)
Strace of my backend's submission vs Phase 3 anchor (frame-1 keyframe):
```
lf struct (16 B) quant struct (8 B)
mine: 01 00 ff ff 00 00 03 00 01 00 ... 41 08 00 08 00 00 00 00
Phase 3 anchor: 01 00 ff ff 00 00 03 00 03 00 ... 2e 00 00 00 00 00 00 00
Match: bytes 07 (lf prefix is correct, level=3, sharpness=0).
Diverge byte 8 (lf.flags): 0x01 (DELTA_ENABLED only) vs 0x03 (ENABLED|UPDATE).
Diverge byte 16 (base_q_idx): 0x41 (65) vs 0x2e (46).
Diverge byte 17 (delta_q_y_dc): 8 vs 0.
```
The parser is **2 bits ahead** by the time it reaches `lf_delta_enabled`. Bit-trace of the BBB keyframe bitstream (`82 49 83 42 40 4f f0 2c f6 06 38 ...`) shows the correct bit positions for these fields per VP9 spec section 6.2; my parser's output corresponds to reading them at positions +2 of the correct positions.
The 2-bit drift origin is somewhere in the keyframe color_config / refresh_frame_context section. Likely candidates:
- `if (1) /* color_space != CS_RGB */` shortcut may be reading wrong bits when `color_space == CS_RGB`.
- The conditional refresh_frame_context / parallel_decoding_mode reading is off-spec for some path.
- An off-by-one in the keyframe `else` (inter) branch may carry over.
The wrong `base_q_idx=65` produces garbage dequantization → kernel decodes incorrectly. This alone would invalidate VP9 decode output even if the readback path were perfect.
**Fix path**: rewrite Clause 6 as a faithful port of FFmpeg `vp9.c::decode_frame_header` lines 540700 (the canonical parser). ~150 LOC, replacing my current ~120 LOC partial port. Validate by extracting Phase 3 anchor's verbatim payload bits and confirming byte-by-byte match.
### Bug 2 — Cap-pool readback returns scratch buffer (substrate-wide)
The constant `RGB(0, 0x4c, 0)` pattern appears for **all 4 already-shipping codecs** as well, not just VP9 (see Criterion 5 below). This is a `linux-fresnel-fourier 7.0-1`-substrate-level issue, not VP9-specific. Even if Bug 1 is fixed, criterion 4 will still fail without resolving Bug 2.
**Symptom**: kernel decodes successfully (Phase 3's strace + ffmpeg-v4l2request standalone test confirm decode succeeds), but our libva backend's cap_pool readback returns the buffer init pattern, not the kernel's decoded pixels. The pages may be cached or the slot may be unbound.
This is sibling to the iter3 dma_resv issue but with a different signature (constant `0x4c` instead of all-zero `0x00`). Documented in memory `reference_dmabuf_resv_blocker.md`; the new substrate may have made it worse, broader, or simply different.
## Criterion 5 — 4-codec regression block
`ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -vf hwdownload` for each codec, 3 frames each, SHA256 of resulting PNGs.
| Codec | Driver | Frame 1 hash | Frames 2-3 hash | Verdict |
|---|---|---|---|---|
| H.264 | rkvdec auto | `50ca48b6...` (real `RGB(0, 0xe3, 0)` content) | `7c70fdda...` (stuck `RGB(0, 0x4c, 0)`) | **PARTIAL** — keyframe decodes, inter frames stuck |
| HEVC | rkvdec auto | `85243d2c...` (constant `RGB(0, 0x4c, 0)`) | same | **FAIL** |
| MPEG-2 | hantro env | `93dd9db5...` (same constant as VP9) | same | **FAIL** + kernel `Unable to set control(s)` errors |
| VP8 | hantro env | `f8103ae1...` (constant) | same | **FAIL** + kernel `Unable to set control(s)` errors |
Hash `93dd9db5...` appears across VP9 and MPEG-2 — confirms shared cap_pool init pattern (`0x4c` = decimal 76).
**Cap-pool readback regression**: H.264 keyframe DOES read correctly, indicating the readback path partially works. Inter frames return scratch — the cap_pool slot rotation between QBUF cycles isn't being driven by the kernel's actual decoded frame data. This is the sibling of Bug 2 above.
**Hantro `Unable to set control(s)` errors**: a kernel-side rejection on hantro for MPEG-2/VP8. Substrate change appears to have shifted hantro's expected control structure or fields; iter1 (MPEG-2) and iter3 (VP8) were tested on 6.19.9 — UAPI likely drifted between 6.19.9 and 7.0 the same way VP9 did (Phase 3 baseline showed VP9 struct sizes 168/2040 differ from Phase 2's 6.19.9-based estimates 144/1947).
## Distinguishing the two bug classes
**Bug 1 (VP9 parser)** — pure iter4 work; localized to `src/vp9.c::vp9_parse_uncompressed_header_lf_quant`. Doesn't affect H.264/HEVC/MPEG-2/VP8.
**Bug 2 (cap-pool readback)** — substrate-wide; affects ALL codecs on `linux-fresnel-fourier 7.0-1`. Was masked on `linux-eos-arm 6.19.9-99` because rkvdec's cap_pool worked there; iter3 hit the hantro variant via dma_resv. Now broader.
**Bug 3 (hantro UAPI drift)** — substrate-related; affects MPEG-2 + VP8 specifically with kernel-rejection ioctl errors. Probably struct-size or new-required-field in the kernel UAPI for hantro stateless controls. Distinct from Bug 2's silent-pass-but-wrong-buffer pattern.
## Path forward — three options
### Option A — VP9-only Phase 4 loopback for Bug 1 alone
Fix Clause 6 parser. Re-run Phase 6 + Phase 7. Accept Bug 2 as "criterion-4 transitive proof only" per iter3 precedent (memory `reference_dmabuf_resv_blocker.md`). Accept Bug 3 as substrate-induced regression of iter1+iter3 (file as new memory entry, defer fix).
**Pro**: fastest path to a "VP9 controls correct" demonstration; preserves iter4's nominal scope.
**Con**: criterion 4 stays effectively-not-direct; campaign scoreboard slips from "5/5 direct" to "5/5 transitive". Bug 2 + Bug 3 are *campaign-wide* issues that this iter shouldn't paper over.
### Option B — Full Phase 4 loopback covering Bugs 1 + 2 + 3
Fix Clause 6 parser. Investigate Bug 2 (cap-pool readback regression) — likely requires kernel-side patch (vb2_dma_resv RFC v2 per memory `reference_fresnel_kernel_substrate.md`) OR libva-side cap_pool refactor. Investigate Bug 3 (hantro UAPI drift) — re-run iter1 + iter3 strace baselines on 7.0 to see what changed.
**Pro**: addresses root causes; restores 4-codec block; criterion 4 direct.
**Con**: scope expansion, possibly substantial. Bug 2 may require landing kernel patches that aren't ready yet (RFC v1 rejected, v2 in design).
### Option C — Phase 0 reset: investigate substrate as the primary issue
Step back from iter4. Document that 6.19 → 7.0 substrate change has broken the campaign's Phase 1 baseline assumptions. Choose one of:
- Roll fresnel back to a kernel that has the working cap_pool path (NOT 6.19.9 since decommissioned, but possibly some intermediate kernel before whatever regressed).
- Land kernel patches in `kernel-agent` to fix Bug 2 + Bug 3 substrate issues, rebuild `linux-fresnel-fourier`, retest from iter1 baseline.
**Pro**: addresses the real underlying issue. iter4 + iter1 + iter2 + iter3 all become re-anchorable.
**Con**: large scope. Multi-week. iter4 specifically gets paused.
## Substrate state at Phase 7 close
- Fork at iter4 Commit C tip `beaa914` (Phase 6 build clean, Criterion 1+2+3 PASS).
- Phase 7 captures persisted: `fresnel:/tmp/iter4_phase7/` and `noether:~/src/fresnel-fourier/iter4_phase7.tgz`.
- Phase 4 plan ready for amendment after option-decision.
- Memory rules carry forward; no new memory entries added in Phase 7 (the findings are iteration-specific, not cross-cutting yet).
## Decision point
Pick A / B / C before Phase 4 loopback or campaign re-orientation.