Files
fresnel-fourier/phase7_iter4_verification.md
marfrit d87c940788 iter4 Phase 7: criterion 1+2+3 PASS, criterion 4+5 FAIL — three bug classes identified
Verification on linux-fresnel-fourier 7.0-1:

PASS:
- Criterion 1: vainfo enumerates VAProfileVP9Profile0 via auto-detect.
- Criterion 2: vaCreateConfig SUCCESS (implicit).
- Criterion 3: ffmpeg-vaapi VP9 5-frame decode exit 0 at 0.307x, no
  ioctl errors.

FAIL — three distinguishable bug classes:

Bug 1 (VP9-specific, my Clause 6 parser):
  Strace of frame-1 keyframe FRAME control vs Phase 3 anchor:
  - byte 8 (lf.flags): mine=0x01 (DELTA_ENABLED only) vs ref=0x03
    (ENABLED|UPDATE).
  - byte 16 (base_q_idx): mine=0x41 (65) vs ref=0x2e (46).
  - byte 17 (delta_q_y_dc): mine=8 vs ref=0.
  Bit-trace shows my parser is 2 bits ahead of correct position by
  the time it reaches lf_delta_enabled. Fix path: faithful port of
  FFmpeg vp9.c::decode_frame_header.

Bug 2 (substrate-wide, cap_pool readback):
  Constant RGB(0, 0x4c, 0) "0x4c gray" pattern across all codecs
  (VP9, HEVC, MPEG-2, VP8). H.264 keyframe DOES read correctly with
  real RGB(0, 0xe3, 0) content; H.264 inter frames revert to 0x4c.
  Kernel decode succeeds (Phase 3 strace + ffmpeg-v4l2request
  standalone confirm). libva readback returns cap_pool init scratch.
  Sibling of iter3 dma_resv blocker but with different signature
  (constant 0x4c instead of all-zero 0x00).

Bug 3 (hantro UAPI drift):
  MPEG-2 + VP8 produce kernel "Unable to set control(s): Invalid
  argument" errors. UAPI struct sizes/fields likely shifted between
  6.19.9 and 7.0 (sibling of Phase 3 VP9 struct-size correction
  144/1947 -> 168/2040).

Three loopback options proposed (decision pending user):
- A: VP9-only fix (Clause 6 parser); accept Bug 2/3 as substrate
     pre-existing; criterion 4 transitive-only per iter3.
- B: Full loopback covering all 3 bugs; possibly requires kernel
     patches (vb2_dma_resv RFC v2).
- C: Phase 0 reset; substrate is the primary issue; pause iter4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 07:20:51 +00:00

121 lines
8.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 4 — Phase 7 (verification)
Verification of iter4 Commits Z+A+B+C against the 5 Phase 1 success criteria. Captured 2026-05-10 09:0009:10 CEST on fresnel `linux-fresnel-fourier 7.0-1`.
**Verdict**: criterion 1, 2, 3 PASS. **Criteria 4 and 5 FAIL** with two distinct root causes that combine into a Phase 7 → Phase 4 (or possibly Phase 0) loopback.
## Summary table
| Criterion | Test | Result |
|---|---|---|
| 1 | vainfo enumerates VAProfileVP9Profile0 | **PASS** (auto-detect rkvdec) |
| 2 | vaCreateConfig SUCCESS | **PASS** (implicit via 3) |
| 3 | ffmpeg-vaapi VP9 decode exit 0 | **PASS** (5 frames at 0.307x, no errors) |
| 4 | HW=SW byte-identical | **FAIL** (two issues, see below) |
| 5 | 4-codec regression | **FAIL** (substrate-wide cap_pool-readback regression) |
Fork tip: `beaa914` (iter4 Commit C). Backend SHA256: `f2ff6598...`.
## Criterion 4 — VP9 HW=SW
**Test**: `ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -vf hwdownload bbb_720p10s_vp9.webm -frames:v 5 hw_%04d.png` → SHA256 vs Phase 3 SW reference PNGs.
**Result**: All 5 HW PNGs have the SAME hash `93dd9db51385...`. Pixels are constant `RGB(0, 0x4c, 0)` across all frames. Phase 3 SW reference has varied content `RGB(0x70, 0x6d, 0x55)` etc.
**Conclusion**: HW path is producing constant-fill scratch instead of decoded video. Two contributing root causes identified:
### Bug 1 — Clause 6 uncompressed-header parser bit-misalignment (VP9-specific)
Strace of my backend's submission vs Phase 3 anchor (frame-1 keyframe):
```
lf struct (16 B) quant struct (8 B)
mine: 01 00 ff ff 00 00 03 00 01 00 ... 41 08 00 08 00 00 00 00
Phase 3 anchor: 01 00 ff ff 00 00 03 00 03 00 ... 2e 00 00 00 00 00 00 00
Match: bytes 07 (lf prefix is correct, level=3, sharpness=0).
Diverge byte 8 (lf.flags): 0x01 (DELTA_ENABLED only) vs 0x03 (ENABLED|UPDATE).
Diverge byte 16 (base_q_idx): 0x41 (65) vs 0x2e (46).
Diverge byte 17 (delta_q_y_dc): 8 vs 0.
```
The parser is **2 bits ahead** by the time it reaches `lf_delta_enabled`. Bit-trace of the BBB keyframe bitstream (`82 49 83 42 40 4f f0 2c f6 06 38 ...`) shows the correct bit positions for these fields per VP9 spec section 6.2; my parser's output corresponds to reading them at positions +2 of the correct positions.
The 2-bit drift origin is somewhere in the keyframe color_config / refresh_frame_context section. Likely candidates:
- `if (1) /* color_space != CS_RGB */` shortcut may be reading wrong bits when `color_space == CS_RGB`.
- The conditional refresh_frame_context / parallel_decoding_mode reading is off-spec for some path.
- An off-by-one in the keyframe `else` (inter) branch may carry over.
The wrong `base_q_idx=65` produces garbage dequantization → kernel decodes incorrectly. This alone would invalidate VP9 decode output even if the readback path were perfect.
**Fix path**: rewrite Clause 6 as a faithful port of FFmpeg `vp9.c::decode_frame_header` lines 540700 (the canonical parser). ~150 LOC, replacing my current ~120 LOC partial port. Validate by extracting Phase 3 anchor's verbatim payload bits and confirming byte-by-byte match.
### Bug 2 — Cap-pool readback returns scratch buffer (substrate-wide)
The constant `RGB(0, 0x4c, 0)` pattern appears for **all 4 already-shipping codecs** as well, not just VP9 (see Criterion 5 below). This is a `linux-fresnel-fourier 7.0-1`-substrate-level issue, not VP9-specific. Even if Bug 1 is fixed, criterion 4 will still fail without resolving Bug 2.
**Symptom**: kernel decodes successfully (Phase 3's strace + ffmpeg-v4l2request standalone test confirm decode succeeds), but our libva backend's cap_pool readback returns the buffer init pattern, not the kernel's decoded pixels. The pages may be cached or the slot may be unbound.
This is sibling to the iter3 dma_resv issue but with a different signature (constant `0x4c` instead of all-zero `0x00`). Documented in memory `reference_dmabuf_resv_blocker.md`; the new substrate may have made it worse, broader, or simply different.
## Criterion 5 — 4-codec regression block
`ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -vf hwdownload` for each codec, 3 frames each, SHA256 of resulting PNGs.
| Codec | Driver | Frame 1 hash | Frames 2-3 hash | Verdict |
|---|---|---|---|---|
| H.264 | rkvdec auto | `50ca48b6...` (real `RGB(0, 0xe3, 0)` content) | `7c70fdda...` (stuck `RGB(0, 0x4c, 0)`) | **PARTIAL** — keyframe decodes, inter frames stuck |
| HEVC | rkvdec auto | `85243d2c...` (constant `RGB(0, 0x4c, 0)`) | same | **FAIL** |
| MPEG-2 | hantro env | `93dd9db5...` (same constant as VP9) | same | **FAIL** + kernel `Unable to set control(s)` errors |
| VP8 | hantro env | `f8103ae1...` (constant) | same | **FAIL** + kernel `Unable to set control(s)` errors |
Hash `93dd9db5...` appears across VP9 and MPEG-2 — confirms shared cap_pool init pattern (`0x4c` = decimal 76).
**Cap-pool readback regression**: H.264 keyframe DOES read correctly, indicating the readback path partially works. Inter frames return scratch — the cap_pool slot rotation between QBUF cycles isn't being driven by the kernel's actual decoded frame data. This is the sibling of Bug 2 above.
**Hantro `Unable to set control(s)` errors**: a kernel-side rejection on hantro for MPEG-2/VP8. Substrate change appears to have shifted hantro's expected control structure or fields; iter1 (MPEG-2) and iter3 (VP8) were tested on 6.19.9 — UAPI likely drifted between 6.19.9 and 7.0 the same way VP9 did (Phase 3 baseline showed VP9 struct sizes 168/2040 differ from Phase 2's 6.19.9-based estimates 144/1947).
## Distinguishing the two bug classes
**Bug 1 (VP9 parser)** — pure iter4 work; localized to `src/vp9.c::vp9_parse_uncompressed_header_lf_quant`. Doesn't affect H.264/HEVC/MPEG-2/VP8.
**Bug 2 (cap-pool readback)** — substrate-wide; affects ALL codecs on `linux-fresnel-fourier 7.0-1`. Was masked on `linux-eos-arm 6.19.9-99` because rkvdec's cap_pool worked there; iter3 hit the hantro variant via dma_resv. Now broader.
**Bug 3 (hantro UAPI drift)** — substrate-related; affects MPEG-2 + VP8 specifically with kernel-rejection ioctl errors. Probably struct-size or new-required-field in the kernel UAPI for hantro stateless controls. Distinct from Bug 2's silent-pass-but-wrong-buffer pattern.
## Path forward — three options
### Option A — VP9-only Phase 4 loopback for Bug 1 alone
Fix Clause 6 parser. Re-run Phase 6 + Phase 7. Accept Bug 2 as "criterion-4 transitive proof only" per iter3 precedent (memory `reference_dmabuf_resv_blocker.md`). Accept Bug 3 as substrate-induced regression of iter1+iter3 (file as new memory entry, defer fix).
**Pro**: fastest path to a "VP9 controls correct" demonstration; preserves iter4's nominal scope.
**Con**: criterion 4 stays effectively-not-direct; campaign scoreboard slips from "5/5 direct" to "5/5 transitive". Bug 2 + Bug 3 are *campaign-wide* issues that this iter shouldn't paper over.
### Option B — Full Phase 4 loopback covering Bugs 1 + 2 + 3
Fix Clause 6 parser. Investigate Bug 2 (cap-pool readback regression) — likely requires kernel-side patch (vb2_dma_resv RFC v2 per memory `reference_fresnel_kernel_substrate.md`) OR libva-side cap_pool refactor. Investigate Bug 3 (hantro UAPI drift) — re-run iter1 + iter3 strace baselines on 7.0 to see what changed.
**Pro**: addresses root causes; restores 4-codec block; criterion 4 direct.
**Con**: scope expansion, possibly substantial. Bug 2 may require landing kernel patches that aren't ready yet (RFC v1 rejected, v2 in design).
### Option C — Phase 0 reset: investigate substrate as the primary issue
Step back from iter4. Document that 6.19 → 7.0 substrate change has broken the campaign's Phase 1 baseline assumptions. Choose one of:
- Roll fresnel back to a kernel that has the working cap_pool path (NOT 6.19.9 since decommissioned, but possibly some intermediate kernel before whatever regressed).
- Land kernel patches in `kernel-agent` to fix Bug 2 + Bug 3 substrate issues, rebuild `linux-fresnel-fourier`, retest from iter1 baseline.
**Pro**: addresses the real underlying issue. iter4 + iter1 + iter2 + iter3 all become re-anchorable.
**Con**: large scope. Multi-week. iter4 specifically gets paused.
## Substrate state at Phase 7 close
- Fork at iter4 Commit C tip `beaa914` (Phase 6 build clean, Criterion 1+2+3 PASS).
- Phase 7 captures persisted: `fresnel:/tmp/iter4_phase7/` and `noether:~/src/fresnel-fourier/iter4_phase7.tgz`.
- Phase 4 plan ready for amendment after option-decision.
- Memory rules carry forward; no new memory entries added in Phase 7 (the findings are iteration-specific, not cross-cutting yet).
## Decision point
Pick A / B / C before Phase 4 loopback or campaign re-orientation.