Phase 3 Candidate K executed: H-D (slot rotation) ELIMINATED via instrumented bind+read site logging. Slot v4l2_index matches at BeginPicture and at vaGetImage for every surface; destination_data[0] matches slot->map[0]. No rotation mismatch. H-A/B/C/D all eliminated. H-E (kernel-side hantro VP8 partial-write) confirmed by elimination. The libva backend submits correct controls, correct slice bytes, correct slices_size, correct slot indices. Kernel writes erratic partial content (per-frame Y plane transitions at row 536, 24, ... — not a clean buffer-size truncation, not slot rotation). iter6 close PARTIAL: 5 of 6 Phase 1 criteria PASS; criterion 1 (libva_vp8 == kdirect) PARTIAL — kernel-side fix needed, out of iter6's locked backend-only scope. No patches landed. Fresnel substrate unchanged: fork tip 70196f8, backend SHA 2c6ff82c... (identical to iter5b-β close). Net deliverable: Phase 3 narrowing reduces Bug-6 hypothesis space from 5 to 1. Future iter7+ (or kernel-agent campaign) picks up the kernel-side investigation. Pattern recognized: iter2 HEVC transitive PASS masked Bug 5; iter3 VP8 transitive PASS masked Bug 6. Both surfaced under direct verification post-iter5b-β. Transitive proofs against ONE artifact (control payload) don't catch bugs in OTHER artifacts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iteration 6 — Phase 3 (empirical narrowing for Bug 6)
Captured 2026-05-12 evening. Bug 6 is narrowed to a single surviving hypothesis but not directly root-caused in this Phase 3 session. Four of five hypotheses (H-A through H-D) are eliminated; the remaining one, H-E, needs deeper kernel-side investigation.
Eliminations
H-A — slice data corruption: ELIMINATED
Instrumented picture.c::RequestEndPicture to dump surface_object->source_data[0..slices_size] to /tmp/iter6_slice_libva_N.bin right before QBUF on OUTPUT.
- Frame 0 (keyframe): 300614 bytes dumped.
- Frame 1: 417 bytes.
- Frame 2: 1122 bytes.
- ...
Extracted raw VP8 frames from the fixture (ffmpeg -i bbb_720p10s_vp8.webm -c:v copy -f rawvideo). Frame 0 starts with the 10-byte VP8 keyframe header (d0 1a 0b 9d 01 2a 00 05 d0 02) then the control-partition data.
SHA256 of raw frame 0 bytes 10..300624 = SHA256 of libva slice 0 dump:
9e74956c75388e8a8c5f4d8f747e2ac99801b5fef14fe890b708b9d0272e9407
Byte-identical. The libva backend submits the correct VP8 frame bytes (post-header, as expected for VAAPI's pre-parsed-coder-state submission convention).
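The H-A check above can be sketched as follows. This is a minimal illustration, not the script used in the session: the function names are hypothetical, and the inputs would be the contents of /tmp/vp8_raw.bin (frame 0) and /tmp/iter6_slice_libva_0.bin read as bytes. It assumes the 10-byte keyframe header is the only prefix to strip.

```python
import hashlib

# Assumption: keyframe header = 3-byte frame tag + 3-byte start code + 4 bytes of dimensions.
VP8_KEYFRAME_HEADER_LEN = 10


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def slice_matches_raw_frame(raw_frame: bytes, slice_dump: bytes,
                            header_len: int = VP8_KEYFRAME_HEADER_LEN) -> bool:
    """True if the slice dump equals the raw frame bytes that follow the header."""
    payload = raw_frame[header_len:header_len + len(slice_dump)]
    # A short payload means the dump is longer than the frame body: not a match.
    if len(payload) != len(slice_dump):
        return False
    return sha256_hex(payload) == sha256_hex(slice_dump)


# Synthetic stand-ins for the real files (hypothetical data, for illustration only).
fake_header = bytes.fromhex("d01a0b9d012a0005d002")
fake_body = bytes(range(256)) * 4
print(slice_matches_raw_frame(fake_header + fake_body, fake_body))
```

Comparing digests rather than the buffers directly mirrors how the result is recorded above; a direct byte comparison would work equally well.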
H-B — slices_size wrong on OUTPUT QBUF: ELIMINATED
Extracted from the control payload: fp_size = 22742, dct_part_sizes[0] = 277872. Expected slice data = 22742 + 277872 = 300614. Dumped slice size = 300614 exactly. slices_size is right.
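As a cross-check on the arithmetic, the first-partition size can also be read from the 3-byte VP8 frame tag at the start of the keyframe header quoted in H-A. The sketch below decodes those 10 bytes following the VP8 frame-tag layout (RFC 6386) and recomputes the expected slice size. The function names are hypothetical; the dct_part_sizes value is the one quoted above.

```python
def parse_vp8_keyframe_header(header: bytes) -> dict:
    """Decode the 10-byte VP8 keyframe header (frame tag, start code, dimensions)."""
    if len(header) < 10:
        raise ValueError("need at least 10 bytes")
    tag = header[0] | (header[1] << 8) | (header[2] << 16)
    if tag & 1:
        raise ValueError("not a key frame (frame_type bit is set)")
    if header[3:6] != b"\x9d\x01\x2a":
        raise ValueError("missing VP8 keyframe start code")
    return {
        "version": (tag >> 1) & 0x7,
        "show_frame": bool((tag >> 4) & 1),
        "first_part_size": tag >> 5,  # 19-bit size of the first partition
        "width": int.from_bytes(header[6:8], "little") & 0x3FFF,
        "height": int.from_bytes(header[8:10], "little") & 0x3FFF,
    }


def expected_slices_size(fp_size: int, dct_part_sizes: list) -> int:
    """Slice bytes expected on the OUTPUT queue: first partition plus all DCT partitions."""
    return fp_size + sum(dct_part_sizes)


HEADER = bytes.fromhex("d01a0b9d012a0005d002")  # bytes quoted in the H-A section
info = parse_vp8_keyframe_header(HEADER)
total = expected_slices_size(info["first_part_size"], [277872])
print(info["first_part_size"], info["width"], info["height"], total)
# prints: 22742 1280 720 300614
```

The frame tag yields the same 22742 as the fp_size extracted from the control payload, so the header, the control payload and the dumped slice size all agree.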
H-C — CAPTURE-side cache coherency: ELIMINATED (probable)
Added msync(MS_SYNC | MS_INVALIDATE) instrumentation before the copy_surface_to_image memcpy in image.c. msync returned EINVAL (page-alignment issue on V4L2 mmap addresses), so the invalidate never actually ran. The output hash was the same with and without the attempt, bcc57ed5c9021d02a3134949c6e483f13df22ff1f1dc0764097570fbcc4904e6 in both runs, which on its own is weak evidence.
Stronger argument against H-C: VP9 uses the same picture.c → image.c readback path and produces byte-identical output to kdirect. If cache coherency were the bug, VP9 would be broken too.
Control payload byte-equality: pre-eliminated at Phase 2
VP8 keyframe control payload byte-identical between libva and kdirect on the current substrate. Inter-frame payloads differ only in reference timestamps (libva: wall-clock ns; kdirect: small pts-derived; both internally consistent — kernel uses them as opaque keys to look up CAPTURE buffers).
Remaining hypotheses
H-D — CAPTURE slot rotation mismatch: ELIMINATED (user pick Candidate K, executed 2026-05-12)
Instrumented surface_bind_slot in surface.c and copy_surface_to_image in image.c to log slot indices and destination_data[] pointers. Re-ran VP8 sweep.
Empirical result (excerpt):
H-D bind: surface=0xaaab0111d630 slot=v4l2_index=0 dst_index=0 map[0]=0xffffa465e000
H-D bind: surface=0xaaab01122110 slot=v4l2_index=1 dst_index=1 map[0]=0xffffa450c000
H-D bind: surface=0xaaab01126bf0 slot=v4l2_index=2 dst_index=2 map[0]=0xffffa43ba000
H-D read: surface=0xaaab0111d630 dst_index=0 current_slot=… destination_data[0]=0xffffa465e000 destination_data[1]=0xffffa473f000
H-D read: surface=0xaaab01122110 dst_index=1 current_slot=… destination_data[0]=0xffffa450c000 destination_data[1]=0xffffa45ed000
H-D read: surface=0xaaab01126bf0 dst_index=2 current_slot=… destination_data[0]=0xffffa43ba000 destination_data[1]=0xffffa449b000
For each surface: the slot v4l2_index at surface_bind_slot (BeginPicture time) matches the dst_index at copy_surface_to_image (vaGetImage time). The destination_data[0] pointer matches slot->map[0] returned by cap_pool_acquire. No slot rotation mismatch. H-D eliminated.
(Bonus observation: cap_pool acquires slots in increasing index order — slot 0, 1, 2, 3, … through 12+ over a 3-frame decode. LRU semantics working as designed.)
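The bind/read comparison can be automated over the full log rather than checked by eye. The sketch below is a hypothetical helper, not part of the backend: it assumes the log lines keep exactly the format shown in the excerpt, remembers the most recent bind per surface, and reports any read whose dst_index or destination_data[0] differs from that bind.

```python
import re

BIND_RE = re.compile(
    r"H-D bind: surface=(0x[0-9a-f]+) slot=v4l2_index=(\d+) "
    r"dst_index=(\d+) map\[0\]=(0x[0-9a-f]+)"
)
READ_RE = re.compile(
    r"H-D read: surface=(0x[0-9a-f]+) dst_index=(\d+) "
    r".*?destination_data\[0\]=(0x[0-9a-f]+)"
)


def find_slot_mismatches(lines):
    """Return (surface, bound, read) tuples where a read disagrees with the last bind."""
    bound = {}       # surface -> (v4l2_index, map[0]) from the most recent bind
    mismatches = []
    for line in lines:
        m = BIND_RE.search(line)
        if m:
            surface, v4l2_index, _dst_index, map0 = m.groups()
            bound[surface] = (int(v4l2_index), map0)
            continue
        m = READ_RE.search(line)
        if m:
            surface, dst_index, data0 = m.groups()
            seen = (int(dst_index), data0)
            if bound.get(surface) != seen:
                mismatches.append((surface, bound.get(surface), seen))
    return mismatches


EXCERPT = [
    "H-D bind: surface=0xaaab0111d630 slot=v4l2_index=0 dst_index=0 map[0]=0xffffa465e000",
    "H-D bind: surface=0xaaab01122110 slot=v4l2_index=1 dst_index=1 map[0]=0xffffa450c000",
    "H-D bind: surface=0xaaab01126bf0 slot=v4l2_index=2 dst_index=2 map[0]=0xffffa43ba000",
    "H-D read: surface=0xaaab0111d630 dst_index=0 current_slot=… destination_data[0]=0xffffa465e000 destination_data[1]=0xffffa473f000",
    "H-D read: surface=0xaaab01122110 dst_index=1 current_slot=… destination_data[0]=0xffffa450c000 destination_data[1]=0xffffa45ed000",
    "H-D read: surface=0xaaab01126bf0 dst_index=2 current_slot=… destination_data[0]=0xffffa43ba000 destination_data[1]=0xffffa449b000",
]
print(find_slot_mismatches(EXCERPT))  # prints: []
```

A read with no preceding bind for the same surface is also reported, since its expected value is None.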
H-E — kernel-side hantro VP8 quirk: CONFIRMED by elimination of H-A/B/C/D
The output bytes show an erratic partial-write pattern:
| Frame | First fully-zero row (Y plane) | First fully-zero row (UV plane) | Real-content rows |
|---|---|---|---|
| 0 (keyframe) | 536 of 720 | 134 of 360 | 0..535 (Y) + 0..133 (UV) |
| 1 (inter) | 24 of 720 | (not measured) | 0..23 (Y) |
| 2 (inter) | (not measured) | (not measured) | — |
Per-frame transition rows differ (536, 24, …). Not a simple "first N rows decoded, rest zero" pattern with fixed N. Not a slot-rotation bug (would produce shifted real content, not partial-then-zero). Not a buffer-size truncation (would be a clean cutoff at a consistent row, not per-frame).
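The transition rows in the table can be measured with a small scan over the readback buffer. This is a minimal sketch under stated assumptions: the helper name is hypothetical, the buffer is NV12 with a stride equal to the 1280-pixel width, and the Y plane is 720 rows. The stride assumption is consistent with the log excerpt, where destination_data[1] sits 0xE1000 = 921600 = 1280 × 720 bytes after destination_data[0].

```python
WIDTH = 1280      # assumption: stride equals width (no row padding)
Y_ROWS = 720
UV_ROWS = 360     # NV12: interleaved UV plane has half as many rows


def first_zero_row(plane: bytes, stride: int, rows: int):
    """Return the index of the first row that is entirely zero, or None if there is none."""
    zero_row = bytes(stride)
    for row in range(rows):
        if plane[row * stride:(row + 1) * stride] == zero_row:
            return row
    return None


# Synthetic Y plane shaped like frame 0 in the table: 536 content rows, then zeros.
synthetic_y = b"\x80" * (WIDTH * 536) + bytes(WIDTH * (Y_ROWS - 536))
print(first_zero_row(synthetic_y, WIDTH, Y_ROWS))  # prints: 536
```

A genuinely black row of Y=0 content would also register as zero, so for real captures it may be safer to confirm that every row after the reported one is zero as well.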
Plausible H-E sub-hypotheses:
- The kernel decoder runs asynchronously and DQBUF returns before the kernel has finished writing all macroblock rows, so each frame stops at a different row depending on timing. Although the V4L2 spec says DQBUF blocks until VB2_BUF_STATE_DONE, the hantro VP8 path may have a bug where it signals DONE early.
- The kernel decoder is rate-limited or interrupted at varying macroblock counts by a kernel-internal scheduler or IRQ issue.
- vb2_dma_resv-style cache invalidation gap (the iter5-rejected RFC v2 patches addressed this for DMA-BUF-import; maybe also matters for the libva-MMAP-EXPBUF readback path despite Phase 5 iter5 analysis showing the fence doesn't reach the consumer).
H-E doesn't yield to a single small backend patch. Confirming or denying it would need kernel ftrace or an instrumented hantro driver.
What's confirmed
- iter5b-β fixed the OUTPUT pixel format bug (Bug 2 for VP8 specifically). Pre-β VP8 was all-zero because hantro substituted MPEG2_DECODER codec_mode. Post-β VP8 has VP8_FRAME OUTPUT format → kernel ACTUALLY dispatches to VP8 decoder → partial output (Bug 6).
- The libva backend's VP8 control bytes are correct (byte-identical to kdirect on the same hardware).
- The libva backend's slice data is correct (byte-identical to the raw VP8 bitstream post-header).
- slices_size (the OUTPUT QBUF bytesused) is correct (matches fp_size + sum(dct_part_sizes)).
What's NOT yet confirmed
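- Whether H-D (slot rotation) is happening: resolved by the Candidate K instrumentation; H-D is eliminated.
- Whether H-E (kernel-side partial-write) is happening: confirmed only by elimination of H-A/B/C/D; direct confirmation needs kernel-side investigation.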
- Whether the erratic per-frame transition rows have a discoverable pattern that points at a specific kernel bug.
Phase 4 candidates
Given Phase 3 narrowing, iter6 has multiple possible directions:
Candidate K — Continue H-D investigation (next session)
Add slot-index logging at cap_pool_acquire + image.c::copy_surface_to_image; run sweep; verify indices match. If they diverge → fix in backend's slot binding. If they match → H-D eliminated, proceed to H-E.
Estimated wallclock: 1-2 hours next session.
Candidate L — Move to H-E kernel-side investigation
Pivot to kernel-side ftrace, hantro source-read, possibly local kernel patches. Substantially heavier; aligns with the original iter5 Candidate B (kernel work) that user rejected at iter5b open.
Estimated wallclock: multi-session.
Candidate M — Park Bug 6, pick a different bug
Phase 3 narrowing established Bug 6 is a kernel-side partial-write issue, not a quick backend fix. Drop iter6 from Bug 6, switch to:
- Bug 4 (H.264 inter race-loss) — also kernel-related, but the iter4 prior work touched H.264 backend extensively so backend instrumentation is more familiar.
- Bug 5 (HEVC DQBUF FLAG_ERROR) — pre-existing kernel rejection; diff strace of libva vs kdirect HEVC.
- iter4-B1 (auto-detect device discrimination) — pure backend, ~100 LOC.
Candidate N — Document Bug 6 partial root cause, close iter6 PARTIAL
iter6 closes with: "Bug 6 narrowed but kernel-side, deferred to iter7+. iter6's Phase 3 work establishes H-A/B/C are NOT Bug 6's cause; H-D/E remain. Substrate state unchanged; no regression introduced."
iter6 hands off Bugs 4/5/6 to iter7+. Memory updates for the iter6 lesson on transitive-proof partial-coverage.
Decision point
User picked Candidate K. Phase 3 executed. H-D eliminated, H-E confirmed. iter6's Phase 1 locked scope was backend-only ("backend-side fix expected"). Bug 6 is kernel-side. iter6 closes PARTIAL — Phase 3 narrowing delivered as the iter6 contribution; Bug 6 fix deferred to iter7+ (would target kernel-side work, similar to original iter5 Candidate B scope).
Remaining user pick: which target for iter7.
Substrate state at iter6 Phase 3 close
- Fork tip 70196f8 (iter5b-β + Commit D). Unchanged.
- Backend installed SHA 2c6ff82cbdc156ff8910d0c7fe58e75eeecdfd6e6a1caabb049c8adf43a098b8. Phase 3 diagnostic instrumentation reverted.
- Kernel 7.0.0-fresnel-fourier. Unchanged.
- Phase 3 artifacts at fresnel: /tmp/iter5b_p7v2/, /tmp/iter6_slice_libva_*.bin, /tmp/vp8_raw.bin. Plus /tmp/vp8_libva_traces/ and /tmp/vp8_kdirect_traces/ on noether (strace captures + extract scripts).