# Iteration 6 — Phase 2 (situation analysis)

Captured 2026-05-12 evening, immediately after Phase 1 lock.

Empirical diff of VP8 control payloads (libva backend vs ffmpeg-v4l2request kdirect) ruled out the obvious "wrong fields in `v4l2_ctrl_vp8_frame`" hypothesis. Bug 6's root cause lives elsewhere.

## Byte-level pixel divergence (Phase 1 anchor recap)

Raw NV12-derived yuv420p, 3 frames @720p. Each cell counts the bytes that are equal between the libva and kdirect output for that plane:

| Frame | Y match | U match | V match |
|---|---|---|---|
| 0 (keyframe) | 3779/921600 (0.4%) | 2728/230400 (1.2%) | 3001/230400 (1.3%) |
| 1 (inter) | 93/921600 (0.0%) | 46/230400 (0.0%) | 127/230400 (0.1%) |
| 2 (inter) | 33/921600 (0.0%) | 1/230400 (0.0%) | 21/230400 (0.0%) |

- Frame 0 first 16 bytes (libva): `93 8e 8a 89 85 72 8c 6d 82 79 92 7e 80 80 80 80`, plausibly real decoded content.
- Frame 0 first 16 bytes (kdirect): different (real decoded content from the SAME bitstream).

Both decoded SOMETHING; the somethings disagree. The decoder ran (no DQBUF ERROR like HEVC) but produced bytes that don't match the same kernel's kdirect output for the same bitstream.

## Control-payload diff (libva vs kdirect on current substrate)

Method: strace both paths with `-x -s 99999`, extract every `VIDIOC_S_EXT_CTRLS` payload at `id=0xa409c8` (`V4L2_CID_STATELESS_VP8_FRAME`), unescape strace's hex string back to raw bytes, and byte-compare across payload indices.

Result (full table at `/tmp/vp8diff2.py`):

- Payload 0 + 1 (frame 0, keyframe; the TRY_EXT_CTRLS and the S_EXT_CTRLS for the same frame): **0 bytes differ**. `last_frame_ts = golden_frame_ts = alt_frame_ts = 0x0`, `key=True`, `fp_size=22742`, `fp_hb=6550`. **Byte-identical between libva and kdirect.**
- Payload 2+ (inter frames): **24 bytes differ at offsets 1200..1223**, exactly the three `__u64` reference timestamps (`last_frame_ts`, `golden_frame_ts`, `alt_frame_ts`). libva uses wall-clock ns from `gettimeofday(&surface_object->timestamp, NULL)` (e.g. `0x18aeea0df8c7c628`); kdirect uses small pts-derived values (e.g. `0x1388 = 5000`).

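
The byte comparison described in the method above can be sketched roughly as follows. This is an illustration only, not the contents of `/tmp/vp8diff2.py`. The helper names (`unescape_strace`, `diff_ranges`, `ref_timestamps`) are hypothetical, the code assumes strace printed every payload byte as a `\xNN` escape, and it assumes the three `__u64` fields are little-endian and start at the offset 1200 observed in this analysis.

```python
import struct

# Offset observed in this analysis (1200..1223); an assumption, not read from a header.
REF_TS_OFFSET = 1200
REF_TS_NAMES = ("last_frame_ts", "golden_frame_ts", "alt_frame_ts")


def unescape_strace(text):
    """Convert a strace string body such as '\\x01\\xff' into raw bytes.

    Assumes the whole string is in \\xNN form; anything else raises
    ValueError instead of being silently mis-decoded.
    """
    if len(text) % 4 != 0:
        raise ValueError("payload is not a sequence of \\xNN escapes")
    out = bytearray()
    for i in range(0, len(text), 4):
        chunk = text[i:i + 4]
        if not chunk.startswith("\\x"):
            raise ValueError("unexpected token at %d: %r" % (i, chunk))
        out.append(int(chunk[2:], 16))
    return bytes(out)


def diff_ranges(a, b):
    """Return inclusive (start, end) offset ranges where a and b differ."""
    if len(a) != len(b):
        raise ValueError("payload lengths differ: %d vs %d" % (len(a), len(b)))
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:
        ranges.append((start, len(a) - 1))
    return ranges


def ref_timestamps(payload, offset=REF_TS_OFFSET):
    """Decode the three reference timestamps as little-endian u64 values."""
    values = struct.unpack_from("<3Q", payload, offset)
    return dict(zip(REF_TS_NAMES, values))
```

Run against the same inter-frame payload index from each trace, `diff_ranges` should report the differing span as offset ranges, and `ref_timestamps` makes it easy to see which timestamp domain each path uses.
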
Both timestamp domains (libva's wall-clock values and kdirect's pts-derived values) are valid as far as the kernel is concerned. The kernel uses the timestamps only as keys to look up CAPTURE buffers, so absolute values don't matter as long as they're internally consistent: the same value must be used at the producer-side OUTPUT QBUF when the reference frame was decoded and in the consumer-side reference control when that reference is read.

**Verdict**: control payloads for keyframes are byte-identical. For inter frames the only diff is the reference timestamps (different domains, both internally consistent). Bug 6 is NOT in `vp8.c::vp8_set_controls`; the control bytes are correct.

## Where does the divergence live?

Hypothesis space narrowed:

### H-A — Slice data corruption in libva path

`picture.c::codec_store_buffer` at line 78 does `memcpy(surface_object->source_data + slices_size, buffer_object->data, buffer_object->size * buffer_object->count)`. If `source_data` points to the wrong slot, or `slices_size` is wrong, or the memcpy size is wrong, the OUTPUT buffer contents don't match what kdirect submits.

Pre-iter5b VP8 ran through this same code path and produced all-zero output (because of the OUTPUT format mismatch that iter5b fixed). Post-β VP8 runs the same code path with the correct format, so any pre-existing slice-data bug surfaces only now.

Diagnostic step: dump the OUTPUT buffer contents right before QBUF and compare to what kdirect writes. If the bytes differ, the slice-data path is the bug.

### H-B — `slices_size` (bytesused on OUTPUT QBUF) wrong

`picture.c::RequestEndPicture` lines 462-464:

```c
rc = v4l2_queue_buffer(driver_data->video_fd, request_fd, output_type,
                       &surface_object->timestamp, surface_object->source_index,
                       surface_object->slices_size, 1);
```

`slices_size` is the bytesused for OUTPUT. An earlier HEVC strace showed `bytesused=0` in the top-level `v4l2_buffer` field, but strace doesn't show `m.planes[].bytesused` for MPLANE buffers. So strace alone cannot tell us whether `slices_size` is being set correctly on the wire.

Diagnostic step: instrument the `v4l2_queue_buffer` call site to log `slices_size` per frame, OR use a kernel ftrace event to see what bytesused the kernel sees on the OUTPUT queue.

### H-C — Cache coherency on the OUTPUT bitstream side

The CPU writes slice data to the mmap'd OUTPUT buffer; the kernel reads it via DMA. On RK3399 with non-coherent DMA (memory `feedback_rockchip_pixel_verify_path.md` documents the cache-mmap weirdness), if no cache flush happens before QBUF, the kernel sees stale buffer contents.

Diagnostic step: add an explicit `msync()` with `MS_SYNC` on the OUTPUT buffer before QBUF and see if the output changes.

### H-D — CAPTURE-side slot rotation

`cap_pool` acquires a slot at `BeginPicture` and binds `destination_data` via `surface_bind_slot`. If the slot index passed to QBUF (`surface_object->destination_index`) doesn't match the slot whose mmap is bound to `destination_data`, the readback reads the WRONG slot.

Diagnostic step: log the slot index per frame at QBUF time AND at `copy_surface_to_image` time.

### H-E — kernel-side hantro VP8 quirk we don't understand yet

Some kernel-side handling we haven't traced.

## Why H-A / H-B / H-C / H-D didn't surface pre-iter5b

Pre-β VP8 produced all-zero output because the OUTPUT pixel format was H264_SLICE (not VP8_FRAME). Hantro substituted MPEG2_DECODER codec_mode (per the `feedback_unconditional_codec_state.md` analysis). The kernel never actually attempted VP8 decode. So no slice-data bug, bytesused bug, cache bug, or slot bug could surface: the kernel-side dispatch was wrong before any of those mattered.

Post-β fixed the format → kernel now actually attempts VP8 decode → underlying bugs become visible.

## iter3 transitive-proof reframing

iter3 Phase 5 C1+C2 verified that specific control fields (`first_part_size`, `first_part_header_bits`) matched kdirect. iter3's "PASS via transitive proof" was technically correct for what it verified. It did NOT verify slice-data bytes, bytesused-on-the-wire, cache flush, or CAPTURE slot rotation.
Bug 6 lives in one of those, invisible to iter3's proof structure. This is the same shape as iter2's HEVC "PASS via transitive proof" missing Bug 5.

The pattern: transitive proofs against ONE artifact (control payload) don't catch bugs in OTHER artifacts (slice data, ioctl ordering, cache state). Future "transitive PASS" claims should explicitly enumerate which artifacts are NOT proven equivalent.

## Phase 3 plan (empirical narrowing)

iter6 Phase 3 (the next phase, since Phase 2 is wrapping here) will narrow which of H-A through H-E is Bug 6:

1. **Dump libva's OUTPUT-buffer slice-data bytes right before QBUF**. Compare to kdirect's bytes (extractable from the kdirect strace). If different → H-A confirmed.
2. **Log `slices_size` per frame**. Compare to the expected size (from VAAPI's `buffer_object->size * count` plus accumulation). Verify it matches the `first_part_size` + `dct_part_sizes` total. If it doesn't → H-B.
3. **Insert an `msync(source_data, slices_size, MS_SYNC)` before QBUF** (argument order is address, length, flags). If output changes → H-C.
4. **Log `surface_object->destination_index` vs `surface_object->current_slot->v4l2_index`** at each `cap_pool_acquire` and at `copy_surface_to_image` time. If they ever diverge → H-D.
5. If none of 1-4 reveal the bug → H-E and deeper kernel-side debugging.

iter6 Phase 3 is empirical investigation; Phase 4 is the plan based on findings.

## Memory rules touched

- `feedback_review_empirical_over_theoretical.md` (Direction 2): re-verify before declaring. The Phase 2 author hypothesized "wrong vp8.c control field"; the empirical diff disproved it. The hypothesis pivoted to slice-data / bytesused / cache / slot rotation.
- `feedback_trace_fix_mechanism_to_consumer.md`: trace the chain producer→primitive→consumer-read-site. For Bug 6 the consumer is "the kernel's decode path reading the OUTPUT bitstream"; the producer is "libva backend writing slice data to the OUTPUT mmap." The trace must continue from libva's memcpy through to the kernel DMA read.

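
For the producer end of that trace, the per-frame comparison behind Phase 3 steps 1 and 2 might be scripted along these lines. It is a minimal sketch under stated assumptions: the OUTPUT buffer contents for each frame have already been captured as bytes from both paths, and the function name `classify_frame` and its verdict strings are hypothetical, not part of the backend or any existing script.

```python
def classify_frame(libva, kdirect):
    """Compare one frame's OUTPUT-buffer bytes from both paths.

    Returns (verdict, first_differing_offset). The offset is None when
    the bytes agree over the length the two dumps have in common.
    """
    common = min(len(libva), len(kdirect))
    first_diff = next(
        (i for i in range(common) if libva[i] != kdirect[i]), None
    )
    if len(libva) != len(kdirect):
        # Different lengths point toward the bytesused hypothesis (H-B).
        return ("size-mismatch", first_diff)
    if first_diff is not None:
        # Same length, different bytes points toward slice data (H-A).
        return ("content-mismatch", first_diff)
    # Identical dumps leave H-C, H-D and H-E open.
    return ("identical", None)
```

Note that an `identical` verdict only covers what the CPU sees in the mapping. It cannot rule out H-C, because the bytes the hardware reads over DMA may still differ from the dumped ones.
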
## Phase 5 review hand-off (when Phase 4 lands)

Phase 5 review of Bug 6's fix plan must:

1. Verify the diagnostic that picked H-A/B/C/D/E was empirical (run the test, observe).
2. Verify the fix patches the right site (the artifact that diagnostic identified).
3. Verify regression-free for the other 4 codecs.

## Phase 4 anticipated

Once H-A through H-E is narrowed to ONE root cause, Phase 4 writes a small patch (likely 5-50 LOC) against that site. Phase 6 lands the patch. Phase 7 re-runs sweep.

## Substrate state at iter6 Phase 2 close

- Fork tip `70196f8` on noether + fresnel + gitea.
- Backend installed at `/usr/lib/dri/v4l2_request_drv_video.so` SHA `2c6ff82c…`.
- Kernel `linux-fresnel-fourier 7.0-1`.
- Phase 1 VP8-related anchors at `/tmp/iter5b_p7v2/`, `/tmp/vp8_libva_traces/`, `/tmp/vp8_kdirect_traces/`. Phase 3 artifacts will accumulate here.
- VP9, MPEG-2, H.264 keyframe-partial, HEVC all-zero status unchanged from iter5b-β close.