# Iteration 8 — Phase 5 Review Reviewer: Sonnet-grade architect (independent read). Reviewed 2026-05-13. Documents read in order: phase0_findings_iter8.md, phase2_iter8_situation.md, phase3_iter8_findings.md, phase4_iter8_plan.md. Source files examined: `src/surface.c`, `src/surface.h`, `src/request.h`, `src/cap_pool.h`, `src/context.c` (lines 130–225), `src/video.h`, `src/utils.c`. --- ## Critical findings (CRIT) ### CRIT-1 — `request_log` prepends a prefix on EVERY call; the inner-loop hex dump will not produce a readable hex line **Empirically verified.** `utils.c:37`: `fprintf(stderr, "%s: ", V4L2_REQUEST_STR_VENDOR)` fires on every `request_log(...)` call before the format string. The plan's dump loop (phase4 lines 62–65): ```c for (unsigned int i = 0; i < 32; i++) request_log("%02x ", d[i]); request_log("\n"); ``` will emit 32 separate stderr lines, each prefixed with `"v4l2-request: "`: ``` v4l2-request: 81 v4l2-request: 81 v4l2-request: 80 ... ``` This is not the intended single hex line. The newline call at the end also emits a standalone `"v4l2-request: \n"` prefix line. The dump is still technically readable if you strip the prefix from each line and join them, but in practice when the output is captured from a multi-threaded ffmpeg run (which also emits its own stderr lines), it becomes nearly impossible to associate individual hex bytes with their plane/frame context. **Severity: CRIT** because the primary purpose of γ is empirical disambiguation. If the dump output is unreadable it can produce a false negative — the author scans the log, sees the prefix lines but can't reconstruct the hex, and draws wrong conclusions about plane content. **Fix**: Write the full hex line to a local `char buf[128]` with `snprintf`, then emit it with a single `request_log("%s\n", buf)`. Alternatively, call `fprintf(stderr, ...)` directly inside the loop (no prefix) and emit one `request_log(...)` per logical output line. Either approach is ~5 LOC delta from the plan sketch. --- ### CRIT-2 — γ placement: the plan says "after line 388 (`cap_pool_mark_decoded`)" but the surrounding guard `if (surface_object->current_slot != NULL)` matters **Empirically verified against `surface.c` lines 387–389.** ```c if (surface_object->current_slot != NULL) cap_pool_mark_decoded(&driver_data->capture_pool, surface_object->current_slot); ``` The plan inserts the dump after `cap_pool_mark_decoded`. At that point: - `destination_data[p]` is still valid: it was populated at `BeginPicture` → `surface_bind_slot` and the slot has not been released (slot state is now DECODED, not FREE). Correct. - `destination_planes_count` is also valid. Correct. However the plan's sketch uses `surface_object->destination_planes_count` and `destination_data[p]` outside any null-guard. If the dump fires on an error path where `current_slot == NULL` (i.e., the `if` above was not entered), `destination_data[p]` may hold a stale or null pointer from a prior decode cycle or from `RequestCreateSurfaces2`'s zeroed state. The plan's env-var check is outside the `if (surface_object->current_slot != NULL)` guard. **Fix**: Place the dump inside the same `if (surface_object->current_slot != NULL)` guard, or move it directly below `cap_pool_mark_decoded` with an explicit null-check on `destination_data[0]`. This is a 2-line change to the sketch. --- ## Important findings (IMP) ### IMP-1 — The bug redefinition (phase 3): "stale data from prior run" as alternative explanation for the 16×32 patch is not fully eliminated Phase 3 asserts (line 143): "Bug 4 is NOT primarily an inter-frame race. Even the keyframe (frame 1, IDR) fails to decode fully." The supporting evidence (line 30): the 16×32 patch values cluster around 0x7c..0x83 (luma neutral, smooth gradients). Phase 3 explicitly states these are "real decoded image data." Phase 3 also notes (line 30): "kdirect frame 1 fills the entire Y plane with proper Y values; the 0x7c-0x83 cluster in libva is NOT in kdirect's frame 1 (kdirect has Y≈0x10 black throughout)." This means the 16×32 patch content does NOT match frame 1's actual content. Phase 3 acknowledges this in scenario A (line 124): "the kernel decoded one tile then aborted." But there is a second-order alternative that phase 3 does not fully close: **the 512 bytes are stale from a previous decode run's CAPTURE buffer content that was never zeroed before the current run reused the slot.** Specifically: - `cap_pool_init` mmaps the CAPTURE buffers. If the kernel driver (rkvdec) doesn't zero-initialize the physical pages on REQBUFS, the pages may hold data from a previous use of the same physical memory. - BBB 1080p intro frames have luma values around 0x10 (near-black). But other content — including previous test runs of a different test clip — may have had neutral-gray luma in the 0x7c-0x83 range. - The 16×32 region is exactly one macroblock row × 2 rows = a stripe that rkvdec's hardware decode engine commonly writes first (tile/MB-order). If the 16×32 patch is stale residue rather than partial decode output, the interpretation changes: the kernel may have written NOTHING for H.264 frame 1 via libva, not "started then aborted." The bug would then be "kernel rejects silently, CAPTURE buffer never touched" — same as HEVC's all-zero (Bug 5) but with 512 bytes of pre-existing stale bytes. **How to distinguish**: γ partially addresses this — if the dump shows the 512 bytes immediately after DQBUF, it would at least confirm the timing (the data is there at DQBUF time, not injected later). But a stale-residue theory predicts the dump also shows those 512 bytes. The definitive test would be to deliberately zero the CAPTURE buffer (with `memset(slot->map[0], 0, slot->map_lengths[0])`) after `cap_pool_acquire` and before QBUF, then run the H.264 sweep. If the 16×32 patch disappears → stale residue. If it persists → kernel really writes it. **This is an author re-verification candidate.** The plan should note this alternative in Phase 7 execution: add one `memset` run to discriminate. ### IMP-2 — γ vs α-first: the plan's reasoning ("three distinct signatures cannot be explained without empirical narrowing") is sound but the actual cost of α is understated The plan characterizes the γ-vs-α choice as: "three distinct signatures (VP9 correct, HEVC zeroed, H.264 partial-leak) cannot be explained by a single root cause hypothesis without empirical narrowing." This is partially correct. However, the three-signature argument does NOT apply to the specific α candidates identified in phase3: - **SPS.constraint_set_flags** (libva=0, kdirect=2): this is a per-codec change (H.264 only, in `h264_set_controls`). Changing it risks nothing for VP9/HEVC/MPEG-2/VP8. - **DECODE_PARAMS.dpb[].bottom_field_order_cnt**: similarly H.264-specific. Neither change touches a shared code path. The three-signature argument is an argument against a shared-path fix — not against an H.264-specific 10-LOC fix. The `feedback_unconditional_codec_state.md` rule is satisfied by profile-gating; both α candidates are already gated to H.264. A counter-argument for γ-first remains: even if α's candidates are correctly identified, if the kernel didn't write anything (stale-residue scenario per IMP-1), fixing control fields changes the wire protocol but still leaves a silent-reject. γ tells us whether the kernel writes before we guess which field to fix. Given the 16×32 ambiguity, γ-first is defensible. But the plan should acknowledge this tradeoff explicitly rather than citing the three- signature argument, which is weaker than it appears. **Recommendation**: keep γ-first but update the rationale in Phase 4 (or note it here for Phase 6 drafting). ### IMP-3 — Plan says "add to line 388 (cap_pool_mark_decoded)" but does not say WHERE in RequestSyncSurface to put the dump relative to `status = VA_STATUS_SUCCESS` and `goto complete` Looking at `surface.c` lines 387–394: ```c if (surface_object->current_slot != NULL) cap_pool_mark_decoded(&driver_data->capture_pool, surface_object->current_slot); surface_object->status = VASurfaceDisplaying; status = VA_STATUS_SUCCESS; goto complete; ``` The plan should insert the dump between `cap_pool_mark_decoded` and `surface_object->status = VASurfaceDisplaying` — i.e., still on the success path only, before the goto. This is unambiguous once you read the code but the plan sketch doesn't reference the surrounding context, leaving Phase 6 implementer to decide. ### IMP-4 — Phase 4 predicts 3 outcomes for γ but the "stale slot" scenario maps cleanly to a 4th diagnostic state that the dump can't distinguish alone The plan's three-outcome table (phase4 lines 92-96): | Dump shows | Interpretation | |---|---| | plane[0] only 16×32 populated | Kernel didn't write | | plane[0] fully populated | libva mis-reads | | plane[0] populated with other content | Slot binding wrong | Scenario "plane[0] only 16×32 populated" is assigned to "Kernel didn't write." But per IMP-1, it could also be "Kernel wrote NOTHING and 16×32 is stale residue from prior kernel use." These require different follow-up: "Kernel didn't write" → control-fill fix. "Stale residue" → the control fix may still be correct, but the first-run memset experiment should be done first to correctly classify. The plan should add a row: | plane[0] non-zero 16×32 only, VP9 dump is fully populated | Inconclusive (stale vs partial-write); run memset-zero-before-QBUF experiment | --- ## Minor findings (MIN) ### MIN-1 — env-var caching uses file-scope statics; these would be shared across concurrent contexts if libva ever has two active contexts in the same process `surface.c` is compiled into a single shared library loaded once per process. File-scope `static bool dump_env_checked` is process-global. For the diagnostic use case this is fine (we want one env check per process). Just note it so future reviewers don't flag it as a bug. ### MIN-2 — `sz > 1024 ? 1024 : sz` in the non-zero-scan loop: for 1920×1088 NV12 plane[0], `sz = 1920×1088 = 2,088,960`. Sampling only the first 1024 bytes is intentional for log volume but may completely miss whether the kernel wrote content that's not in the first 1024 bytes For H.264 1080p in FRAME_BASED mode, rkvdec writes output MB-order or tile- order. If the kernel writes the first MB row (bytes 0..30720 for 1920-wide) but not the rest, the 1024-byte sample catches it. If the kernel writes only the LAST macroblock row (pathological), the 1024-byte scan reports zero non-zero bytes and misclassifies the dump. Suggest extending the sample to cover at minimum one full MB row: `min(sz, 1920 * 16) = 30720` for H.264 1080p. This is MIN because the leading hypothesis has content at the TOP-LEFT (the 16×32 patch is at offset 0), so the 1024-byte scan would catch it. But make the note for future runs with different content. ### MIN-3 — `surface_id` in the dump log is the VAAPI surface ID (integer offset into object_heap), not a human-readable frame number. The log line `"CAPTURE buffer dump for surface %d"` will print something like `67108864` (0x4000000 SURFACE_ID_OFFSET). This is not wrong, just potentially confusing. Adding `surface_object->destination_index` (already in the plan) is the cleaner identifier for correlating with strace DQBUF logs. --- ## Author re-verification list Per `feedback_review_empirical_over_theoretical.md`: the following are CLAIMS by this reviewer that the author should empirically verify before incorporating into an amended plan. 1. **CRIT-1 (request_log prefix)**: Verify by running any existing call site where `request_log` is called in a loop and inspecting stderr. The claim (each call emits its own `"v4l2-request: "` prefix) follows directly from `utils.c:37`, but if there is a buffered-output path or a different compile guard in the installed binary, the observation may differ. Cheap test: `grep -A3 "request_log" src/utils.c` and read `utils.c:33-42` (which I did — see source read above). 2. **CRIT-2 (null guard on destination_data)**: Verify by tracing when `destination_data[0]` can be null at the dump insertion point. `surface.c` `RequestCreateSurfaces2:207` zero-inits `params`, `slices_count`, etc. but `destination_data[]` is not explicitly zero-inited in that function; it relies on the zero-init from `object_heap_allocate`. Whether `object_heap_allocate` zero-fills depends on the heap implementation. Author should grep `object_heap_allocate` to confirm zero-fill behavior before relying on `destination_data[0] == NULL` as a safe null-check. 3. **IMP-1 (stale-residue alternative)**: The claim that the kernel may not zero-initialize rkvdec CAPTURE buffers on REQBUFS is based on typical DMA-coherent allocator behavior on RK3399. On some kernels, dma_alloc_* does zero physical pages at allocation time. The author should verify on fresnel by running a memset-before-QBUF experiment (described in IMP-1) rather than accepting either interpretation theoretically. 4. **context.c:201–205 (destination_sizes for planes>0)**: Reviewer reads `destination_sizes[1] = destination_sizes[0] / 2` for single-buffer NV12. This is the UV plane size = half the Y plane size (NV12 chroma is half- height half-width interleaved). Verify that this matches what `v4l2_get_format` returns for CAPTURE sizeimage on rkvdec H.264 1080p — the G_FMT sizeimage may include alignment padding (e.g. `4177920 = 1920 × 1088 × 3/2` without padding). If `destination_sizes[0]` is set from `bytesperlines[0] * format_height = 1920 × 1088 = 2,088,960`, then `destination_sizes[1] = 1,044,480`. But if the kernel reports `sizeimage = 4177920` as a single plane total, the split may differ. Confirm the specific value in the γ dump output for plane[1].sz. --- ## Amendments to Phase 6 These are the mechanical changes to apply when implementing γ: **Amendment 1 (CRIT-1 fix)**: Replace the inner hex-dump loops with buffered `snprintf` output emitted in a single `request_log` call: ```c /* Replace the per-byte request_log calls with: */ char hexbuf[128]; int pos = 0; for (unsigned int i = 0; i < 32 && i < sz; i++) pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, "%02x ", d[i]); request_log(" plane[%u] hex[0..32]: %s\n", p, hexbuf); pos = 0; if (sz >= 32) { for (unsigned int i = 0; i < 32; i++) pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, "%02x ", d[sz - 32 + i]); request_log(" plane[%u] tail hex[%zu..%zu]: %s\n", p, sz - 32, sz - 1, hexbuf); } ``` **Amendment 2 (CRIT-2 fix)**: Wrap the entire dump block in the existing null guard: ```c if (surface_object->current_slot != NULL && dump_env != NULL && dump_env[0] == '1') { /* dump block here */ } ``` Or equivalently, place the dump immediately INSIDE the existing `if (surface_object->current_slot != NULL)` block in `RequestSyncSurface`, after the `cap_pool_mark_decoded` call. **Amendment 3 (IMP-1 / experiment)**: For Phase 7 execution, add a second test run with an explicit memset-zero of the CAPTURE slot immediately after `cap_pool_acquire` in `picture.c::RequestBeginPicture` (or add a diagnostic `memset` in `cap_pool_init`). Compare H.264 frame 1 output before and after the memset. If the 16×32 patch disappears → stale-residue; adjust IMP outcome table accordingly. **Amendment 4 (IMP-3 / placement)**: Phase 6 implementer should place the dump block between `cap_pool_mark_decoded(...)` and `surface_object->status = VASurfaceDisplaying` — i.e., on the happy-path only, not inside an error label. Do not place it after `goto complete`. **Amendment 5 (MIN-2 / scan window)**: Change the non-zero scan limit from 1024 to `min(sz, (unsigned int)(destination_bytesperlines[0] * 16))` for plane 0 (covers one full macroblock row), and from 1024 to `min(sz, 1024)` for plane 1 (UV is smaller, 1024 is fine). This requires access to `surface_object->destination_bytesperlines[0]` inside the dump block, which is available in scope. --- ## Summary Two CRIT issues, both in the γ implementation sketch rather than in the strategy choice. The γ-first rationale is defensible but IMP-1 (stale-residue alternative) adds a fourth diagnostic state that the plan's three-outcome table doesn't handle. The fixes are all small (Amendment 1 is the most important — broken hex formatting would make the dump useless). No plan-level strategy change is required; the γ-then-α path stands.