# Iteration 8 — Phase 5 Review

Reviewer: Sonnet-grade architect (independent read). Reviewed 2026-05-13.

Documents read in order: phase0_findings_iter8.md, phase2_iter8_situation.md,
phase3_iter8_findings.md, phase4_iter8_plan.md.

Source files examined: `src/surface.c`, `src/surface.h`, `src/request.h`,
`src/cap_pool.h`, `src/context.c` (lines 130–225), `src/video.h`,
`src/utils.c`.

---

## Critical findings (CRIT)

### CRIT-1 — `request_log` prepends a prefix on EVERY call; the inner-loop hex dump will not produce a readable hex line

**Empirically verified.**

`utils.c:37`: `fprintf(stderr, "%s: ", V4L2_REQUEST_STR_VENDOR)` fires on every
`request_log(...)` call before the format string. The plan's dump loop
(phase4 lines 62–65):

```c
for (unsigned int i = 0; i < 32; i++)
    request_log("%02x ", d[i]);
request_log("\n");
```

will emit 32 separate stderr lines, each prefixed with `"v4l2-request: "`:

```
v4l2-request: 81
v4l2-request: 81
v4l2-request: 80
...
```

This is not the intended single hex line. The newline call at the end also
emits a standalone `"v4l2-request: \n"` prefix line. The dump is still
technically readable if you strip the prefix from each line and join them,
but in practice when the output is captured from a multi-threaded ffmpeg run
(which also emits its own stderr lines), it becomes nearly impossible to
associate individual hex bytes with their plane/frame context.

**Severity: CRIT** because the primary purpose of γ is empirical
disambiguation. If the dump output is unreadable it can produce a false
negative — the author scans the log, sees the prefix lines but can't
reconstruct the hex, and draws wrong conclusions about plane content.

**Fix**: Write the full hex line to a local `char buf[128]` with `snprintf`,
then emit it with a single `request_log("%s\n", buf)`. Alternatively, call
`fprintf(stderr, ...)` directly inside the loop (no prefix) and emit one
`request_log(...)` per logical output line. Either approach is ~5 LOC delta
from the plan sketch.

---

### CRIT-2 — γ placement: the plan says "after line 388 (`cap_pool_mark_decoded`)" but the surrounding guard `if (surface_object->current_slot != NULL)` matters

**Empirically verified against `surface.c` lines 387–389.**

```c
if (surface_object->current_slot != NULL)
    cap_pool_mark_decoded(&driver_data->capture_pool,
                          surface_object->current_slot);
```

The plan inserts the dump after `cap_pool_mark_decoded`. At that point:

- `destination_data[p]` is still valid: it was populated at `BeginPicture` →
  `surface_bind_slot` and the slot has not been released (slot state is now
  DECODED, not FREE). Correct.
- `destination_planes_count` is also valid. Correct.

However the plan's sketch uses `surface_object->destination_planes_count` and
`destination_data[p]` outside any null-guard. If the dump fires on an error
path where `current_slot == NULL` (i.e., the `if` above was not entered),
`destination_data[p]` may hold a stale or null pointer from a prior decode
cycle or from `RequestCreateSurfaces2`'s zeroed state. The plan's env-var
check is outside the `if (surface_object->current_slot != NULL)` guard.

**Fix**: Place the dump inside the same `if (surface_object->current_slot != NULL)` guard, or move it directly below `cap_pool_mark_decoded` with an
explicit null-check on `destination_data[0]`. This is a 2-line change to the
sketch.

---

## Important findings (IMP)

### IMP-1 — The bug redefinition (phase 3): "stale data from prior run" as alternative explanation for the 16×32 patch is not fully eliminated

Phase 3 asserts (line 143): "Bug 4 is NOT primarily an inter-frame race.
Even the keyframe (frame 1, IDR) fails to decode fully."

The supporting evidence (line 30): the 16×32 patch values cluster around
0x7c..0x83 (luma neutral, smooth gradients). Phase 3 explicitly states these
are "real decoded image data."

Phase 3 also notes (line 30): "kdirect frame 1 fills the entire Y plane with
proper Y values; the 0x7c-0x83 cluster in libva is NOT in kdirect's frame 1
(kdirect has Y≈0x10 black throughout)."

This means the 16×32 patch content does NOT match frame 1's actual content.
Phase 3 acknowledges this in scenario A (line 124): "the kernel decoded one
tile then aborted." But there is a second-order alternative that phase 3 does
not fully close: **the 512 bytes are stale from a previous decode run's
CAPTURE buffer content that was never zeroed before the current run reused
the slot.** Specifically:

- `cap_pool_init` mmaps the CAPTURE buffers. If the kernel driver (rkvdec)
  doesn't zero-initialize the physical pages on REQBUFS, the pages may hold
  data from a previous use of the same physical memory.
- BBB 1080p intro frames have luma values around 0x10 (near-black). But
  other content — including previous test runs of a different test clip — may
  have had neutral-gray luma in the 0x7c-0x83 range.
- The 16×32 region is exactly one macroblock row × 2 rows = a stripe that
  rkvdec's hardware decode engine commonly writes first (tile/MB-order).

If the 16×32 patch is stale residue rather than partial decode output, the
interpretation changes: the kernel may have written NOTHING for H.264 frame 1
via libva, not "started then aborted." The bug would then be "kernel rejects
silently, CAPTURE buffer never touched" — same as HEVC's all-zero (Bug 5) but
with 512 bytes of pre-existing stale bytes.

**How to distinguish**: γ partially addresses this — if the dump shows the 512
bytes immediately after DQBUF, it would at least confirm the timing (the data
is there at DQBUF time, not injected later). But a stale-residue theory
predicts the dump also shows those 512 bytes. The definitive test would be to
deliberately zero the CAPTURE buffer (with `memset(slot->map[0], 0,
slot->map_lengths[0])`) after `cap_pool_acquire` and before QBUF, then run
the H.264 sweep. If the 16×32 patch disappears → stale residue. If it
persists → kernel really writes it.

**This is an author re-verification candidate.** The plan should note this
alternative in Phase 7 execution: add one `memset` run to discriminate.

### IMP-2 — γ vs α-first: the plan's reasoning ("three distinct signatures
cannot be explained without empirical narrowing") is sound but the actual
cost of α is understated

The plan characterizes the γ-vs-α choice as: "three distinct signatures
(VP9 correct, HEVC zeroed, H.264 partial-leak) cannot be explained by a
single root cause hypothesis without empirical narrowing."

This is partially correct. However, the three-signature argument does NOT
apply to the specific α candidates identified in phase3:

- **SPS.constraint_set_flags** (libva=0, kdirect=2): this is a per-codec
  change (H.264 only, in `h264_set_controls`). Changing it risks nothing for
  VP9/HEVC/MPEG-2/VP8.
- **DECODE_PARAMS.dpb[].bottom_field_order_cnt**: similarly H.264-specific.

Neither change touches a shared code path. The three-signature argument is
an argument against a shared-path fix — not against an H.264-specific
10-LOC fix. The `feedback_unconditional_codec_state.md` rule is satisfied by
profile-gating; both α candidates are already gated to H.264.

A counter-argument for γ-first remains: even if α's candidates are correctly
identified, if the kernel didn't write anything (stale-residue scenario per
IMP-1), fixing control fields changes the wire protocol but still leaves a
silent-reject. γ tells us whether the kernel writes before we guess which
field to fix. Given the 16×32 ambiguity, γ-first is defensible. But the plan
should acknowledge this tradeoff explicitly rather than citing the three-
signature argument, which is weaker than it appears.

**Recommendation**: keep γ-first but update the rationale in Phase 4 (or
note it here for Phase 6 drafting).

### IMP-3 — Plan says "add to line 388 (cap_pool_mark_decoded)" but does not
say WHERE in RequestSyncSurface to put the dump relative to `status = VA_STATUS_SUCCESS` and `goto complete`

Looking at `surface.c` lines 387–394:

```c
if (surface_object->current_slot != NULL)
    cap_pool_mark_decoded(&driver_data->capture_pool,
                          surface_object->current_slot);

surface_object->status = VASurfaceDisplaying;
status = VA_STATUS_SUCCESS;
goto complete;
```

The plan should insert the dump between `cap_pool_mark_decoded` and
`surface_object->status = VASurfaceDisplaying` — i.e., still on the success
path only, before the goto. This is unambiguous once you read the code but
the plan sketch doesn't reference the surrounding context, leaving Phase 6
implementer to decide.

### IMP-4 — Phase 4 predicts 3 outcomes for γ but the "stale slot" scenario
maps cleanly to a 4th diagnostic state that the dump can't distinguish alone

The plan's three-outcome table (phase4 lines 92-96):

| Dump shows | Interpretation |
|---|---|
| plane[0] only 16×32 populated | Kernel didn't write |
| plane[0] fully populated | libva mis-reads |
| plane[0] populated with other content | Slot binding wrong |

Scenario "plane[0] only 16×32 populated" is assigned to "Kernel didn't
write." But per IMP-1, it could also be "Kernel wrote NOTHING and 16×32 is
stale residue from prior kernel use." These require different follow-up:
"Kernel didn't write" → control-fill fix. "Stale residue" → the control fix
may still be correct, but the first-run memset experiment should be done
first to correctly classify.

The plan should add a row:

| plane[0] non-zero 16×32 only, VP9 dump is fully populated | Inconclusive (stale vs partial-write); run memset-zero-before-QBUF experiment |

---

## Minor findings (MIN)

### MIN-1 — env-var caching uses file-scope statics; these would be shared
across concurrent contexts if libva ever has two active contexts in the same
process

`surface.c` is compiled into a single shared library loaded once per process.
File-scope `static bool dump_env_checked` is process-global. For the
diagnostic use case this is fine (we want one env check per process). Just
note it so future reviewers don't flag it as a bug.

### MIN-2 — `sz > 1024 ? 1024 : sz` in the non-zero-scan loop: for 1920×1088
NV12 plane[0], `sz = 1920×1088 = 2,088,960`. Sampling only the first 1024
bytes is intentional for log volume but may completely miss whether the
kernel wrote content that's not in the first 1024 bytes

For H.264 1080p in FRAME_BASED mode, rkvdec writes output MB-order or tile-
order. If the kernel writes the first MB row (bytes 0..30720 for 1920-wide)
but not the rest, the 1024-byte sample catches it. If the kernel writes only
the LAST macroblock row (pathological), the 1024-byte scan reports zero
non-zero bytes and misclassifies the dump. Suggest extending the sample to
cover at minimum one full MB row: `min(sz, 1920 * 16) = 30720` for H.264
1080p.

This is MIN because the leading hypothesis has content at the TOP-LEFT (the
16×32 patch is at offset 0), so the 1024-byte scan would catch it. But make
the note for future runs with different content.

### MIN-3 — `surface_id` in the dump log is the VAAPI surface ID (integer
offset into object_heap), not a human-readable frame number. The log line
`"CAPTURE buffer dump for surface %d"` will print something like `67108864`
(0x4000000 SURFACE_ID_OFFSET). This is not wrong, just potentially confusing.
Adding `surface_object->destination_index` (already in the plan) is the
cleaner identifier for correlating with strace DQBUF logs.

---

## Author re-verification list

Per `feedback_review_empirical_over_theoretical.md`: the following are CLAIMS
by this reviewer that the author should empirically verify before incorporating
into an amended plan.

1. **CRIT-1 (request_log prefix)**: Verify by running any existing call site
   where `request_log` is called in a loop and inspecting stderr. The claim
   (each call emits its own `"v4l2-request: "` prefix) follows directly from
   `utils.c:37`, but if there is a buffered-output path or a different compile
   guard in the installed binary, the observation may differ. Cheap test:
   `grep -A3 "request_log" src/utils.c` and read `utils.c:33-42` (which I did
   — see source read above).

2. **CRIT-2 (null guard on destination_data)**: Verify by tracing when
   `destination_data[0]` can be null at the dump insertion point. `surface.c`
   `RequestCreateSurfaces2:207` zero-inits `params`, `slices_count`, etc. but
   `destination_data[]` is not explicitly zero-inited in that function; it
   relies on the zero-init from `object_heap_allocate`. Whether
   `object_heap_allocate` zero-fills depends on the heap implementation.
   Author should grep `object_heap_allocate` to confirm zero-fill behavior
   before relying on `destination_data[0] == NULL` as a safe null-check.

3. **IMP-1 (stale-residue alternative)**: The claim that the kernel may not
   zero-initialize rkvdec CAPTURE buffers on REQBUFS is based on typical
   DMA-coherent allocator behavior on RK3399. On some kernels, dma_alloc_*
   does zero physical pages at allocation time. The author should verify on
   fresnel by running a memset-before-QBUF experiment (described in IMP-1)
   rather than accepting either interpretation theoretically.

4. **context.c:201–205 (destination_sizes for planes>0)**: Reviewer reads
   `destination_sizes[1] = destination_sizes[0] / 2` for single-buffer NV12.
   This is the UV plane size = half the Y plane size (NV12 chroma is half-
   height half-width interleaved). Verify that this matches what
   `v4l2_get_format` returns for CAPTURE sizeimage on rkvdec H.264 1080p —
   the G_FMT sizeimage may include alignment padding (e.g. `4177920 = 1920 ×
   1088 × 3/2` without padding). If `destination_sizes[0]` is set from
   `bytesperlines[0] * format_height = 1920 × 1088 = 2,088,960`, then
   `destination_sizes[1] = 1,044,480`. But if the kernel reports
   `sizeimage = 4177920` as a single plane total, the split may differ.
   Confirm the specific value in the γ dump output for plane[1].sz.

---

## Amendments to Phase 6

These are the mechanical changes to apply when implementing γ:

**Amendment 1 (CRIT-1 fix)**: Replace the inner hex-dump loops with
buffered `snprintf` output emitted in a single `request_log` call:

```c
/* Replace the per-byte request_log calls with: */
char hexbuf[128];
int pos = 0;
for (unsigned int i = 0; i < 32 && i < sz; i++)
    pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, "%02x ", d[i]);
request_log("  plane[%u] hex[0..32]: %s\n", p, hexbuf);

pos = 0;
if (sz >= 32) {
    for (unsigned int i = 0; i < 32; i++)
        pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos,
                        "%02x ", d[sz - 32 + i]);
    request_log("  plane[%u] tail hex[%zu..%zu]: %s\n",
                p, sz - 32, sz - 1, hexbuf);
}
```

**Amendment 2 (CRIT-2 fix)**: Wrap the entire dump block in the
existing null guard:

```c
if (surface_object->current_slot != NULL && dump_env != NULL && dump_env[0] == '1') {
    /* dump block here */
}
```

Or equivalently, place the dump immediately INSIDE the existing
`if (surface_object->current_slot != NULL)` block in `RequestSyncSurface`,
after the `cap_pool_mark_decoded` call.

**Amendment 3 (IMP-1 / experiment)**: For Phase 7 execution, add a second
test run with an explicit memset-zero of the CAPTURE slot immediately
after `cap_pool_acquire` in `picture.c::RequestBeginPicture` (or add a
diagnostic `memset` in `cap_pool_init`). Compare H.264 frame 1 output before
and after the memset. If the 16×32 patch disappears → stale-residue; adjust
IMP outcome table accordingly.

**Amendment 4 (IMP-3 / placement)**: Phase 6 implementer should place the
dump block between `cap_pool_mark_decoded(...)` and
`surface_object->status = VASurfaceDisplaying` — i.e., on the happy-path
only, not inside an error label. Do not place it after `goto complete`.

**Amendment 5 (MIN-2 / scan window)**: Change the non-zero scan limit from
1024 to `min(sz, (unsigned int)(destination_bytesperlines[0] * 16))` for
plane 0 (covers one full macroblock row), and from 1024 to
`min(sz, 1024)` for plane 1 (UV is smaller, 1024 is fine). This requires
access to `surface_object->destination_bytesperlines[0]` inside the dump
block, which is available in scope.

---

## Summary

Two CRIT issues, both in the γ implementation sketch rather than in the
strategy choice. The γ-first rationale is defensible but IMP-1 (stale-residue
alternative) adds a fourth diagnostic state that the plan's three-outcome
table doesn't handle. The fixes are all small (Amendment 1 is the most
important — broken hex formatting would make the dump useless). No plan-level
strategy change is required; the γ-then-α path stands.