Files

T

marfrit d4c04b4a3b iter8 Phase 5: sonnet-architect review — 2 CRIT + 4 IMP + 3 MIN

CRIT-1: request_log prepends prefix on every call; per-byte loop in γ
sketch would emit 32 prefix-only lines. Fix: snprintf buffered emit.

CRIT-2: γ dump block missing null guard on destination_data[]; the
plan's env-var check is outside the current_slot != NULL guard. Fix:
nest the dump inside the existing slot-null guard.

IMP-1: "stale residue from prior decode" not eliminated as alternative
explanation for the 16x32 patch. Add memset-zero-before-QBUF experiment
to Phase 7 to discriminate.

IMP-2: γ-first defensible but on IMP-1 grounds, not the
three-signature argument (which is weaker than stated).

IMP-3/4 placement clarifications. MIN-1/2/3 cosmetic.

5 mechanical amendments locked for Phase 6. γ-first strategy stands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 11:55:51 +00:00

16 KiB

Raw Blame History

Iteration 8 — Phase 5 Review

Reviewer: Sonnet-grade architect (independent read). Reviewed 2026-05-13.

Documents read in order: phase0_findings_iter8.md, phase2_iter8_situation.md, phase3_iter8_findings.md, phase4_iter8_plan.md.

Source files examined: src/surface.c, src/surface.h, src/request.h, src/cap_pool.h, src/context.c (lines 130–225), src/video.h, src/utils.c.

Critical findings (CRIT)

CRIT-1 — `request_log` prepends a prefix on EVERY call; the inner-loop hex dump will not produce a readable hex line

Empirically verified.

utils.c:37: fprintf(stderr, "%s: ", V4L2_REQUEST_STR_VENDOR) fires on every request_log(...) call before the format string. The plan's dump loop (phase4 lines 62–65):

for (unsigned int i = 0; i < 32; i++)
    request_log("%02x ", d[i]);
request_log("\n");

will emit 32 separate stderr lines, each prefixed with "v4l2-request: ":

v4l2-request: 81
v4l2-request: 81
v4l2-request: 80
...

This is not the intended single hex line. The newline call at the end also emits a standalone "v4l2-request: \n" prefix line. The dump is still technically readable if you strip the prefix from each line and join them, but in practice when the output is captured from a multi-threaded ffmpeg run (which also emits its own stderr lines), it becomes nearly impossible to associate individual hex bytes with their plane/frame context.

Severity: CRIT because the primary purpose of γ is empirical disambiguation. If the dump output is unreadable it can produce a false negative — the author scans the log, sees the prefix lines but can't reconstruct the hex, and draws wrong conclusions about plane content.

Fix: Write the full hex line to a local char buf[128] with snprintf, then emit it with a single request_log("%s\n", buf). Alternatively, call fprintf(stderr, ...) directly inside the loop (no prefix) and emit one request_log(...) per logical output line. Either approach is ~5 LOC delta from the plan sketch.

CRIT-2 — γ placement: the plan says "after line 388 (`cap_pool_mark_decoded`)" but the surrounding guard `if (surface_object->current_slot != NULL)` matters

Empirically verified against surface.c lines 387–389.

if (surface_object->current_slot != NULL)
    cap_pool_mark_decoded(&driver_data->capture_pool,
                          surface_object->current_slot);

The plan inserts the dump after cap_pool_mark_decoded. At that point:

destination_data[p] is still valid: it was populated at BeginPicture → surface_bind_slot and the slot has not been released (slot state is now DECODED, not FREE). Correct.
destination_planes_count is also valid. Correct.

However the plan's sketch uses surface_object->destination_planes_count and destination_data[p] outside any null-guard. If the dump fires on an error path where current_slot == NULL (i.e., the if above was not entered), destination_data[p] may hold a stale or null pointer from a prior decode cycle or from RequestCreateSurfaces2's zeroed state. The plan's env-var check is outside the if (surface_object->current_slot != NULL) guard.

Fix: Place the dump inside the same if (surface_object->current_slot != NULL) guard, or move it directly below cap_pool_mark_decoded with an explicit null-check on destination_data[0]. This is a 2-line change to the sketch.

Important findings (IMP)

IMP-1 — The bug redefinition (phase 3): "stale data from prior run" as alternative explanation for the 16×32 patch is not fully eliminated

Phase 3 asserts (line 143): "Bug 4 is NOT primarily an inter-frame race. Even the keyframe (frame 1, IDR) fails to decode fully."

The supporting evidence (line 30): the 16×32 patch values cluster around 0x7c..0x83 (luma neutral, smooth gradients). Phase 3 explicitly states these are "real decoded image data."

Phase 3 also notes (line 30): "kdirect frame 1 fills the entire Y plane with proper Y values; the 0x7c-0x83 cluster in libva is NOT in kdirect's frame 1 (kdirect has Y≈0x10 black throughout)."

This means the 16×32 patch content does NOT match frame 1's actual content. Phase 3 acknowledges this in scenario A (line 124): "the kernel decoded one tile then aborted." But there is a second-order alternative that phase 3 does not fully close: the 512 bytes are stale from a previous decode run's CAPTURE buffer content that was never zeroed before the current run reused the slot. Specifically:

cap_pool_init mmaps the CAPTURE buffers. If the kernel driver (rkvdec) doesn't zero-initialize the physical pages on REQBUFS, the pages may hold data from a previous use of the same physical memory.
BBB 1080p intro frames have luma values around 0x10 (near-black). But other content — including previous test runs of a different test clip — may have had neutral-gray luma in the 0x7c-0x83 range.
The 16×32 region is exactly one macroblock row × 2 rows = a stripe that rkvdec's hardware decode engine commonly writes first (tile/MB-order).

If the 16×32 patch is stale residue rather than partial decode output, the interpretation changes: the kernel may have written NOTHING for H.264 frame 1 via libva, not "started then aborted." The bug would then be "kernel rejects silently, CAPTURE buffer never touched" — same as HEVC's all-zero (Bug 5) but with 512 bytes of pre-existing stale bytes.

How to distinguish: γ partially addresses this — if the dump shows the 512 bytes immediately after DQBUF, it would at least confirm the timing (the data is there at DQBUF time, not injected later). But a stale-residue theory predicts the dump also shows those 512 bytes. The definitive test would be to deliberately zero the CAPTURE buffer (with memset(slot->map[0], 0, slot->map_lengths[0])) after cap_pool_acquire and before QBUF, then run the H.264 sweep. If the 16×32 patch disappears → stale residue. If it persists → kernel really writes it.

This is an author re-verification candidate. The plan should note this alternative in Phase 7 execution: add one memset run to discriminate.

IMP-2 — γ vs α-first: the plan's reasoning ("three distinct signatures

cannot be explained without empirical narrowing") is sound but the actual cost of α is understated

The plan characterizes the γ-vs-α choice as: "three distinct signatures (VP9 correct, HEVC zeroed, H.264 partial-leak) cannot be explained by a single root cause hypothesis without empirical narrowing."

This is partially correct. However, the three-signature argument does NOT apply to the specific α candidates identified in phase3:

SPS.constraint_set_flags (libva=0, kdirect=2): this is a per-codec change (H.264 only, in h264_set_controls). Changing it risks nothing for VP9/HEVC/MPEG-2/VP8.
DECODE_PARAMS.dpb[].bottom_field_order_cnt: similarly H.264-specific.

Neither change touches a shared code path. The three-signature argument is an argument against a shared-path fix — not against an H.264-specific 10-LOC fix. The feedback_unconditional_codec_state.md rule is satisfied by profile-gating; both α candidates are already gated to H.264.

A counter-argument for γ-first remains: even if α's candidates are correctly identified, if the kernel didn't write anything (stale-residue scenario per IMP-1), fixing control fields changes the wire protocol but still leaves a silent-reject. γ tells us whether the kernel writes before we guess which field to fix. Given the 16×32 ambiguity, γ-first is defensible. But the plan should acknowledge this tradeoff explicitly rather than citing the three- signature argument, which is weaker than it appears.

Recommendation: keep γ-first but update the rationale in Phase 4 (or note it here for Phase 6 drafting).

IMP-3 — Plan says "add to line 388 (cap_pool_mark_decoded)" but does not

say WHERE in RequestSyncSurface to put the dump relative to status = VA_STATUS_SUCCESS and goto complete

Looking at surface.c lines 387–394:

if (surface_object->current_slot != NULL)
    cap_pool_mark_decoded(&driver_data->capture_pool,
                          surface_object->current_slot);

surface_object->status = VASurfaceDisplaying;
status = VA_STATUS_SUCCESS;
goto complete;

The plan should insert the dump between cap_pool_mark_decoded and surface_object->status = VASurfaceDisplaying — i.e., still on the success path only, before the goto. This is unambiguous once you read the code but the plan sketch doesn't reference the surrounding context, leaving Phase 6 implementer to decide.

IMP-4 — Phase 4 predicts 3 outcomes for γ but the "stale slot" scenario

maps cleanly to a 4th diagnostic state that the dump can't distinguish alone

The plan's three-outcome table (phase4 lines 92-96):

Dump shows	Interpretation
plane[0] only 16×32 populated	Kernel didn't write
plane[0] fully populated	libva mis-reads
plane[0] populated with other content	Slot binding wrong

Scenario "plane[0] only 16×32 populated" is assigned to "Kernel didn't write." But per IMP-1, it could also be "Kernel wrote NOTHING and 16×32 is stale residue from prior kernel use." These require different follow-up: "Kernel didn't write" → control-fill fix. "Stale residue" → the control fix may still be correct, but the first-run memset experiment should be done first to correctly classify.

The plan should add a row:

| plane[0] non-zero 16×32 only, VP9 dump is fully populated | Inconclusive (stale vs partial-write); run memset-zero-before-QBUF experiment |

Minor findings (MIN)

MIN-1 — env-var caching uses file-scope statics; these would be shared

across concurrent contexts if libva ever has two active contexts in the same process

surface.c is compiled into a single shared library loaded once per process. File-scope static bool dump_env_checked is process-global. For the diagnostic use case this is fine (we want one env check per process). Just note it so future reviewers don't flag it as a bug.

MIN-2 — `sz > 1024 ? 1024 : sz` in the non-zero-scan loop: for 1920×1088

NV12 plane[0], sz = 1920×1088 = 2,088,960. Sampling only the first 1024 bytes is intentional for log volume but may completely miss whether the kernel wrote content that's not in the first 1024 bytes

For H.264 1080p in FRAME_BASED mode, rkvdec writes output MB-order or tile- order. If the kernel writes the first MB row (bytes 0..30720 for 1920-wide) but not the rest, the 1024-byte sample catches it. If the kernel writes only the LAST macroblock row (pathological), the 1024-byte scan reports zero non-zero bytes and misclassifies the dump. Suggest extending the sample to cover at minimum one full MB row: min(sz, 1920 * 16) = 30720 for H.264 1080p.

This is MIN because the leading hypothesis has content at the TOP-LEFT (the 16×32 patch is at offset 0), so the 1024-byte scan would catch it. But make the note for future runs with different content.

MIN-3 — `surface_id` in the dump log is the VAAPI surface ID (integer

offset into object_heap), not a human-readable frame number. The log line "CAPTURE buffer dump for surface %d" will print something like 67108864 (0x4000000 SURFACE_ID_OFFSET). This is not wrong, just potentially confusing. Adding surface_object->destination_index (already in the plan) is the cleaner identifier for correlating with strace DQBUF logs.

Author re-verification list

Per feedback_review_empirical_over_theoretical.md: the following are CLAIMS by this reviewer that the author should empirically verify before incorporating into an amended plan.

CRIT-1 (request_log prefix): Verify by running any existing call site where request_log is called in a loop and inspecting stderr. The claim (each call emits its own "v4l2-request: " prefix) follows directly from utils.c:37, but if there is a buffered-output path or a different compile guard in the installed binary, the observation may differ. Cheap test: grep -A3 "request_log" src/utils.c and read utils.c:33-42 (which I did — see source read above).
CRIT-2 (null guard on destination_data): Verify by tracing when destination_data[0] can be null at the dump insertion point. surface.c RequestCreateSurfaces2:207 zero-inits params, slices_count, etc. but destination_data[] is not explicitly zero-inited in that function; it relies on the zero-init from object_heap_allocate. Whether object_heap_allocate zero-fills depends on the heap implementation. Author should grep object_heap_allocate to confirm zero-fill behavior before relying on destination_data[0] == NULL as a safe null-check.
IMP-1 (stale-residue alternative): The claim that the kernel may not zero-initialize rkvdec CAPTURE buffers on REQBUFS is based on typical DMA-coherent allocator behavior on RK3399. On some kernels, dma_alloc_* does zero physical pages at allocation time. The author should verify on fresnel by running a memset-before-QBUF experiment (described in IMP-1) rather than accepting either interpretation theoretically.
context.c:201–205 (destination_sizes for planes>0): Reviewer reads destination_sizes[1] = destination_sizes[0] / 2 for single-buffer NV12. This is the UV plane size = half the Y plane size (NV12 chroma is half- height half-width interleaved). Verify that this matches what v4l2_get_format returns for CAPTURE sizeimage on rkvdec H.264 1080p — the G_FMT sizeimage may include alignment padding (e.g. 4177920 = 1920 × 1088 × 3/2 without padding). If destination_sizes[0] is set from bytesperlines[0] * format_height = 1920 × 1088 = 2,088,960, then destination_sizes[1] = 1,044,480. But if the kernel reports sizeimage = 4177920 as a single plane total, the split may differ. Confirm the specific value in the γ dump output for plane[1].sz.

Amendments to Phase 6

These are the mechanical changes to apply when implementing γ:

Amendment 1 (CRIT-1 fix): Replace the inner hex-dump loops with buffered snprintf output emitted in a single request_log call:

/* Replace the per-byte request_log calls with: */
char hexbuf[128];
int pos = 0;
for (unsigned int i = 0; i < 32 && i < sz; i++)
    pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, "%02x ", d[i]);
request_log("  plane[%u] hex[0..32]: %s\n", p, hexbuf);

pos = 0;
if (sz >= 32) {
    for (unsigned int i = 0; i < 32; i++)
        pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos,
                        "%02x ", d[sz - 32 + i]);
    request_log("  plane[%u] tail hex[%zu..%zu]: %s\n",
                p, sz - 32, sz - 1, hexbuf);
}

Amendment 2 (CRIT-2 fix): Wrap the entire dump block in the existing null guard:

if (surface_object->current_slot != NULL && dump_env != NULL && dump_env[0] == '1') {
    /* dump block here */
}

Or equivalently, place the dump immediately INSIDE the existing if (surface_object->current_slot != NULL) block in RequestSyncSurface, after the cap_pool_mark_decoded call.

Amendment 3 (IMP-1 / experiment): For Phase 7 execution, add a second test run with an explicit memset-zero of the CAPTURE slot immediately after cap_pool_acquire in picture.c::RequestBeginPicture (or add a diagnostic memset in cap_pool_init). Compare H.264 frame 1 output before and after the memset. If the 16×32 patch disappears → stale-residue; adjust IMP outcome table accordingly.

Amendment 4 (IMP-3 / placement): Phase 6 implementer should place the dump block between cap_pool_mark_decoded(...) and surface_object->status = VASurfaceDisplaying — i.e., on the happy-path only, not inside an error label. Do not place it after goto complete.

Amendment 5 (MIN-2 / scan window): Change the non-zero scan limit from 1024 to min(sz, (unsigned int)(destination_bytesperlines[0] * 16)) for plane 0 (covers one full macroblock row), and from 1024 to min(sz, 1024) for plane 1 (UV is smaller, 1024 is fine). This requires access to surface_object->destination_bytesperlines[0] inside the dump block, which is available in scope.

Summary

Two CRIT issues, both in the γ implementation sketch rather than in the strategy choice. The γ-first rationale is defensible but IMP-1 (stale-residue alternative) adds a fourth diagnostic state that the plan's three-outcome table doesn't handle. The fixes are all small (Amendment 1 is the most important — broken hex formatting would make the dump useless). No plan-level strategy change is required; the γ-then-α path stands.

16 KiB Raw Blame History Unescape Escape