From d4c04b4a3bafd95ba987dbc82f37a78e1e733e1b Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Wed, 13 May 2026 11:55:51 +0000 Subject: [PATCH] =?UTF-8?q?iter8=20Phase=205:=20sonnet-architect=20review?= =?UTF-8?q?=20=E2=80=94=202=20CRIT=20+=204=20IMP=20+=203=20MIN?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CRIT-1: request_log prepends prefix on every call; per-byte loop in γ sketch would emit 32 prefix-only lines. Fix: snprintf buffered emit. CRIT-2: γ dump block missing null guard on destination_data[]; the plan's env-var check is outside the current_slot != NULL guard. Fix: nest the dump inside the existing slot-null guard. IMP-1: "stale residue from prior decode" not eliminated as alternative explanation for the 16x32 patch. Add memset-zero-before-QBUF experiment to Phase 7 to discriminate. IMP-2: γ-first defensible but on IMP-1 grounds, not the three-signature argument (which is weaker than stated). IMP-3/4 placement clarifications. MIN-1/2/3 cosmetic. 5 mechanical amendments locked for Phase 6. γ-first strategy stands. Co-Authored-By: Claude Opus 4.7 (1M context) --- phase5_iter8_review.md | 362 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 362 insertions(+) create mode 100644 phase5_iter8_review.md diff --git a/phase5_iter8_review.md b/phase5_iter8_review.md new file mode 100644 index 0000000..901f191 --- /dev/null +++ b/phase5_iter8_review.md @@ -0,0 +1,362 @@ +# Iteration 8 — Phase 5 Review + +Reviewer: Sonnet-grade architect (independent read). Reviewed 2026-05-13. + +Documents read in order: phase0_findings_iter8.md, phase2_iter8_situation.md, +phase3_iter8_findings.md, phase4_iter8_plan.md. + +Source files examined: `src/surface.c`, `src/surface.h`, `src/request.h`, +`src/cap_pool.h`, `src/context.c` (lines 130–225), `src/video.h`, +`src/utils.c`. + +--- + +## Critical findings (CRIT) + +### CRIT-1 — `request_log` prepends a prefix on EVERY call; the inner-loop hex dump will not produce a readable hex line + +**Empirically verified.** + +`utils.c:37`: `fprintf(stderr, "%s: ", V4L2_REQUEST_STR_VENDOR)` fires on every +`request_log(...)` call before the format string. The plan's dump loop +(phase4 lines 62–65): + +```c +for (unsigned int i = 0; i < 32; i++) + request_log("%02x ", d[i]); +request_log("\n"); +``` + +will emit 32 separate stderr lines, each prefixed with `"v4l2-request: "`: + +``` +v4l2-request: 81 +v4l2-request: 81 +v4l2-request: 80 +... +``` + +This is not the intended single hex line. The newline call at the end also +emits a standalone `"v4l2-request: \n"` prefix line. The dump is still +technically readable if you strip the prefix from each line and join them, +but in practice when the output is captured from a multi-threaded ffmpeg run +(which also emits its own stderr lines), it becomes nearly impossible to +associate individual hex bytes with their plane/frame context. + +**Severity: CRIT** because the primary purpose of γ is empirical +disambiguation. If the dump output is unreadable it can produce a false +negative — the author scans the log, sees the prefix lines but can't +reconstruct the hex, and draws wrong conclusions about plane content. + +**Fix**: Write the full hex line to a local `char buf[128]` with `snprintf`, +then emit it with a single `request_log("%s\n", buf)`. Alternatively, call +`fprintf(stderr, ...)` directly inside the loop (no prefix) and emit one +`request_log(...)` per logical output line. Either approach is ~5 LOC delta +from the plan sketch. + +--- + +### CRIT-2 — γ placement: the plan says "after line 388 (`cap_pool_mark_decoded`)" but the surrounding guard `if (surface_object->current_slot != NULL)` matters + +**Empirically verified against `surface.c` lines 387–389.** + +```c +if (surface_object->current_slot != NULL) + cap_pool_mark_decoded(&driver_data->capture_pool, + surface_object->current_slot); +``` + +The plan inserts the dump after `cap_pool_mark_decoded`. At that point: + +- `destination_data[p]` is still valid: it was populated at `BeginPicture` → + `surface_bind_slot` and the slot has not been released (slot state is now + DECODED, not FREE). Correct. +- `destination_planes_count` is also valid. Correct. + +However the plan's sketch uses `surface_object->destination_planes_count` and +`destination_data[p]` outside any null-guard. If the dump fires on an error +path where `current_slot == NULL` (i.e., the `if` above was not entered), +`destination_data[p]` may hold a stale or null pointer from a prior decode +cycle or from `RequestCreateSurfaces2`'s zeroed state. The plan's env-var +check is outside the `if (surface_object->current_slot != NULL)` guard. + +**Fix**: Place the dump inside the same `if (surface_object->current_slot != NULL)` guard, or move it directly below `cap_pool_mark_decoded` with an +explicit null-check on `destination_data[0]`. This is a 2-line change to the +sketch. + +--- + +## Important findings (IMP) + +### IMP-1 — The bug redefinition (phase 3): "stale data from prior run" as alternative explanation for the 16×32 patch is not fully eliminated + +Phase 3 asserts (line 143): "Bug 4 is NOT primarily an inter-frame race. +Even the keyframe (frame 1, IDR) fails to decode fully." + +The supporting evidence (line 30): the 16×32 patch values cluster around +0x7c..0x83 (luma neutral, smooth gradients). Phase 3 explicitly states these +are "real decoded image data." + +Phase 3 also notes (line 30): "kdirect frame 1 fills the entire Y plane with +proper Y values; the 0x7c-0x83 cluster in libva is NOT in kdirect's frame 1 +(kdirect has Y≈0x10 black throughout)." + +This means the 16×32 patch content does NOT match frame 1's actual content. +Phase 3 acknowledges this in scenario A (line 124): "the kernel decoded one +tile then aborted." But there is a second-order alternative that phase 3 does +not fully close: **the 512 bytes are stale from a previous decode run's +CAPTURE buffer content that was never zeroed before the current run reused +the slot.** Specifically: + +- `cap_pool_init` mmaps the CAPTURE buffers. If the kernel driver (rkvdec) + doesn't zero-initialize the physical pages on REQBUFS, the pages may hold + data from a previous use of the same physical memory. +- BBB 1080p intro frames have luma values around 0x10 (near-black). But + other content — including previous test runs of a different test clip — may + have had neutral-gray luma in the 0x7c-0x83 range. +- The 16×32 region is exactly one macroblock row × 2 rows = a stripe that + rkvdec's hardware decode engine commonly writes first (tile/MB-order). + +If the 16×32 patch is stale residue rather than partial decode output, the +interpretation changes: the kernel may have written NOTHING for H.264 frame 1 +via libva, not "started then aborted." The bug would then be "kernel rejects +silently, CAPTURE buffer never touched" — same as HEVC's all-zero (Bug 5) but +with 512 bytes of pre-existing stale bytes. + +**How to distinguish**: γ partially addresses this — if the dump shows the 512 +bytes immediately after DQBUF, it would at least confirm the timing (the data +is there at DQBUF time, not injected later). But a stale-residue theory +predicts the dump also shows those 512 bytes. The definitive test would be to +deliberately zero the CAPTURE buffer (with `memset(slot->map[0], 0, +slot->map_lengths[0])`) after `cap_pool_acquire` and before QBUF, then run +the H.264 sweep. If the 16×32 patch disappears → stale residue. If it +persists → kernel really writes it. + +**This is an author re-verification candidate.** The plan should note this +alternative in Phase 7 execution: add one `memset` run to discriminate. + +### IMP-2 — γ vs α-first: the plan's reasoning ("three distinct signatures +cannot be explained without empirical narrowing") is sound but the actual +cost of α is understated + +The plan characterizes the γ-vs-α choice as: "three distinct signatures +(VP9 correct, HEVC zeroed, H.264 partial-leak) cannot be explained by a +single root cause hypothesis without empirical narrowing." + +This is partially correct. However, the three-signature argument does NOT +apply to the specific α candidates identified in phase3: + +- **SPS.constraint_set_flags** (libva=0, kdirect=2): this is a per-codec + change (H.264 only, in `h264_set_controls`). Changing it risks nothing for + VP9/HEVC/MPEG-2/VP8. +- **DECODE_PARAMS.dpb[].bottom_field_order_cnt**: similarly H.264-specific. + +Neither change touches a shared code path. The three-signature argument is +an argument against a shared-path fix — not against an H.264-specific +10-LOC fix. The `feedback_unconditional_codec_state.md` rule is satisfied by +profile-gating; both α candidates are already gated to H.264. + +A counter-argument for γ-first remains: even if α's candidates are correctly +identified, if the kernel didn't write anything (stale-residue scenario per +IMP-1), fixing control fields changes the wire protocol but still leaves a +silent-reject. γ tells us whether the kernel writes before we guess which +field to fix. Given the 16×32 ambiguity, γ-first is defensible. But the plan +should acknowledge this tradeoff explicitly rather than citing the three- +signature argument, which is weaker than it appears. + +**Recommendation**: keep γ-first but update the rationale in Phase 4 (or +note it here for Phase 6 drafting). + +### IMP-3 — Plan says "add to line 388 (cap_pool_mark_decoded)" but does not +say WHERE in RequestSyncSurface to put the dump relative to `status = VA_STATUS_SUCCESS` and `goto complete` + +Looking at `surface.c` lines 387–394: + +```c +if (surface_object->current_slot != NULL) + cap_pool_mark_decoded(&driver_data->capture_pool, + surface_object->current_slot); + +surface_object->status = VASurfaceDisplaying; +status = VA_STATUS_SUCCESS; +goto complete; +``` + +The plan should insert the dump between `cap_pool_mark_decoded` and +`surface_object->status = VASurfaceDisplaying` — i.e., still on the success +path only, before the goto. This is unambiguous once you read the code but +the plan sketch doesn't reference the surrounding context, leaving Phase 6 +implementer to decide. + +### IMP-4 — Phase 4 predicts 3 outcomes for γ but the "stale slot" scenario +maps cleanly to a 4th diagnostic state that the dump can't distinguish alone + +The plan's three-outcome table (phase4 lines 92-96): + +| Dump shows | Interpretation | +|---|---| +| plane[0] only 16×32 populated | Kernel didn't write | +| plane[0] fully populated | libva mis-reads | +| plane[0] populated with other content | Slot binding wrong | + +Scenario "plane[0] only 16×32 populated" is assigned to "Kernel didn't +write." But per IMP-1, it could also be "Kernel wrote NOTHING and 16×32 is +stale residue from prior kernel use." These require different follow-up: +"Kernel didn't write" → control-fill fix. "Stale residue" → the control fix +may still be correct, but the first-run memset experiment should be done +first to correctly classify. + +The plan should add a row: + +| plane[0] non-zero 16×32 only, VP9 dump is fully populated | Inconclusive (stale vs partial-write); run memset-zero-before-QBUF experiment | + +--- + +## Minor findings (MIN) + +### MIN-1 — env-var caching uses file-scope statics; these would be shared +across concurrent contexts if libva ever has two active contexts in the same +process + +`surface.c` is compiled into a single shared library loaded once per process. +File-scope `static bool dump_env_checked` is process-global. For the +diagnostic use case this is fine (we want one env check per process). Just +note it so future reviewers don't flag it as a bug. + +### MIN-2 — `sz > 1024 ? 1024 : sz` in the non-zero-scan loop: for 1920×1088 +NV12 plane[0], `sz = 1920×1088 = 2,088,960`. Sampling only the first 1024 +bytes is intentional for log volume but may completely miss whether the +kernel wrote content that's not in the first 1024 bytes + +For H.264 1080p in FRAME_BASED mode, rkvdec writes output MB-order or tile- +order. If the kernel writes the first MB row (bytes 0..30720 for 1920-wide) +but not the rest, the 1024-byte sample catches it. If the kernel writes only +the LAST macroblock row (pathological), the 1024-byte scan reports zero +non-zero bytes and misclassifies the dump. Suggest extending the sample to +cover at minimum one full MB row: `min(sz, 1920 * 16) = 30720` for H.264 +1080p. + +This is MIN because the leading hypothesis has content at the TOP-LEFT (the +16×32 patch is at offset 0), so the 1024-byte scan would catch it. But make +the note for future runs with different content. + +### MIN-3 — `surface_id` in the dump log is the VAAPI surface ID (integer +offset into object_heap), not a human-readable frame number. The log line +`"CAPTURE buffer dump for surface %d"` will print something like `67108864` +(0x4000000 SURFACE_ID_OFFSET). This is not wrong, just potentially confusing. +Adding `surface_object->destination_index` (already in the plan) is the +cleaner identifier for correlating with strace DQBUF logs. + +--- + +## Author re-verification list + +Per `feedback_review_empirical_over_theoretical.md`: the following are CLAIMS +by this reviewer that the author should empirically verify before incorporating +into an amended plan. + +1. **CRIT-1 (request_log prefix)**: Verify by running any existing call site + where `request_log` is called in a loop and inspecting stderr. The claim + (each call emits its own `"v4l2-request: "` prefix) follows directly from + `utils.c:37`, but if there is a buffered-output path or a different compile + guard in the installed binary, the observation may differ. Cheap test: + `grep -A3 "request_log" src/utils.c` and read `utils.c:33-42` (which I did + — see source read above). + +2. **CRIT-2 (null guard on destination_data)**: Verify by tracing when + `destination_data[0]` can be null at the dump insertion point. `surface.c` + `RequestCreateSurfaces2:207` zero-inits `params`, `slices_count`, etc. but + `destination_data[]` is not explicitly zero-inited in that function; it + relies on the zero-init from `object_heap_allocate`. Whether + `object_heap_allocate` zero-fills depends on the heap implementation. + Author should grep `object_heap_allocate` to confirm zero-fill behavior + before relying on `destination_data[0] == NULL` as a safe null-check. + +3. **IMP-1 (stale-residue alternative)**: The claim that the kernel may not + zero-initialize rkvdec CAPTURE buffers on REQBUFS is based on typical + DMA-coherent allocator behavior on RK3399. On some kernels, dma_alloc_* + does zero physical pages at allocation time. The author should verify on + fresnel by running a memset-before-QBUF experiment (described in IMP-1) + rather than accepting either interpretation theoretically. + +4. **context.c:201–205 (destination_sizes for planes>0)**: Reviewer reads + `destination_sizes[1] = destination_sizes[0] / 2` for single-buffer NV12. + This is the UV plane size = half the Y plane size (NV12 chroma is half- + height half-width interleaved). Verify that this matches what + `v4l2_get_format` returns for CAPTURE sizeimage on rkvdec H.264 1080p — + the G_FMT sizeimage may include alignment padding (e.g. `4177920 = 1920 × + 1088 × 3/2` without padding). If `destination_sizes[0]` is set from + `bytesperlines[0] * format_height = 1920 × 1088 = 2,088,960`, then + `destination_sizes[1] = 1,044,480`. But if the kernel reports + `sizeimage = 4177920` as a single plane total, the split may differ. + Confirm the specific value in the γ dump output for plane[1].sz. + +--- + +## Amendments to Phase 6 + +These are the mechanical changes to apply when implementing γ: + +**Amendment 1 (CRIT-1 fix)**: Replace the inner hex-dump loops with +buffered `snprintf` output emitted in a single `request_log` call: + +```c +/* Replace the per-byte request_log calls with: */ +char hexbuf[128]; +int pos = 0; +for (unsigned int i = 0; i < 32 && i < sz; i++) + pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, "%02x ", d[i]); +request_log(" plane[%u] hex[0..32]: %s\n", p, hexbuf); + +pos = 0; +if (sz >= 32) { + for (unsigned int i = 0; i < 32; i++) + pos += snprintf(hexbuf + pos, sizeof(hexbuf) - pos, + "%02x ", d[sz - 32 + i]); + request_log(" plane[%u] tail hex[%zu..%zu]: %s\n", + p, sz - 32, sz - 1, hexbuf); +} +``` + +**Amendment 2 (CRIT-2 fix)**: Wrap the entire dump block in the +existing null guard: + +```c +if (surface_object->current_slot != NULL && dump_env != NULL && dump_env[0] == '1') { + /* dump block here */ +} +``` + +Or equivalently, place the dump immediately INSIDE the existing +`if (surface_object->current_slot != NULL)` block in `RequestSyncSurface`, +after the `cap_pool_mark_decoded` call. + +**Amendment 3 (IMP-1 / experiment)**: For Phase 7 execution, add a second +test run with an explicit memset-zero of the CAPTURE slot immediately +after `cap_pool_acquire` in `picture.c::RequestBeginPicture` (or add a +diagnostic `memset` in `cap_pool_init`). Compare H.264 frame 1 output before +and after the memset. If the 16×32 patch disappears → stale-residue; adjust +IMP outcome table accordingly. + +**Amendment 4 (IMP-3 / placement)**: Phase 6 implementer should place the +dump block between `cap_pool_mark_decoded(...)` and +`surface_object->status = VASurfaceDisplaying` — i.e., on the happy-path +only, not inside an error label. Do not place it after `goto complete`. + +**Amendment 5 (MIN-2 / scan window)**: Change the non-zero scan limit from +1024 to `min(sz, (unsigned int)(destination_bytesperlines[0] * 16))` for +plane 0 (covers one full macroblock row), and from 1024 to +`min(sz, 1024)` for plane 1 (UV is smaller, 1024 is fine). This requires +access to `surface_object->destination_bytesperlines[0]` inside the dump +block, which is available in scope. + +--- + +## Summary + +Two CRIT issues, both in the γ implementation sketch rather than in the +strategy choice. The γ-first rationale is defensible but IMP-1 (stale-residue +alternative) adds a fourth diagnostic state that the plan's three-outcome +table doesn't handle. The fixes are all small (Amendment 1 is the most +important — broken hex formatting would make the dump useless). No plan-level +strategy change is required; the γ-then-α path stands.