Files
fresnel-fourier/phase5_iter5b_review_v2.md
T
marfrit 3508a2cfeb iter5b Phase 5 v2: 2 CRIT findings — NULL guard + missing request_pool_destroy
CRIT-1: context.c:64-66 video_format==NULL guard rejects every first
β CreateContext. β moves the probe from CreateSurfaces2 into
CreateContext itself, so the guard fires before any new logic runs.
Fix: remove guard, move CAPTURE probe to top of CreateContext.

CRIT-2: DestroyContext lacks request_pool_destroy. Empirical grep
shows only surface.c:220 (which β strips) calls it per-session.
Without amendment, second CreateContext gets pool->initialized=true
with stale slot pointers → QBUF EINVAL. Fix: add request_pool_destroy
to DestroyContext before REQBUFS(0). C3 (surface.c strip) and CRIT-2
fix MUST land together.

Plus IMP-1 (mplane assumption wrong for SUNXI_TILED_NV12) + IMP-2
(surface_reset_format_cache becomes dead under C7) + IMP-3 (error
recovery comment).

Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes. Author confirmed
both at code level — Phase 5 caught what Phase 4 v2's surface read
missed ("DestroyContext teardown — no change needed" — wrong; was
incomplete).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:50:08 +00:00

12 KiB
Raw Blame History

Iteration 5b — Phase 5 review v2 (of option-β plan)

Phase 5 architect review of phase4_iter5b_plan_v2.md by sonnet-architect, 2026-05-12. Second Phase 5 pass for iter5b after the first Phase 5 cleared α' (which then failed at Phase 7).

Executive summary

Option β's architectural choice is correct — the α' failure mode is genuinely eliminated. CreateContext is the right lifecycle site because config_id is unambiguously available there. But two Critical blocking issues exist that the plan glossed over: (1) video_format == NULL guard at context.c:64-66 rejects every first β CreateContext; (2) DestroyContext lacks request_pool_destroy, leaving stale output_pool state across context cycles. Stripping surface.c (C3) without fixing both makes the second CreateContext fail in a worse way than α' did.

Plus three Important findings and four Non-Amendment confirmations.

Critical findings (BLOCKING Phase 6)

CRIT-1 — video_format == NULL guard at context.c:64-66 will reject every first β CreateContext

In current code:

video_format = driver_data->video_format;     // context.c:64
if (video_format == NULL)                      // context.c:65
    return VA_STATUS_ERROR_OPERATION_FAILED;   // context.c:66
output_type = v4l2_type_video_output(video_format->v4l2_mplane);   // :68
capture_type = v4l2_type_video_capture(video_format->v4l2_mplane); // :69

Pre-β: CreateSurfaces2 (surface.c:253) sets driver_data->video_format before CreateContext runs. Under β, that probe MOVES into CreateContext. First CreateContext call: driver_data->video_format == NULL → early return.

Required fix: Remove the early-return at lines 65-66. Move CAPTURE-format probe (which sets driver_data->video_format) to the top of CreateContext. Derive output_type/capture_type after the probe.

CRIT-2 — request_pool_destroy missing from DestroyContext: second CreateContext re-uses stale pool

DestroyContext does:

  • v4l2_set_stream(OUTPUT, false) (line 276)
  • v4l2_set_stream(CAPTURE, false) (line 280)
  • RequestDestroySurfaces (line 286)
  • v4l2_request_buffers(OUTPUT, 0) (line 296) — REQBUFS(0) frees kernel-side OUTPUT buffers
  • cap_pool_destroy (line 306-307) — frees CAPTURE side
  • surface_reset_format_cache (line 319)

Missing: request_pool_destroy(&driver_data->output_pool). Empirical grep:

src/request_pool.c:93:    request_pool_destroy(pool);     # internal error path
src/request_pool.c:97:void request_pool_destroy(...)      # definition
src/request.c:361:    request_pool_destroy(...)            # Terminate
src/surface.c:220:        request_pool_destroy(...)        # resolution-change in CreateSurfaces2

Only surface.c:220 calls it from a per-session path — and β strips that line.

request_pool_init at request_pool.c:35-36 has if (pool->initialized) return 0;. Under β:

  1. First CreateContext: request_pool_init succeeds, pool->initialized=true.
  2. DestroyContext: REQBUFS(0) frees kernel buffers BUT pool->initialized stays true, slot mmap pointers stay populated.
  3. Second CreateContext: request_pool_init sees initialized=true, returns 0 — pool retains stale slot indices pointing at kernel buffers that no longer exist.
  4. Next QBUF on any slot → EINVAL.

Required fix: Add request_pool_destroy(&driver_data->output_pool) to DestroyContext BEFORE v4l2_request_buffers(OUTPUT, 0). Order matters: munmap (request_pool_destroy) first, then REQBUFS(0) (kernel buffer free). Mirror surface.c:219-220's existing if (output_pool.initialized) request_pool_destroy(...) shape.

C3 (surface.c strip) and CRIT-2 fix (DestroyContext add) MUST land in the same commit or strictly sequenced.

Important findings

IMP-1 — output_type mplane assumption: hardcoded v4l2_type_video_output(true) is wrong for SUNXI_TILED_NV12

Plan's C4 opens with output_type_pre = v4l2_type_video_output(true); to set OUTPUT format before CAPTURE probe. Works on fresnel (NV12 = mplane=true) but fails on cedrus/Allwinner targets where SUNXI_TILED_NV12 has v4l2_mplane=false.

Fix: Reorder — probe CAPTURE first (the find_format calls don't require OUTPUT to be set), derive output_type from video_format->v4l2_mplane, THEN S_FMT(OUTPUT). Eliminates the output_type_pre hack.

Empirical correctness: v4l2_find_format enumerates available formats without requiring any S_FMT first — confirmed by surface.c:238-253 doing exactly that probe.

IMP-2 — driver_data->video_format retained across DestroyContext; surface_reset_format_cache becomes dead

video_format is a pointer into static formats[] in video.c, never freed. Skipping the probe on second CreateContext via if (!driver_data->video_format) is intentional and correct.

But: DestroyContext's surface_reset_format_cache call (line 319) and its comment (lines 309-318) reference last_output_width/height which C7 deletes. Under β, surface_reset_format_cache is empty/dead.

Fix: C7 should also delete surface_reset_format_cache and the call in DestroyContext, plus update the comment to reflect β's CreateContext-centric lifecycle.

IMP-3 — Error recovery in CreateContext (β): partial state on S_FMT-succeeds + cap_pool_init-fails

If S_FMT(OUTPUT) succeeds then cap_pool_init fails, the existing error path (context.c:244-252) frees ids+context_object but leaves OUTPUT format set. cap_pool's own error_cleanup at cap_pool.c:125-137 handles its own REQBUFS(0), so cap_pool failure is self-recovering. OUTPUT S_FMT is not destructive when no buffers exist yet.

Safe but messy. Phase 6 should add a comment in the error path noting what state remains, so future readers don't think the rollback is complete.

Non-amendment findings

NC-1 — α' failure mode eliminated by β: CONFIRMED

α' crash: CreateSurfaces2 fires mid-stream → cap_pool_destroy → stale destination_data → SIGSEGV. Under β: CreateSurfaces2 touches no V4L2 device state and allocates no V4L2 buffers. No cap_pool_destroy from CreateSurfaces2. Multi-CreateSurfaces2 mid-stream is harmless.

NC-2 — Multi-context without intervening DestroyContext: SAFE

Second CreateContext sees STREAMON-on-OUTPUT EBUSY → returns OPERATION_FAILED. No silent corruption. Real consumers don't do this anyway.

NC-3 — Field partitioning for surface_bind_slot: CORRECT

Format-uniform fields (destination_planes_count, destination_sizes, destination_offsets, destination_bytesperlines, destination_buffers_count) move to CreateContext (β). Per-slot fields (destination_index, destination_map[], destination_data[]) stay at BeginPicture's surface_bind_slot. The split is exactly what surface.h:63-72 documents.

Minor: destination_buffers_count is also set at surface_bind_slot (line 101). Redundant but harmless (slot->buffers_count matches video_format->v4l2_buffers_count).

NC-4 — mpv probe-then-real pattern: CORRECT under β with CRIT-2 fix

Probe-CreateContext → DestroyContext (with CRIT-2 fix: full pool teardown) → Real-CreateContext. Clean rebuild at the new resolution. Without CRIT-2 fix this path breaks; with CRIT-2 it works.

Plan-specific concern answers

Concern Verdict
1 (Error-recovery branches) IMP-3
2 (Multi-context shared driver_data) NC-2 + IMP-2
3 (Surface ID lookup fails) Handled via standard SURFACE() NULL check + error return
4 (picture_width vs surface->width/height) Pre-existing quirk, β doesn't introduce
5 (destination_data fill ordering) NC-3
6 (ohm backwards compat) Safe — MPEG-2 gets explicit format that hantro previously substituted; H.264 unchanged
7 (CreateContext without DestroyContext between) NC-2
8 (mpv probe-then-real) NC-4 (depends on CRIT-2 fix)
9 (destination_* transplant correctness) Mechanical; preserve all 3 branches of video_format->v4l2_buffers_count switch
10 (CreateConfig error paths) object_heap_allocate zeroes the struct; if CreateConfig returns ERROR before reaching the pixelformat assignment, the field is 0; CreateContext's if (pixelformat == 0) defensive check catches it

Summary scorecard

Clause Status Finding
C1 (codec.h/c helper) PASS Identical to reverted commit; no issue
C2 (CreateConfig wire-up) PASS Correct
C3 (strip surface.c) CONDITIONAL Must land with CRIT-2 fix
C4 (CreateContext new flow) FAIL: CRIT-1 + IMP-1 NULL guard removal; probe order before S_FMT
C5 (DestroyContext: "no change") FAIL: CRIT-2 Plan said "no change needed" — wrong; must add request_pool_destroy
C6 (surface.h unchanged) PASS No edit needed
C7 (delete last_output_*) PASS Recommend extending to delete surface_reset_format_cache too (IMP-2)
C8 (Phase 7 verification) PASS Same methodology
C9 (Phase 8 close) CONDITIONAL Depends on CRIT fixes

Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes.

Required amendments before Phase 6

  1. CRIT-1: Remove video_format == NULL guard at context.c:65-66. Move CAPTURE-format probe to top of CreateContext. Derive output_type/capture_type after probe.

  2. CRIT-2: Add request_pool_destroy(&driver_data->output_pool) to DestroyContext BEFORE v4l2_request_buffers(OUTPUT, 0). Mirror surface.c:219-220's existing pattern.

  3. IMP-1: Reorder C4 — CAPTURE probe FIRST, then derive output_type from video_format->v4l2_mplane, then S_FMT(OUTPUT). Eliminates output_type_pre hack.

  4. IMP-2 (folded into C7): Delete surface_reset_format_cache (function + call in DestroyContext + the stale comment block). Field deletion + function deletion happen together.

  5. C3 + CRIT-2 commit ordering: Phase 6 must commit C3 (strip surface.c) and CRIT-2 fix (DestroyContext add) together or strictly sequenced. Cannot commit C3 alone — it would leave NO call site for request_pool_destroy in per-session teardown.

Author re-verification (post-review, 2026-05-12)

Per feedback_review_empirical_over_theoretical.md Direction 2:

CRIT-1 — CONFIRMED at code level

context.c:64-66 verified by direct read. The early-return guarantees first-CreateContext failure under β.

CRIT-2 — CONFIRMED at code level

Grep of request_pool_destroy callsites: only surface.c:220 and request.c:361 (Terminate) plus internal request_pool.c:93. β strips surface.c:220. Without amendment, ZERO per-session callsites.

IMP-1, IMP-2 — CONFIRMED

mplane=false case exists for SUNXI_TILED_NV12 (video.c entries). surface_reset_format_cache reads only last_output_width/height — fully dead under C7's field deletion.

Lessons

The first Phase 5 review (of α') cleared a plan that then failed Phase 7. The second Phase 5 review (this one) caught two architectural gaps before Phase 6. Empirical-over-theoretical worked: the reviewer read the source to verify "DestroyContext does the right teardown" — the plan claimed it did, but the source disagreed.

The author (Phase 4 v2 writer, me) read context.c:276-307 and noted "STREAMOFF + REQBUFS(0) + cap_pool_destroy at context.c:276-307. No change needed." — missing that request_pool_destroy is absent. Surface-level read vs full empirical traversal.

Memory rule reinforcement

feedback_review_empirical_over_theoretical.md (Direction 2): the reviewer here applied it correctly by enumerating request_pool_destroy callsites via grep before claiming the function existed only in surface.c. The author should apply Direction 1 in writing future plans: don't claim "no change needed" without enumerating call/use sites empirically.

Worth a fresh memory rule? Maybe: "Before writing 'no change needed' in a plan, grep for the symbol/function the change would touch. List the call sites. Confirm the change wouldn't break any of them." Pin this lesson when iter5b closes.