iter5b Phase 5 v2: 2 CRIT findings — NULL guard + missing request_pool_destroy

CRIT-1: context.c:64-66 video_format==NULL guard rejects every first
β CreateContext. β moves the probe from CreateSurfaces2 into
CreateContext itself, so the guard fires before any new logic runs.
Fix: remove guard, move CAPTURE probe to top of CreateContext.

CRIT-2: DestroyContext lacks request_pool_destroy. Empirical grep
shows only surface.c:220 (which β strips) calls it per-session.
Without amendment, second CreateContext gets pool->initialized=true
with stale slot pointers → QBUF EINVAL. Fix: add request_pool_destroy
to DestroyContext before REQBUFS(0). C3 (surface.c strip) and CRIT-2
fix MUST land together.

Plus IMP-1 (mplane assumption wrong for SUNXI_TILED_NV12) + IMP-2
(surface_reset_format_cache becomes dead under C7) + IMP-3 (error
recovery comment).

Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes. Author confirmed
both at code level — Phase 5 caught what Phase 4 v2's surface read
missed ("DestroyContext teardown — no change needed" — wrong; was
incomplete).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-12 13:50:08 +00:00
parent 5abea730a0
commit 3508a2cfeb
+173
View File
@@ -0,0 +1,173 @@
# Iteration 5b — Phase 5 review v2 (of option-β plan)
Phase 5 architect review of `phase4_iter5b_plan_v2.md` by sonnet-architect, 2026-05-12. Second Phase 5 pass for iter5b after the first Phase 5 cleared α' (which then failed at Phase 7).
## Executive summary
Option β's architectural choice is correct — the α' failure mode is genuinely eliminated. CreateContext is the right lifecycle site because `config_id` is unambiguously available there. **But two Critical blocking issues exist** that the plan glossed over: (1) `video_format == NULL` guard at context.c:64-66 rejects every first β CreateContext; (2) DestroyContext lacks `request_pool_destroy`, leaving stale output_pool state across context cycles. Stripping surface.c (C3) without fixing both makes the second CreateContext fail in a worse way than α' did.
Plus three Important findings and four Non-Amendment confirmations.
## Critical findings (BLOCKING Phase 6)
### CRIT-1 — `video_format == NULL` guard at context.c:64-66 will reject every first β CreateContext
In current code:
```c
video_format = driver_data->video_format; // context.c:64
if (video_format == NULL) // context.c:65
return VA_STATUS_ERROR_OPERATION_FAILED; // context.c:66
output_type = v4l2_type_video_output(video_format->v4l2_mplane); // :68
capture_type = v4l2_type_video_capture(video_format->v4l2_mplane); // :69
```
Pre-β: CreateSurfaces2 (surface.c:253) sets `driver_data->video_format` before CreateContext runs. Under β, that probe MOVES into CreateContext. First CreateContext call: `driver_data->video_format == NULL` → early return.
**Required fix**: Remove the early-return at lines 65-66. Move CAPTURE-format probe (which sets `driver_data->video_format`) to the top of CreateContext. Derive `output_type`/`capture_type` after the probe.
### CRIT-2 — `request_pool_destroy` missing from DestroyContext: second CreateContext re-uses stale pool
DestroyContext does:
- v4l2_set_stream(OUTPUT, false) (line 276)
- v4l2_set_stream(CAPTURE, false) (line 280)
- RequestDestroySurfaces (line 286)
- v4l2_request_buffers(OUTPUT, 0) (line 296) — REQBUFS(0) frees kernel-side OUTPUT buffers
- cap_pool_destroy (line 306-307) — frees CAPTURE side
- surface_reset_format_cache (line 319)
**Missing**: `request_pool_destroy(&driver_data->output_pool)`. Empirical grep:
```
src/request_pool.c:93: request_pool_destroy(pool); # internal error path
src/request_pool.c:97:void request_pool_destroy(...) # definition
src/request.c:361: request_pool_destroy(...) # Terminate
src/surface.c:220: request_pool_destroy(...) # resolution-change in CreateSurfaces2
```
Only surface.c:220 calls it from a per-session path — and β strips that line.
`request_pool_init` at request_pool.c:35-36 has `if (pool->initialized) return 0;`. Under β:
1. First CreateContext: request_pool_init succeeds, pool->initialized=true.
2. DestroyContext: REQBUFS(0) frees kernel buffers BUT pool->initialized stays true, slot mmap pointers stay populated.
3. Second CreateContext: request_pool_init sees initialized=true, returns 0 — pool retains stale slot indices pointing at kernel buffers that no longer exist.
4. Next QBUF on any slot → EINVAL.
**Required fix**: Add `request_pool_destroy(&driver_data->output_pool)` to DestroyContext BEFORE `v4l2_request_buffers(OUTPUT, 0)`. Order matters: munmap (request_pool_destroy) first, then REQBUFS(0) (kernel buffer free). Mirror surface.c:219-220's existing `if (output_pool.initialized) request_pool_destroy(...)` shape.
C3 (surface.c strip) and CRIT-2 fix (DestroyContext add) MUST land in the same commit or strictly sequenced.
## Important findings
### IMP-1 — `output_type` mplane assumption: hardcoded `v4l2_type_video_output(true)` is wrong for SUNXI_TILED_NV12
Plan's C4 opens with `output_type_pre = v4l2_type_video_output(true);` to set OUTPUT format before CAPTURE probe. Works on fresnel (NV12 = mplane=true) but fails on cedrus/Allwinner targets where SUNXI_TILED_NV12 has v4l2_mplane=false.
**Fix**: Reorder — probe CAPTURE first (the find_format calls don't require OUTPUT to be set), derive output_type from `video_format->v4l2_mplane`, THEN S_FMT(OUTPUT). Eliminates the `output_type_pre` hack.
Empirical correctness: `v4l2_find_format` enumerates available formats without requiring any S_FMT first — confirmed by surface.c:238-253 doing exactly that probe.
### IMP-2 — `driver_data->video_format` retained across DestroyContext; `surface_reset_format_cache` becomes dead
`video_format` is a pointer into static `formats[]` in video.c, never freed. Skipping the probe on second CreateContext via `if (!driver_data->video_format)` is intentional and correct.
But: DestroyContext's `surface_reset_format_cache` call (line 319) and its comment (lines 309-318) reference `last_output_width/height` which C7 deletes. Under β, `surface_reset_format_cache` is empty/dead.
**Fix**: C7 should also delete `surface_reset_format_cache` and the call in DestroyContext, plus update the comment to reflect β's CreateContext-centric lifecycle.
### IMP-3 — Error recovery in CreateContext (β): partial state on S_FMT-succeeds + cap_pool_init-fails
If S_FMT(OUTPUT) succeeds then cap_pool_init fails, the existing error path (context.c:244-252) frees ids+context_object but leaves OUTPUT format set. cap_pool's own error_cleanup at cap_pool.c:125-137 handles its own REQBUFS(0), so cap_pool failure is self-recovering. OUTPUT S_FMT is not destructive when no buffers exist yet.
**Safe but messy.** Phase 6 should add a comment in the error path noting what state remains, so future readers don't think the rollback is complete.
## Non-amendment findings
### NC-1 — α' failure mode eliminated by β: CONFIRMED
α' crash: CreateSurfaces2 fires mid-stream → cap_pool_destroy → stale destination_data → SIGSEGV. Under β: CreateSurfaces2 touches no V4L2 device state and allocates no V4L2 buffers. No cap_pool_destroy from CreateSurfaces2. Multi-CreateSurfaces2 mid-stream is harmless.
### NC-2 — Multi-context without intervening DestroyContext: SAFE
Second CreateContext sees STREAMON-on-OUTPUT EBUSY → returns OPERATION_FAILED. No silent corruption. Real consumers don't do this anyway.
### NC-3 — Field partitioning for surface_bind_slot: CORRECT
Format-uniform fields (`destination_planes_count`, `destination_sizes`, `destination_offsets`, `destination_bytesperlines`, `destination_buffers_count`) move to CreateContext (β). Per-slot fields (`destination_index`, `destination_map[]`, `destination_data[]`) stay at BeginPicture's surface_bind_slot. The split is exactly what surface.h:63-72 documents.
Minor: `destination_buffers_count` is also set at surface_bind_slot (line 101). Redundant but harmless (slot->buffers_count matches video_format->v4l2_buffers_count).
### NC-4 — mpv probe-then-real pattern: CORRECT under β with CRIT-2 fix
Probe-CreateContext → DestroyContext (with CRIT-2 fix: full pool teardown) → Real-CreateContext. Clean rebuild at the new resolution. Without CRIT-2 fix this path breaks; with CRIT-2 it works.
## Plan-specific concern answers
| Concern | Verdict |
|---|---|
| 1 (Error-recovery branches) | IMP-3 |
| 2 (Multi-context shared driver_data) | NC-2 + IMP-2 |
| 3 (Surface ID lookup fails) | Handled via standard SURFACE() NULL check + error return |
| 4 (picture_width vs surface->width/height) | Pre-existing quirk, β doesn't introduce |
| 5 (destination_data fill ordering) | NC-3 |
| 6 (ohm backwards compat) | Safe — MPEG-2 gets explicit format that hantro previously substituted; H.264 unchanged |
| 7 (CreateContext without DestroyContext between) | NC-2 |
| 8 (mpv probe-then-real) | NC-4 (depends on CRIT-2 fix) |
| 9 (destination_* transplant correctness) | Mechanical; preserve all 3 branches of `video_format->v4l2_buffers_count` switch |
| 10 (CreateConfig error paths) | object_heap_allocate zeroes the struct; if CreateConfig returns ERROR before reaching the pixelformat assignment, the field is 0; CreateContext's `if (pixelformat == 0)` defensive check catches it |
## Summary scorecard
| Clause | Status | Finding |
|---|---|---|
| C1 (codec.h/c helper) | PASS | Identical to reverted commit; no issue |
| C2 (CreateConfig wire-up) | PASS | Correct |
| C3 (strip surface.c) | CONDITIONAL | Must land with CRIT-2 fix |
| C4 (CreateContext new flow) | FAIL: CRIT-1 + IMP-1 | NULL guard removal; probe order before S_FMT |
| C5 (DestroyContext: "no change") | **FAIL: CRIT-2** | Plan said "no change needed" — wrong; must add request_pool_destroy |
| C6 (surface.h unchanged) | PASS | No edit needed |
| C7 (delete last_output_*) | PASS | Recommend extending to delete surface_reset_format_cache too (IMP-2) |
| C8 (Phase 7 verification) | PASS | Same methodology |
| C9 (Phase 8 close) | CONDITIONAL | Depends on CRIT fixes |
**Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes.**
## Required amendments before Phase 6
1. **CRIT-1**: Remove `video_format == NULL` guard at context.c:65-66. Move CAPTURE-format probe to top of CreateContext. Derive output_type/capture_type after probe.
2. **CRIT-2**: Add `request_pool_destroy(&driver_data->output_pool)` to DestroyContext BEFORE `v4l2_request_buffers(OUTPUT, 0)`. Mirror surface.c:219-220's existing pattern.
3. **IMP-1**: Reorder C4 — CAPTURE probe FIRST, then derive output_type from `video_format->v4l2_mplane`, then S_FMT(OUTPUT). Eliminates `output_type_pre` hack.
4. **IMP-2 (folded into C7)**: Delete `surface_reset_format_cache` (function + call in DestroyContext + the stale comment block). Field deletion + function deletion happen together.
5. **C3 + CRIT-2 commit ordering**: Phase 6 must commit C3 (strip surface.c) and CRIT-2 fix (DestroyContext add) together or strictly sequenced. Cannot commit C3 alone — it would leave NO call site for request_pool_destroy in per-session teardown.
## Author re-verification (post-review, 2026-05-12)
Per `feedback_review_empirical_over_theoretical.md` Direction 2:
### CRIT-1 — CONFIRMED at code level
context.c:64-66 verified by direct read. The early-return guarantees first-CreateContext failure under β.
### CRIT-2 — CONFIRMED at code level
Grep of `request_pool_destroy` callsites: only surface.c:220 and request.c:361 (Terminate) plus internal request_pool.c:93. β strips surface.c:220. Without amendment, ZERO per-session callsites.
### IMP-1, IMP-2 — CONFIRMED
mplane=false case exists for SUNXI_TILED_NV12 (video.c entries). `surface_reset_format_cache` reads only `last_output_width/height` — fully dead under C7's field deletion.
### Lessons
The first Phase 5 review (of α') cleared a plan that then failed Phase 7. The second Phase 5 review (this one) caught two architectural gaps before Phase 6. Empirical-over-theoretical worked: the reviewer read the source to verify "DestroyContext does the right teardown" — the plan claimed it did, but the source disagreed.
The author (Phase 4 v2 writer, me) read context.c:276-307 and noted "STREAMOFF + REQBUFS(0) + cap_pool_destroy at context.c:276-307. No change needed." — missing that `request_pool_destroy` is absent. Surface-level read vs full empirical traversal.
### Memory rule reinforcement
`feedback_review_empirical_over_theoretical.md` (Direction 2): the reviewer here applied it correctly by enumerating `request_pool_destroy` callsites via grep before claiming the function existed only in surface.c. The author should apply Direction 1 in writing future plans: don't claim "no change needed" without enumerating call/use sites empirically.
Worth a fresh memory rule? Maybe: **"Before writing 'no change needed' in a plan, grep for the symbol/function the change would touch. List the call sites. Confirm the change wouldn't break any of them."** Pin this lesson when iter5b closes.