Files
fresnel-fourier/phase5_iter5b_review_v2.md
T
marfrit 3508a2cfeb iter5b Phase 5 v2: 2 CRIT findings — NULL guard + missing request_pool_destroy
CRIT-1: context.c:64-66 video_format==NULL guard rejects every first
β CreateContext. β moves the probe from CreateSurfaces2 into
CreateContext itself, so the guard fires before any new logic runs.
Fix: remove guard, move CAPTURE probe to top of CreateContext.

CRIT-2: DestroyContext lacks request_pool_destroy. Empirical grep
shows only surface.c:220 (which β strips) calls it per-session.
Without amendment, second CreateContext gets pool->initialized=true
with stale slot pointers → QBUF EINVAL. Fix: add request_pool_destroy
to DestroyContext before REQBUFS(0). C3 (surface.c strip) and CRIT-2
fix MUST land together.

Plus IMP-1 (mplane assumption wrong for SUNXI_TILED_NV12) + IMP-2
(surface_reset_format_cache becomes dead under C7) + IMP-3 (error
recovery comment).

Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes. Author confirmed
both at code level — Phase 5 caught what Phase 4 v2's surface read
missed ("DestroyContext teardown — no change needed" — wrong; was
incomplete).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:50:08 +00:00

174 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 5b — Phase 5 review v2 (of option-β plan)
Phase 5 architect review of `phase4_iter5b_plan_v2.md` by sonnet-architect, 2026-05-12. Second Phase 5 pass for iter5b after the first Phase 5 cleared α' (which then failed at Phase 7).
## Executive summary
Option β's architectural choice is correct — the α' failure mode is genuinely eliminated. CreateContext is the right lifecycle site because `config_id` is unambiguously available there. **But two Critical blocking issues exist** that the plan glossed over: (1) `video_format == NULL` guard at context.c:64-66 rejects every first β CreateContext; (2) DestroyContext lacks `request_pool_destroy`, leaving stale output_pool state across context cycles. Stripping surface.c (C3) without fixing both makes the second CreateContext fail in a worse way than α' did.
Plus three Important findings and four Non-Amendment confirmations.
## Critical findings (BLOCKING Phase 6)
### CRIT-1 — `video_format == NULL` guard at context.c:64-66 will reject every first β CreateContext
In current code:
```c
video_format = driver_data->video_format; // context.c:64
if (video_format == NULL) // context.c:65
return VA_STATUS_ERROR_OPERATION_FAILED; // context.c:66
output_type = v4l2_type_video_output(video_format->v4l2_mplane); // :68
capture_type = v4l2_type_video_capture(video_format->v4l2_mplane); // :69
```
Pre-β: CreateSurfaces2 (surface.c:253) sets `driver_data->video_format` before CreateContext runs. Under β, that probe MOVES into CreateContext. First CreateContext call: `driver_data->video_format == NULL` → early return.
**Required fix**: Remove the early-return at lines 65-66. Move CAPTURE-format probe (which sets `driver_data->video_format`) to the top of CreateContext. Derive `output_type`/`capture_type` after the probe.
### CRIT-2 — `request_pool_destroy` missing from DestroyContext: second CreateContext re-uses stale pool
DestroyContext does:
- v4l2_set_stream(OUTPUT, false) (line 276)
- v4l2_set_stream(CAPTURE, false) (line 280)
- RequestDestroySurfaces (line 286)
- v4l2_request_buffers(OUTPUT, 0) (line 296) — REQBUFS(0) frees kernel-side OUTPUT buffers
- cap_pool_destroy (line 306-307) — frees CAPTURE side
- surface_reset_format_cache (line 319)
**Missing**: `request_pool_destroy(&driver_data->output_pool)`. Empirical grep:
```
src/request_pool.c:93: request_pool_destroy(pool); # internal error path
src/request_pool.c:97:void request_pool_destroy(...) # definition
src/request.c:361: request_pool_destroy(...) # Terminate
src/surface.c:220: request_pool_destroy(...) # resolution-change in CreateSurfaces2
```
Only surface.c:220 calls it from a per-session path — and β strips that line.
`request_pool_init` at request_pool.c:35-36 has `if (pool->initialized) return 0;`. Under β:
1. First CreateContext: request_pool_init succeeds, pool->initialized=true.
2. DestroyContext: REQBUFS(0) frees kernel buffers BUT pool->initialized stays true, slot mmap pointers stay populated.
3. Second CreateContext: request_pool_init sees initialized=true, returns 0 — pool retains stale slot indices pointing at kernel buffers that no longer exist.
4. Next QBUF on any slot → EINVAL.
**Required fix**: Add `request_pool_destroy(&driver_data->output_pool)` to DestroyContext BEFORE `v4l2_request_buffers(OUTPUT, 0)`. Order matters: munmap (request_pool_destroy) first, then REQBUFS(0) (kernel buffer free). Mirror surface.c:219-220's existing `if (output_pool.initialized) request_pool_destroy(...)` shape.
C3 (surface.c strip) and CRIT-2 fix (DestroyContext add) MUST land in the same commit or strictly sequenced.
## Important findings
### IMP-1 — `output_type` mplane assumption: hardcoded `v4l2_type_video_output(true)` is wrong for SUNXI_TILED_NV12
Plan's C4 opens with `output_type_pre = v4l2_type_video_output(true);` to set OUTPUT format before CAPTURE probe. Works on fresnel (NV12 = mplane=true) but fails on cedrus/Allwinner targets where SUNXI_TILED_NV12 has v4l2_mplane=false.
**Fix**: Reorder — probe CAPTURE first (the find_format calls don't require OUTPUT to be set), derive output_type from `video_format->v4l2_mplane`, THEN S_FMT(OUTPUT). Eliminates the `output_type_pre` hack.
Empirical correctness: `v4l2_find_format` enumerates available formats without requiring any S_FMT first — confirmed by surface.c:238-253 doing exactly that probe.
### IMP-2 — `driver_data->video_format` retained across DestroyContext; `surface_reset_format_cache` becomes dead
`video_format` is a pointer into static `formats[]` in video.c, never freed. Skipping the probe on second CreateContext via `if (!driver_data->video_format)` is intentional and correct.
But: DestroyContext's `surface_reset_format_cache` call (line 319) and its comment (lines 309-318) reference `last_output_width/height` which C7 deletes. Under β, `surface_reset_format_cache` is empty/dead.
**Fix**: C7 should also delete `surface_reset_format_cache` and the call in DestroyContext, plus update the comment to reflect β's CreateContext-centric lifecycle.
### IMP-3 — Error recovery in CreateContext (β): partial state on S_FMT-succeeds + cap_pool_init-fails
If S_FMT(OUTPUT) succeeds then cap_pool_init fails, the existing error path (context.c:244-252) frees ids+context_object but leaves OUTPUT format set. cap_pool's own error_cleanup at cap_pool.c:125-137 handles its own REQBUFS(0), so cap_pool failure is self-recovering. OUTPUT S_FMT is not destructive when no buffers exist yet.
**Safe but messy.** Phase 6 should add a comment in the error path noting what state remains, so future readers don't think the rollback is complete.
## Non-amendment findings
### NC-1 — α' failure mode eliminated by β: CONFIRMED
α' crash: CreateSurfaces2 fires mid-stream → cap_pool_destroy → stale destination_data → SIGSEGV. Under β: CreateSurfaces2 touches no V4L2 device state and allocates no V4L2 buffers. No cap_pool_destroy from CreateSurfaces2. Multi-CreateSurfaces2 mid-stream is harmless.
### NC-2 — Multi-context without intervening DestroyContext: SAFE
Second CreateContext sees STREAMON-on-OUTPUT EBUSY → returns OPERATION_FAILED. No silent corruption. Real consumers don't do this anyway.
### NC-3 — Field partitioning for surface_bind_slot: CORRECT
Format-uniform fields (`destination_planes_count`, `destination_sizes`, `destination_offsets`, `destination_bytesperlines`, `destination_buffers_count`) move to CreateContext (β). Per-slot fields (`destination_index`, `destination_map[]`, `destination_data[]`) stay at BeginPicture's surface_bind_slot. The split is exactly what surface.h:63-72 documents.
Minor: `destination_buffers_count` is also set at surface_bind_slot (line 101). Redundant but harmless (slot->buffers_count matches video_format->v4l2_buffers_count).
### NC-4 — mpv probe-then-real pattern: CORRECT under β with CRIT-2 fix
Probe-CreateContext → DestroyContext (with CRIT-2 fix: full pool teardown) → Real-CreateContext. Clean rebuild at the new resolution. Without CRIT-2 fix this path breaks; with CRIT-2 it works.
## Plan-specific concern answers
| Concern | Verdict |
|---|---|
| 1 (Error-recovery branches) | IMP-3 |
| 2 (Multi-context shared driver_data) | NC-2 + IMP-2 |
| 3 (Surface ID lookup fails) | Handled via standard SURFACE() NULL check + error return |
| 4 (picture_width vs surface->width/height) | Pre-existing quirk, β doesn't introduce |
| 5 (destination_data fill ordering) | NC-3 |
| 6 (ohm backwards compat) | Safe — MPEG-2 gets explicit format that hantro previously substituted; H.264 unchanged |
| 7 (CreateContext without DestroyContext between) | NC-2 |
| 8 (mpv probe-then-real) | NC-4 (depends on CRIT-2 fix) |
| 9 (destination_* transplant correctness) | Mechanical; preserve all 3 branches of `video_format->v4l2_buffers_count` switch |
| 10 (CreateConfig error paths) | object_heap_allocate zeroes the struct; if CreateConfig returns ERROR before reaching the pixelformat assignment, the field is 0; CreateContext's `if (pixelformat == 0)` defensive check catches it |
## Summary scorecard
| Clause | Status | Finding |
|---|---|---|
| C1 (codec.h/c helper) | PASS | Identical to reverted commit; no issue |
| C2 (CreateConfig wire-up) | PASS | Correct |
| C3 (strip surface.c) | CONDITIONAL | Must land with CRIT-2 fix |
| C4 (CreateContext new flow) | FAIL: CRIT-1 + IMP-1 | NULL guard removal; probe order before S_FMT |
| C5 (DestroyContext: "no change") | **FAIL: CRIT-2** | Plan said "no change needed" — wrong; must add request_pool_destroy |
| C6 (surface.h unchanged) | PASS | No edit needed |
| C7 (delete last_output_*) | PASS | Recommend extending to delete surface_reset_format_cache too (IMP-2) |
| C8 (Phase 7 verification) | PASS | Same methodology |
| C9 (Phase 8 close) | CONDITIONAL | Depends on CRIT fixes |
**Phase 6 BLOCKED pending CRIT-1 + CRIT-2 fixes.**
## Required amendments before Phase 6
1. **CRIT-1**: Remove `video_format == NULL` guard at context.c:65-66. Move CAPTURE-format probe to top of CreateContext. Derive output_type/capture_type after probe.
2. **CRIT-2**: Add `request_pool_destroy(&driver_data->output_pool)` to DestroyContext BEFORE `v4l2_request_buffers(OUTPUT, 0)`. Mirror surface.c:219-220's existing pattern.
3. **IMP-1**: Reorder C4 — CAPTURE probe FIRST, then derive output_type from `video_format->v4l2_mplane`, then S_FMT(OUTPUT). Eliminates `output_type_pre` hack.
4. **IMP-2 (folded into C7)**: Delete `surface_reset_format_cache` (function + call in DestroyContext + the stale comment block). Field deletion + function deletion happen together.
5. **C3 + CRIT-2 commit ordering**: Phase 6 must commit C3 (strip surface.c) and CRIT-2 fix (DestroyContext add) together or strictly sequenced. Cannot commit C3 alone — it would leave NO call site for request_pool_destroy in per-session teardown.
## Author re-verification (post-review, 2026-05-12)
Per `feedback_review_empirical_over_theoretical.md` Direction 2:
### CRIT-1 — CONFIRMED at code level
context.c:64-66 verified by direct read. The early-return guarantees first-CreateContext failure under β.
### CRIT-2 — CONFIRMED at code level
Grep of `request_pool_destroy` callsites: only surface.c:220 and request.c:361 (Terminate) plus internal request_pool.c:93. β strips surface.c:220. Without amendment, ZERO per-session callsites.
### IMP-1, IMP-2 — CONFIRMED
mplane=false case exists for SUNXI_TILED_NV12 (video.c entries). `surface_reset_format_cache` reads only `last_output_width/height` — fully dead under C7's field deletion.
### Lessons
The first Phase 5 review (of α') cleared a plan that then failed Phase 7. The second Phase 5 review (this one) caught two architectural gaps before Phase 6. Empirical-over-theoretical worked: the reviewer read the source to verify "DestroyContext does the right teardown" — the plan claimed it did, but the source disagreed.
The author (Phase 4 v2 writer, me) read context.c:276-307 and noted "STREAMOFF + REQBUFS(0) + cap_pool_destroy at context.c:276-307. No change needed." — missing that `request_pool_destroy` is absent. Surface-level read vs full empirical traversal.
### Memory rule reinforcement
`feedback_review_empirical_over_theoretical.md` (Direction 2): the reviewer here applied it correctly by enumerating `request_pool_destroy` callsites via grep before claiming the function existed only in surface.c. The author should apply Direction 1 in writing future plans: don't claim "no change needed" without enumerating call/use sites empirically.
Worth a fresh memory rule? Maybe: **"Before writing 'no change needed' in a plan, grep for the symbol/function the change would touch. List the call sites. Confirm the change wouldn't break any of them."** Pin this lesson when iter5b closes.