iter5 Phase 5: review CRIT-1 invalidates Phase 4 — loop back to Phase 0/3

Sonnet-architect review found that the RFC v2 fix mechanism does not
reach the libva backend's consumer path:
- Backend uses V4L2_MEMORY_MMAP for both OUTPUT + CAPTURE buffers.
- For MMAP buffers, vb->planes[].dbuf stays NULL.
- RFC v2 helper's plane loop skips planes with !dbuf, fence attached
  to no dma_resv.
- EXPBUF (vb2_dc_get_dmabuf) creates a fresh disjoint dma_resv.
- The fence-mechanism fix would be a no-op for the cap_pool path even
  if it did reach the right resv, because RequestSyncSurface already
  blocks on media_request_wait_completion + v4l2_dequeue_buffer.

Three alternative root-cause hypotheses for Phase 0/3 to disambiguate:
cache coherency, cap_pool slot-rotation bug, or a separate-sync gap
in vaDeriveImage/vaMapBuffer that bypasses RequestSyncSurface.

Phase 5 saved ~half a session of build-install-test wallclock that
would have ended in a Phase 7 → Phase 0 loopback anyway.

Three Important + 2 Minor findings also recorded for when iter5 reopens.

User pick: loop back to Phase 0/3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-11 10:56:11 +00:00
parent a809e9c0b8
commit 0adfb11fff
+210
View File
@@ -0,0 +1,210 @@
# Iteration 5 — Phase 5 (review)
Phase 5 architect review of `phase4_iter5_plan.md` by sonnet-architect, 2026-05-11. Cold-context source reads performed on: RFC v2 patches at `~/src/linux-rfc/` (branch `vb2-dma-resv-rfc`), vb2 core at `videobuf2-core.c` + `videobuf2-dma-contig.c` in the RFC base tree, v4l2-mem2mem.c, rkvdec.c (staging path on v6.12 base), hantro_v4l2.c, libva backend source at `~/src/libva-multiplanar/libva-v4l2-request-fourier/src/`, and ffmpeg vaapi context at `hwcontext_vaapi.c`. All findings verified empirically by source read before inclusion per `feedback_review_empirical_over_theoretical.md`.
**Overall plan health**: Patches 14 are individually correct in isolation. The completion chain (C1 concern) is empirically sound. The rkvdec opt-in site (C3 concern) matches hantro's shape and converges correctly on `vb2_buffer_done`. One Critical finding invalidates the plan's core fix hypothesis. Three Important findings need amendment before Phase 6. Two Minor items are noted.
## Critical findings (BLOCKING Phase 6)
### CRIT-1 — Fix mechanism does not reach the consumer path: `planes[].dbuf` is NULL for V4L2_MEMORY_MMAP
**Issue**: The RFC v2 helper `vb2_buffer_attach_release_fence()` attaches a fence to `vb->planes[plane].dbuf->resv` for each plane. `planes[plane].dbuf` is set **only** for `V4L2_MEMORY_DMABUF` (imported-fd) buffers, inside `__buf_prepare` at `videobuf2-core.c:1486/1555`. For `V4L2_MEMORY_MMAP` buffers, `planes[plane].dbuf` remains NULL throughout the buffer's lifetime.
**Evidence**`videobuf2-core.c` in the RFC base tree:
- `planes[plane].dbuf` is assigned at line 1486 (`planes[plane].dbuf = dma_buf_get(planes[plane].m.fd)`) inside the `V4L2_MEMORY_DMABUF`-specific branch of `__buf_prepare`. No other assignment exists.
- The helper's plane loop: "for plane in num_planes: if `!dbuf` continue" — all planes are skipped for MMAP buffers.
- The fence IS allocated and stored in `vb->release_fence`, but it is attached to **no** `dma_resv` anywhere.
**Evidence** — libva backend CAPTURE allocation:
- `v4l2.c:246`: `buffers.memory = V4L2_MEMORY_MMAP;` — cap_pool uses MMAP for CAPTURE buffers, verified across all alloc/queue/dequeue call sites (lines 246, 282, 320, 345, 384).
**Evidence** — EXPBUF path creates independent resv:
- `vb2_dc_get_dmabuf` (`videobuf2-dma-contig.c:498`) creates a new `dma_buf` via `dma_buf_export(&exp_info)` with `exp_info.resv = NULL` (not set). When `resv` is NULL, `dma_buf_export` allocates a fresh embedded `dma_resv`. This resv is **disjoint** from any fence the helper might have attached (which it can't, because `planes[].dbuf` is NULL for MMAP).
- Consumer receives this fresh-resv dma_buf via `VIDIOC_EXPBUF` — it sees no fence, no wait.
**Consequence**: After applying all 4 patches and rebooting to `linux-fresnel-fourier 7.0-2`, the Phase 7 sweep will produce the same all-zero hashes for H.264, HEVC, VP9, VP8 that Phase 3 showed — the fence mechanism never fires for the libva backend's cap_pool MMAP path. **Phase 7 verification will fail. Phase 6 should NOT proceed until the fix mechanism is re-evaluated.**
**Resolution options** (design, not implementation):
1. **Move fence attachment to EXPBUF time**: Hook `vb2_core_expbuf` to attach the fence to the newly-created dma_buf's resv at export time, before the fd is handed to userspace. The fence would already be signaled by the time EXPBUF is called (DQBUF completes before EXPBUF in the backend), so this doesn't add wait semantics — but it makes the fix mechanism reach the right resv.
2. **Re-examine whether the race is actually timing-based**: The backend already does DQBUF before reading mmap'd data (`RequestSyncSurface` at `surface.c:500528`). If DQBUF guarantees decode completion before returning, the sync-fence hypothesis may be wrong entirely. The actual cause of all-zero bytes may be a different bug (e.g., wrong buffer index, cache coherency on a different code path, or an async pipeline that bypasses the explicit DQBUF wait).
3. **Validate the hantro fix path empirically**: The hantro commit message says validated on PineTab2 with KWin. KWin uses Wayland DMA-BUF import (possibly `V4L2_MEMORY_DMABUF` path via GBM/Mesa, not MMAP + EXPBUF). Verify whether the PineTab2 validation covers the same memory path as the libva backend's cap_pool.
**The author must resolve which path actually produces all-zero bytes before Phase 6 can land the kernel patches. If the backend is already synchronous on DQBUF, the kernel-side fence patch does not address the root cause.**
## Important findings (incorporate into plan before Phase 6)
### IMP-1 — Version naming conflict: plan uses "7.0-2" but Phase 0 criterion 2 locks "7.0.kafr2"
**Locations**:
- Phase 0 criterion 2: "tagged `linux-fresnel-fourier 7.0.kafr2` per the fleet manifest's versioning scheme."
- Phase 2 situation: "Package versioning: `${baseline_ref}.kafr${pkgrel}`. iter5 produces `7.0.kafr2`."
- Phase 4 plan C7/C8/C9/C10: uses `7.0-2` throughout (Arch Linux pkgrel convention).
**Conflict**: "7.0-2" implies `pkgver=7.0, pkgrel=2` (Arch makepkg convention). "7.0.kafr2" implies `pkgver=7.0.kafr2, pkgrel=1` (kernel-agent naming scheme). These produce different `pacman -Q` outputs and different artifact filenames. The Phase 7 verification check at C10 (`pacman -Q linux-fresnel-fourier` expecting "linux-fresnel-fourier 7.0-2") is correct only if the Arch convention is used.
**Resolution**: Decide before Phase 6 which naming scheme the PKGBUILD uses and reconcile Phase 4 C7C10 and Phase 0 criterion 2 to match. If the current 7.0-1 kernel was built with `pkgver=7.0 pkgrel=1`, then Phase 4's "7.0-2" (pkgrel bump) is the correct Arch convention and Phase 0 criterion 2 is inaccurate. Grep the actual PKGBUILD on boltzmann to confirm.
### IMP-2 — Scope tag inconsistency: C5 contradicts Phase 0 criterion 2 and prior manifest comment
**Locations**:
- Phase 0 criterion 2 (locked): "Patches are scope-tagged per agent discipline (Hard rule #7): vb2 helper under `subsystem/media/videobuf2/`, three consumer opt-ins under `subsystem/media/<consumer>/`."
- Existing manifest exclusion comment (pre-iter5): `subsystem/media/videobuf2/dma-resv-release-fence/` (one level deeper than C5).
- Plan C5: `subsystem/media/dma-resv-release-fence/` (flattened, no `videobuf2` level for patch 1).
**Conflict**: Phase 0 is the locked source of truth for criteria. C5 diverges from it. The kernel-agent README's scope-tag convention (Hard rule #7) places vb2 patches under `subsystem/media/videobuf2/`, not directly under `subsystem/media/`. Consumer patches should be under `subsystem/media/verisilicon/` (for hantro), `subsystem/media/rockchip/` (for rga/rkvdec), not all together under `dma-resv-release-fence/`.
**Resolution**: Amend C5 scope tree to match Phase 0:
```
patches/subsystem/media/videobuf2/0001-media-videobuf2-add-dma_resv-release-fence-helper.patch
patches/subsystem/media/verisilicon/0002-media-hantro-attach-dma_resv-release-fence-at-buf_queue.patch
patches/subsystem/media/rockchip/0003-media-rockchip-rga-attach-dma_resv-release-fence-at-buf_queue.patch
patches/subsystem/media/rockchip/0004-media-rkvdec-attach-dma_resv-release-fence-at-buf_queue.patch
```
Also amend C6 manifest update to use these deeper paths. Verify against kernel-agent README for hard rule #7 scope-tag examples.
### IMP-3 — C11 verification matrix accepts "libva==kdirect" without requiring confirmation that the fix mechanism actually fires
**Issue**: The Phase 7 PASS criterion requires `libva==kdirect` for all 5 codecs. This is the right metric, but if CRIT-1 is not resolved (fix doesn't reach the consumer), Phase 7 will observe the opposite: `libva != kdirect` for 4 codecs, same as Phase 3. The matrix as written doesn't include a kernel-side verification step that confirms fences are actually being attached and signaled.
**Proposed amendment**: Add a Phase 7 pre-check step to confirm the fence mechanism fires before running the full sweep:
```bash
# On fresnel post-7.0-2 install: confirm vb2_buffer_attach_release_fence is called
# and that at least one fence is created per decode cycle:
ssh fresnel 'sudo dmesg | grep -i "vb2-release-fence\|videobuf2" | tail -20'
# Or: check /sys/kernel/debug/dma_fence/ for fence context IDs after a decode
ffmpeg -hwaccel vaapi -i bbb_1080p30_h264.mp4 -frames:v 1 -f null -
ssh fresnel 'sudo dmesg | grep -c "vb2"'
```
If the fix mechanism fires but the consumer still sees zeros, that confirms CRIT-1's root cause (fence exists but wrong resv). If no fence activity appears, the patch itself failed to apply.
## Non-amendments (plan is correct)
### Concern 1 — Patch ordering (reviewer question 1): CORRECT
Empirical: `videobuf2-core.c` defines and exports `vb2_buffer_attach_release_fence` via `EXPORT_SYMBOL_GPL` at RFC base `fbe8bf57a`. Consumer patches 2/3/4 call this symbol. Patch 1 must land before patches 2/3/4 in the kernel build. The plan's ordering (0001 helper, 0002 hantro, 0003 rga, 0004 rkvdec) is correct. Both Kbuild and `git am` will apply in numeric order.
### Concern 3 — rkvdec completion path (reviewer question 3): CORRECT
Empirical trace of the rkvdec IRQ completion chain in `rkvdec.c` (staging path, v6.12 base — same structure as v7.0 post-promotion):
```
rkvdec_irq_handler (IRQF_ONESHOT threaded)
→ rkvdec_job_finish (line 968)
→ rkvdec_job_finish_no_pm (line 664)
→ v4l2_m2m_buf_done_and_job_finish (line 653)
→ v4l2_m2m_buf_done (v4l2-mem2mem.h:229, inline)
→ vb2_buffer_done (&buf->vb2_buf, state)
→ vb2_buffer_signal_release_fence (when state != VB2_BUF_STATE_QUEUED)
```
`v4l2_m2m_buf_done` is an inline at `v4l2-mem2mem.h:229`: `vb2_buffer_done(&buf->vb2_buf, state)`. `v4l2_m2m_buf_done_and_job_finish` calls it for both CAPTURE (`dst_buf`, line 526) and OUTPUT (`src_buf`, line 537). The fence signals at decode-DONE, not at QBUF. This matches hantro's semantics.
`dma_fence_signal` is IRQ-safe (documented at `dma-fence.h:173`: "Can be called from irq context"). With `IRQF_ONESHOT`, the handler runs in threaded context — safe to call any IRQ-safe fence API.
### Concern 2 — `(void)` return value shape (reviewer question 2): ACCEPTABLE with caveat
The `(void)vb2_buffer_attach_release_fence(vb)` call shape is correct — the return value is intentionally dropped. For MMAP buffers (all current cap_pool slots), the helper allocates a fence but attaches it to no dma_resv (because `planes[].dbuf == NULL`), then returns 0. -ENOMEM on fence allocation means no fence is created and `vb->release_fence = NULL``vb2_buffer_done` then sees NULL and does nothing (no-op). "Best-effort: lose implicit-sync precision, no functional regression" holds.
The WARN_ON re-queue issue (re-calling `attach_release_fence` on a buffer with `release_fence != NULL`) does not fire in the normal decode path: `vb2_buffer_done(VB2_BUF_STATE_DONE)` sets `release_fence = NULL` before the buffer is dequeued, so it's NULL when QBUF calls `buf_queue` again. The only unsafe path is `vb2_buffer_done(VB2_BUF_STATE_QUEUED)` during start_streaming error recovery — but rkvdec's `rkvdec_stop_streaming` calls `rkvdec_queue_cleanup``v4l2_m2m_buf_done(VB2_BUF_STATE_ERROR)` before the queue is torn down, which signals and clears the fence. The WARN_ON does not fire in practice.
### Concern 6 — OUTPUT-side fence attachment (reviewer question 6): SAFE FOR CURRENT BACKEND
For the libva backend: OUTPUT buffers use `V4L2_MEMORY_MMAP` (confirmed `v4l2.c:246`). `planes[].dbuf == NULL` → helper is a no-op for OUTPUT-side buffers. No interaction with userspace producer writes.
For hypothetical future consumers using `V4L2_MEMORY_DMABUF` for OUTPUT: attaching a release fence to an OUTPUT buffer's resv is semantically dubious (the fence would signal at DQBUF of the compressed bitstream, not at decode completion of the resulting frame). This is a latent semantic issue in the helper design, not in iter5 scope. Not blocking.
### Concern 4 — C5 scope tag depth (reviewer question 4): IMPORTANT (covered in IMP-2 above)
The reviewer agrees the scope tag should match Phase 0 criterion 2. See IMP-2.
### Concern 5 — Phase 7 "libva==kdirect" bar (reviewer question 5): INSUFFICIENT given CRIT-1
"libva==sw" for 4 codecs is stricter but equivalent (since kdirect==sw for those 4). The real gap is that the current test cannot distinguish "fix fired and worked" from "fix fired but wrong resv" or "fix didn't fire at all." See IMP-3 for the proposed pre-check step.
## Minor findings
### MIN-1 — Panfrost comment in C6 manifest remains stale post-iter5
C6 retains the comment "ship together with vb2_dma_resv when it lands" on the panfrost exclusion. After iter5 ships vb2_dma_resv, this comment is inaccurate. The plan acknowledges it ("becomes inaccurate post-iter5 but doesn't gate iter5 close") — correct, but Phase 8 close should update this comment to avoid future confusion.
### MIN-2 — C4 commit message claims Phase 7 validation already performed
The C4 rkvdec patch commit message body includes: "Validated end-to-end on PineBook Pro (RK3399 / Mali-T860 / mainline v7.0 base with this series applied) against fresnel-fourier iter5 verification matrix." This claim is prospective — the validation hasn't happened yet (Phase 7 follows Phase 6). If CRIT-1 is resolved and Phase 7 passes, the commit message is accurate. If Phase 7 fails, the commit message becomes misleading. Recommend removing the "Validated" sentence from C4 prior to Phase 6 push, replacing with "Closes the regression documented in fresnel-fourier iter5 Phase 3 baseline (github.com/…)."
## Summary scorecard for Phase 4 contract clauses
| Clause | Status | Finding |
|---|---|---|
| C1 (vb2 helper patch) | WARNING | Patch is correct; but CRIT-1 means fence never reaches consumer resv for MMAP path |
| C2 (hantro consumer) | WARNING | Same CRIT-1 applies |
| C3 (rga consumer) | PASS | Out-of-scope; patch correct |
| C4 (rkvdec consumer) | WARNING | Completion chain verified correct; CRIT-1 applies; commit msg needs MIN-2 fix |
| C5 (scope tag) | FAIL | Contradicts Phase 0 criterion 2; see IMP-2 |
| C6 (manifest update) | CONDITIONAL | Correct structure; stale comment per MIN-1; path in IMP-2 |
| C7 (PKGBUILD bump) | CONDITIONAL | Correct approach; version naming per IMP-1 |
| C8 (build commands) | PASS | Commands correct; boltzmann availability per risk register |
| C9 (sign+publish) | PASS | Commands correct |
| C10 (install on fresnel) | CONDITIONAL | Commands correct; version string per IMP-1 |
| C11 (verification matrix) | CONDITIONAL | Metric correct; pre-check step needed per IMP-3 |
| C12 (Phase 8 close) | CONDITIONAL | All criteria depend on CRIT-1 resolution |
**Phase 6 is blocked on CRIT-1 resolution.** Author must verify empirically whether the sync-fence mechanism reaches the actual consumer path before investing build/install time on a fix that may not trigger.
## Author re-verification (post-review, 2026-05-11)
Per memory `feedback_review_empirical_over_theoretical.md` Direction 2, the author re-verified each finding before incorporating:
### CRIT-1 — CONFIRMED at the kernel-side code level
- `v4l2.c:246/282/320` all use `V4L2_MEMORY_MMAP` for both OUTPUT and CAPTURE — author re-read; reviewer claim accurate.
- `videobuf2-core.c::__prepare_dmabuf` is the only assignment site for `planes[].dbuf` (verified via grep `planes\[\(plane\|i\)\].dbuf\s*=`).
- `vb2_dc_get_dmabuf` (videobuf2-dma-contig.c:498) does not set `exp_info.resv` — verified empirically. dma_buf_export with NULL resv allocates fresh embedded resv per dma_buf.
### CRIT-1 — DEEPER QUESTION RAISED BY AUTHOR RE-VERIFICATION
Backend's `RequestSyncSurface` (surface.c:455-540) already does:
1. `media_request_wait_completion(request_fd)` — blocks on media-request-API completion.
2. `v4l2_dequeue_buffer(..., capture_type, ...)` — DQBUF blocks until VB2_BUF_STATE_DONE.
Both return only when kernel decoder is finished. The RFC v2 fence mechanism would be a no-op for the libva path even if it DID reach the right resv, because the wait already happened upstream of where the fence would signal.
**This means the Phase 3-observed all-zero pages are not caused by the sync race the RFC v2 patches address.** Three plausible alternative root causes (all require empirical investigation):
1. **Cache coherency** — CPU sees stale pages despite kernel decoder being done; needs explicit cache invalidation before read.
2. **Slot rotation bug** — backend reads from a different cap_pool slot than the one the kernel wrote to (iter4 P7 note: "cap_pool slot rotation between QBUF cycles isn't being driven by the kernel's actual decoded frame data").
3. **Separate-sync gap** — vaDeriveImage / vaMapBuffer doesn't go through RequestSyncSurface; reads happen via a path that bypasses the explicit DQBUF wait.
### IMP-1 — verification pending (need to check actual PKGBUILD)
### IMP-2 — confirmed; reviewer is right that scope-tag should be nested deeper
### IMP-3 — moot once CRIT-1 is resolved (pre-check step is meaningful only if Phase 6 ships)
## Decision
iter5 Phase 4 plan is **rejected**. Phase 5 → Phase 0/3 loopback per `feedback_dev_process.md`. New empirical investigation begins under `phase0_findings_iter5_loopback.md` to root-cause the all-zero pages.
User pick 2026-05-11: "Loop back to Phase 0/3 — re-investigate root cause."
The 4-patch RFC v2 series may still be the right thing to land at some later iteration (it does fix a real race for DMABUF-import consumers like KWin/Mesa), but it is NOT the fix for iter5's observed Bug 2 on the libva-backend hwdownload path. iter5 reopens.
## Phase 5 review track-record acknowledgment
Per memory `feedback_review_empirical_over_theoretical.md`: each review's findings acknowledged at the relevant phase boundary.
| ID | Reviewer claim | Author re-verification | Status |
|---|---|---|---|
| CRIT-1 | RFC v2 fix mechanism doesn't reach MMAP+EXPBUF consumer path | CONFIRMED at code level; plus deeper observation that RequestSyncSurface already waits, making the fence a no-op for the libva path | **CORRECT, BLOCKING** |
| IMP-1 | "7.0-2" vs "7.0.kafr2" naming inconsistency between Phase 0 + Phase 2 + Phase 4 | Confirmed inconsistency; resolution deferred (moot until iter5 reopens) | **CORRECT** |
| IMP-2 | C5 scope tag depth contradicts Phase 0 criterion 2 | Confirmed; matches kernel-agent README's scope-tag convention | **CORRECT** |
| IMP-3 | C11 verification matrix lacks pre-check that fence mechanism fires | Confirmed; moot once CRIT-1 resolves into a different fix path | **CORRECT** |
| MIN-1 | Stale panfrost comment | Confirmed; trivial | **CORRECT** |
| MIN-2 | C4 commit msg claims prospective validation | Confirmed; trivial | **CORRECT** |
All 6 findings empirically correct. Reviewer caught the major plan invalidator BEFORE Phase 6 burned a build cycle on a non-fix.
Estimated savings without the review pass: 1 full Phase 6 build (~30-60 min boltzmann time) + 1 sign+publish + 1 install + 1 Phase 7 sweep that would have produced "all-zero same as Phase 3" → forced Phase 7 → Phase 0 loopback at that point anyway. Phase 5 saved roughly half a session's worth of wallclock + an embarrassing reboot of fresnel for a kernel that wouldn't have fixed anything.