iter5 Phase 2: situation analysis — 4-patch plan (3 RFC v2 + 1 new rkvdec consumer)

Source-read complete: 3 RFC v2 patches dissected, v7.0 rkvdec_buf_queue
site identified at line 954 of drivers/media/platform/rockchip/rkvdec/rkvdec.c,
empirical disproof of Bug 3 UAPI drift via byte-identical v6.12↔v7.0 struct
diff, hantro_v4l2.c confirmed unchanged across the same range.

Rebase risk concentrated in videobuf2-core.c (medium — vb2 core sees regular
activity); deferred to Phase 4 when boltzmann is reachable for the
git apply --3way verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 03:58:07 +00:00
parent 31b9255d63
commit 9941523f1f
+262
@@ -0,0 +1,262 @@
# Iteration 5 — Phase 2 (situation analysis)
Captured 2026-05-10 evening / 2026-05-11 in resume. Closes Phase 2 of iter5 per `feedback_dev_process.md`: source-read of the kernel paths Bug 2 touches, contract-before-patch citation of every site iter5 modifies. The major mid-Phase finding (Bug 3 collapses) was already folded back into Phase 0 (`phase0_findings_iter5.md` amendment, commit `31b9255`).
## Bug 2 — vb2_dma_resv blocker (real, RFC v2 ready)
### Root-cause framing
Per memory `reference_dmabuf_resv_blocker.md`: V4L2 producers don't propagate `VB2_BUF_STATE_DONE` into the dmabuf's `dma_resv` exclusive fence. When userspace consumers (libva backend cap_pool readback path, Wayland compositors, etc.) import a V4L2-produced dmabuf and wait on the implicit-sync fence (`poll(POLLIN)` / `DMA_BUF_IOCTL_EXPORT_SYNC_FILE`), they see either no fences or the already-signalled stub fence from `dma_fence_get_stub()`, so the wait returns immediately regardless of decoder progress.
For our cap_pool readback path, the practical result is: the libva backend reads the dmabuf-backed CAPTURE buffer before the kernel-side decoder IRQ has signalled completion, gets back the page state at QBUF time (the cap_pool init pattern `RGB(0, 0x4c, 0)`), and ships that to ffmpeg-vaapi-hwdownload. The kernel decoded the frame correctly — but the userspace consumer read the page out of order.
Why it surfaces on `linux-fresnel-fourier 7.0-1` and not on `linux-eos-arm 6.19.9-99`: not a regression. The same bug existed on 6.19; iter3 hit it on hantro (memory `reference_dmabuf_resv_blocker.md` documents the symptom as all-zero pages on RK3399 hantro CAPTURE). The shift from "iter3 saw it on hantro only" to "iter4 sees it across all 4 codecs on the new kernel" is most likely **timing**: the new kernel's cap_pool allocation / buffer-handoff path is slightly faster (or slower) than 6.19's, and the userspace race window that was sometimes-closed-sometimes-open on rkvdec at 6.19 is now consistently open at 7.0. iter3 deferred this for hantro; iter4 surfaced it for rkvdec on the new substrate; iter5 fixes it for both blocks at the kernel layer.
### RFC v2 patch series (source: `~/src/linux-rfc/` branch `vb2-dma-resv-rfc`)
Three operator-authored patches on top of `v6.12`:
#### Patch 1/3 — `fbe8bf57a media: videobuf2: add dma_resv release-fence helper`
```
drivers/media/common/videobuf2/videobuf2-core.c | +99
include/media/videobuf2-core.h | +19
```
Adds the opt-in API. Key surface:
- `int vb2_buffer_attach_release_fence(struct vb2_buffer *vb)` — driver calls from buf_queue. Allocates a `dma_fence` on the queue's per-queue fence context (set up at `vb2_core_queue_init`), attaches it as `DMA_RESV_USAGE_WRITE` on each plane's `dmabuf->resv`, stashes in `vb->release_fence`. Skips planes whose `vb2_plane.dbuf` is NULL.
- `vb2_buffer_signal_release_fence(vb, state)` — internal helper called from `vb2_buffer_done()` on state transition. Signals + puts the fence. No-op when `vb->release_fence` is NULL (drivers that didn't opt in).
- New `struct vb2_queue` fields: `u64 dma_resv_fence_context`, `atomic64_t dma_resv_fence_seqno`, `spinlock_t dma_resv_fence_lock`.
- New `struct vb2_buffer` field: `struct dma_fence *release_fence`.
This is the only non-trivial patch in the series — adds ~120 lines of new code in vb2 core. Drivers that don't opt in pay zero cost beyond a few extra struct fields.
#### Patch 2/3 — `14a68fcf0 media: hantro: attach dma_resv release fence at buf_queue`
```
drivers/media/platform/verisilicon/hantro_v4l2.c | +12
```
The driver-side opt-in is one line of code plus a 10-line comment block:
```c
 static void hantro_buf_queue(struct vb2_buffer *vb)
 {
 	...
 	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
+	/*
+	 * Opt in to vb2's dma_resv release-fence path. [...]
+	 */
+	(void)vb2_buffer_attach_release_fence(vb);
 }
```
Per the operator's commit message, this was empirically validated on PineTab2 (RK3566 hantro) running mainline 6.19 with this series backported: KWin's `Transaction::watchDmaBuf` wait completes correctly the moment hantro's IRQ fires.
#### Patch 3/3 — `89b699508 media: rockchip-rga: attach dma_resv release fence at buf_queue`
```
drivers/media/platform/rockchip/rga/rga-buf.c | +10
```
Same shape as the hantro patch. Out-of-scope for iter5's libva path (we don't use RGA), but kept in the kernel-agent local-carry as part of the cohesive series — RGA is referenced by GStreamer flows on Rockchip boards and the operator's intent (per RFC commit message) is to land all three v4l2 producers together.
### Gap — no rkvdec consumer patch
The series ships hantro + rga but **not rkvdec**. iter4 Phase 7 verified Bug 2 hits rkvdec too on the new substrate (constant `0x4c` for H.264 inter + HEVC + VP9 cap_pool reads). iter5 contributes the missing 4th patch.
### Patch 4/4 — `media: rkvdec: attach dma_resv release fence at buf_queue` (NEW, iter5 work)
Target file: `drivers/media/platform/rockchip/rkvdec/rkvdec.c` at v7.0 (post-staging-promotion path; was `drivers/staging/media/rkvdec/` in earlier kernels).
Target function: `rkvdec_buf_queue` at line 954 of `028ef9c96e96 Linux 7.0`:
```c
static void rkvdec_buf_queue(struct vb2_buffer *vb)
{
	struct rkvdec_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);
	struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);
	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
}
```
Patch shape (mechanical, same as hantro patch):
```diff
 static void rkvdec_buf_queue(struct vb2_buffer *vb)
 {
 	struct rkvdec_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);
 	struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);
 	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
+
+	/*
+	 * Opt in to vb2's dma_resv release-fence path. Userspace
+	 * consumers that imported this buffer's dmabuf and wait on
+	 * its implicit-sync fence get a real producer fence
+	 * representing rkvdec's completion, instead of the stub
+	 * fence dma_buf_export_sync_file substitutes when dma_resv
+	 * is empty. Best-effort: a fence-allocation failure means we
+	 * lose implicit-sync precision, no functional regression.
+	 */
+	(void)vb2_buffer_attach_release_fence(vb);
 }
```
Author trailer must preserve attribution discipline per memory `feedback_gitea_as_claude_noether.md`: this is Claude-authored work, sign as `claude-noether <sentinel-email>`, with a `Co-Authored-By:` trailer for the operator if iter5 is reviewed via PR flow. Local-carry-only acceptable per Phase 0 lock.
## Rebase risk — v6.12 base → v7.0 base
The 3 existing RFC v2 patches were authored against v6.12. The kernel-agent product baseline is v7.0 (per `fleet/fresnel.yaml`). Risk surface:
| File | v6.12 → v7.0 delta | Rebase risk |
|---|---|---|
| `drivers/media/common/videobuf2/videobuf2-core.c` | Not measured (boltzmann offline). Expect non-zero delta — vb2 core sees regular activity. | **MEDIUM** — the helper patch adds includes + extends `vb2_buffer_done` + extends `vb2_core_queue_init`. Conflicts possible. Phase 4 task: run `git apply --3way` against v7.0 and resolve. |
| `include/media/videobuf2-core.h` | Not measured. | **LOW** — header changes typically less churn-prone. |
| `drivers/media/platform/verisilicon/hantro_v4l2.c` | Confirmed unchanged v6.12 → v7.0 (boltzmann diff stat showed 0 lines in hantro_v4l2.c). | **LOW** — patch should apply cleanly. |
| `drivers/media/platform/rockchip/rga/rga-buf.c` | Not measured. | **LOW** — rga sees less churn than vb2 core. |
| `drivers/media/platform/rockchip/rkvdec/rkvdec.c` | Not applicable — iter5 is authoring this patch fresh against v7.0. | N/A |
Boltzmann reconnection needed for Phase 4 final rebase verification. Not blocking Phase 2 close.
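Once boltzmann is reachable, the "Not measured" cells in the table above could be filled in mechanically. A minimal sketch, assuming a local clone that contains both refs; `measure_delta` is a hypothetical helper name and not part of any existing kernel-agent tooling:

```bash
# Sketch only: report added/deleted line counts per file between two refs.
# measure_delta is a hypothetical helper, not an existing ka-* verb.
# Usage: measure_delta <repo> <old-ref> <new-ref> <path>...
measure_delta() {
    local repo=$1 old=$2 new=$3 p stat
    shift 3
    for p in "$@"; do
        # --numstat prints "<added>\t<deleted>\t<path>", or nothing at all
        # when the path is identical in both refs.
        stat=$(git -C "$repo" diff --numstat "$old" "$new" -- "$p") || return 1
        if [ -z "$stat" ]; then
            # Caveat: a path absent from both refs also lands here.
            printf 'unchanged\t%s\n' "$p"
        else
            printf '%s\n' "$stat"
        fi
    done
}
```

For the files above this would be called along the lines of `measure_delta ~/src/linux-rockchip v6.12 028ef9c96e96 drivers/media/common/videobuf2/videobuf2-core.c include/media/videobuf2-core.h drivers/media/platform/rockchip/rga/rga-buf.c`. A small line count only bounds the risk; whether the changed hunks overlap the helper patch still needs the actual apply.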
### v4l2_m2m / v4l2-mem2mem rebase note
The hantro + rga patches both insert their opt-in call *after* `v4l2_m2m_buf_queue()`. The rkvdec consumer follows the same shape. If any of these `v4l2_m2m_*` helpers shifted between v6.12 and v7.0 in a way that affects the buf_queue call signature, the patches need updating. Not measured; Phase 4 task.
## Bug 3 — collapsed (UAPI drift hypothesis was wrong)
### Empirical disproof of "UAPI drift" hypothesis
iter4 Phase 7 doc speculated:
> **Hantro `Unable to set control(s)` errors**: a kernel-side rejection on hantro for MPEG-2/VP8. Substrate change appears to have shifted hantro's expected control structure or fields; iter1 (MPEG-2) and iter3 (VP8) were tested on 6.19.9 — UAPI likely drifted between 6.19.9 and 7.0 the same way VP9 did.
Empirical struct-by-struct check 2026-05-10:
```bash
ssh boltzmann 'cd ~/src/linux-rockchip &&
for ref in v6.12 028ef9c96e96; do
echo "===$ref==="
git show $ref:include/uapi/linux/v4l2-controls.h | awk \
"/^struct v4l2_ctrl_(mpeg2_(sequence|picture|quantisation)|vp8_frame) {/{f=1; print; next} f{print; if(\$0~/^};/) f=0}"
done'
```
Result: **byte-identical** struct definitions across v6.12 and v7.0 for:
- `struct v4l2_ctrl_mpeg2_sequence` (8 fields)
- `struct v4l2_ctrl_mpeg2_picture` (8 fields)
- `struct v4l2_ctrl_mpeg2_quantisation` (4 fields)
- `struct v4l2_ctrl_vp8_frame` (30 fields)
Plus the surrounding `drivers/media/v4l2-core/v4l2-ctrls-defs.c` delta was 15 lines, all additions for unrelated controls (FLASH duration, HEVC EXT_SPS_*_RPS, AV1).
So the iter4 hypothesis was wrong — there is no UAPI drift on MPEG-2 or VP8.
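The byte-identical claim can be rechecked without eyeballing two dumps. A minimal sketch, assuming a local clone with both refs; `extract_structs` and `structs_identical` are hypothetical helper names. The alternation keeps `vp8_frame` outside the `mpeg2_` prefix so that `struct v4l2_ctrl_vp8_frame` is actually matched:

```bash
# Sketch only: both function names are hypothetical helpers.
# Print the four struct definitions of interest from a header on stdin.
extract_structs() {
    awk '/^struct v4l2_ctrl_(mpeg2_(sequence|picture|quantisation)|vp8_frame) \{/ { f = 1 }
         f { print }
         f && /^};/ { f = 0 }'
}

# Succeed only when the extracted structs are non-empty and identical
# in both refs.
# Usage: structs_identical <repo> <ref-a> <ref-b>
structs_identical() {
    local repo=$1 ref_a=$2 ref_b=$3 a b
    local hdr=include/uapi/linux/v4l2-controls.h
    a=$(git -C "$repo" show "$ref_a:$hdr" | extract_structs)
    b=$(git -C "$repo" show "$ref_b:$hdr" | extract_structs)
    # The non-empty guard stops two failed extractions from comparing equal.
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Run as `structs_identical ~/src/linux-rockchip v6.12 028ef9c96e96 && echo identical`. The comparison is textual, so a whitespace-only or comment-only change inside a struct would be reported as a difference even though the ABI is unchanged.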
### Actual cause of "Unable to set control(s)"
Re-traced MPEG-2 decode on fresnel 7.0-1 with explicit hantro-decoder env override (`/dev/video2 + /dev/media0` on the 2026-05-10 boot):
```bash
LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_NO_AUTODETECT=1 \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video2 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
ffmpeg -hwaccel vaapi -i bbb_720p10s_mpeg2.ts -frames:v 2 -f null -
```
Captured ioctl trace (`strace -ff -v -x -s 999999`). Sequence of `VIDIOC_S_EXT_CTRLS` submissions on hantro:
| # | ctrl_class | controls | result | meaning |
|---|---|---|---|---|
| 1 | 0 | `0xa40900 H264_DECODE_MODE`, `0xa40901 H264_START_CODE` | **EINVAL** error_idx=2 | Backend probe — fails because hantro doesn't expose H.264 |
| 2 | 0 | `0xa40a95 HEVC_DECODE_MODE`, `0xa40a96 HEVC_START_CODE` | **EINVAL** error_idx=2 | Backend probe — fails because hantro doesn't expose HEVC |
| 3 | `0xf010000` | `0xa409dc MPEG2_SEQUENCE`(12B), `0xa409dd MPEG2_PICTURE`(32B), `0xa409de MPEG2_QUANTISATION`(256B) | **0** | Frame 1 controls accepted |
| 4 | `0xf010000` | same shape | **0** | Frame 2 controls accepted |
| 5..7 | `0xf010000` | same shape, varying timestamps | **0** | More frames |
The init-time H.264 + HEVC probes happen on every device the libva backend binds to. On rkvdec they succeed (rkvdec supports both); on hantro they fail with EINVAL because hantro is MPEG-2 + VP8 only. The EINVAL log lines are cosmetic: actual MPEG-2 (and presumably VP8) frame submission returns 0.
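The accepted-versus-rejected split in the table can be tallied straight from the trace. A minimal sketch, assuming the trace was written to files (for example with strace's `-o` option) and that each call sits on one line ending in the usual `) = 0` or `) = -1 EINVAL (...)` result; `summarize_s_ext_ctrls` is a hypothetical helper name:

```bash
# Sketch only: summarize_s_ext_ctrls is a hypothetical helper.
# Count VIDIOC_S_EXT_CTRLS outcomes in strace output (files or stdin).
summarize_s_ext_ctrls() {
    awk '/VIDIOC_S_EXT_CTRLS/ {
             if ($0 ~ /\) = 0$/)              ok++
             else if ($0 ~ /\) = -1 EINVAL/)  einval++
             else                             other++   # e.g. split "<unfinished ...>" lines
         }
         END { printf "ok=%d einval=%d other=%d\n", ok, einval, other }' "$@"
}
```

A non-zero `other` count means the one-line-per-call assumption did not hold for that trace, and the totals should not be trusted without reading the log.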
### Bug 3 → B4 backlog item, not iter5 scope
This is iter1+ backlog item **B4** ("context.c log suppression for unsupported codec controls"). Cosmetic noise. Doesn't affect functional decode. The actual MPEG-2 + VP8 pixel-output FAIL at iter4 Phase 7 was caused by **Bug 2** (cap_pool readback returning init pattern), identical in shape to the rkvdec case. Fixing Bug 2 fixes MPEG-2 + VP8 too.
B4 stays in backlog for a separate iteration; iter5 doesn't touch the backend.
## Kernel-agent operational state
Per memory `project_kernel_agent.md` (2026-05-09):
- Agent **spec'd, not operational**. `ka-promote / ka-close / ka-install / ka-status` CLI verbs designed but not implemented.
- Fleet manifest exists at `git.reauktion.de/marfrit/kernel-agent/fleet/fresnel.yaml` and documents the canonical patch set + baseline.
- Build host primary: boltzmann (kbuild-aarch64 surrogate, native).
- Build host fallback: fermi (hertz LXD, ALARM aarch64).
- No distcc for kernel-agent builds (per `feedback_kernel_agent_no_distcc.md`).
- Package versioning: `${baseline_ref}.kafr${pkgrel}`. iter5 produces `7.0.kafr2`.
The current manifest `fleet/fresnel.yaml` explicitly excludes vb2_dma_resv per a 2026-04-28 decision:
> Explicitly NOT included (tracked elsewhere, decision logged):
> - subsystem/media/videobuf2/dma-resv-release-fence/ (RFC v1 rejected;
> v2 in design — see marfrit/dmabuf-modifier-triage#3. Skip until v2 lands
> or we explicitly accept v1-shape parity with ohm.)
iter5 work re-classifies vb2_dma_resv from "skip" to "include," updates the manifest, and lands the build. Manual build path (no `ka-*` CLI yet) is the fallback per Phase 0 lock.
## Phase 4 plan preview
Phase 4 will detail the patch sequence + manifest update + build pipeline + verification matrix. Predicted shape:
- **4 patches** (3 RFC v2 rebased + 1 new rkvdec consumer).
- **1 manifest update** to `fleet/fresnel.yaml`: remove `Explicitly NOT included` block for vb2_dma_resv, add 4 includes under `includes:`, bump version comment.
- **1 build cycle** on boltzmann producing `linux-fresnel-fourier 7.0.kafr2-*.pkg.tar.zst`.
- **1 install + reboot on fresnel** via pacman.
- **1 Phase 7 verification matrix** running ffmpeg-vaapi-hwdownload on all 5 codecs, byte-identical YUV check vs SW reference, no transitive proof.
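The per-codec check in the verification matrix could be scripted roughly as follows. This is a sketch under two assumptions: the hardware and software decodes have already been dumped to raw files of the same pixel format, and the Bug 2 failure shows up as a dump consisting entirely of `0x4c` bytes, as iter4 reported. All three function names are hypothetical:

```bash
# Sketch only: all function names are hypothetical helpers.
# True when both raw dumps are non-empty and match byte for byte.
yuv_identical() {
    [ -s "$1" ] && [ -s "$2" ] && cmp -s "$1" "$2"
}

# True when every byte of the file is 0x4c (octal 114), i.e. nothing is
# left after deleting that byte value.
is_all_0x4c() {
    [ -s "$1" ] && [ "$(LC_ALL=C tr -d '\114' < "$1" | wc -c)" -eq 0 ]
}

# Classify one hardware dump against its software reference.
# Usage: classify_frame_dump <hw.yuv> <sw.yuv>
classify_frame_dump() {
    local hw=$1 sw=$2
    if yuv_identical "$hw" "$sw"; then
        echo PASS
    elif is_all_0x4c "$hw"; then
        echo FAIL-init-pattern   # readback raced the decoder (Bug 2 shape)
    else
        echo FAIL-mismatch       # decoded, but differs from the reference
    fi
}
```

Separating the init-pattern case from a generic mismatch keeps a Bug 2 recurrence distinguishable from a genuine decode difference. If the init pattern only covers some planes, `is_all_0x4c` would need to be narrowed to the affected region.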
Predicted LOC delta:
- Patch 1 (vb2 helper): ~120 LOC kernel, **operator-authored**.
- Patch 2 (hantro consumer): +12 LOC, operator-authored.
- Patch 3 (rga consumer): +10 LOC, operator-authored.
- Patch 4 (rkvdec consumer): +12 LOC, **claude-noether-authored (iter5 contribution)**.
- Manifest update: ~10 LOC YAML.
Total iter5 new code authorship: ~12 LOC of kernel C, ~10 LOC of YAML config.
## Phase 4 source-read targets
Already complete in Phase 2 (above):
- ✓ `~/src/linux-rfc/` branch `vb2-dma-resv-rfc` — 3 RFC v2 patches read end-to-end.
- ✓ v6.12 + v7.0 `include/uapi/linux/v4l2-controls.h` MPEG-2 + VP8 struct diff — byte-identical.
- ✓ v6.12 + v7.0 `drivers/media/v4l2-core/v4l2-ctrls-defs.c` diff — 15 lines, none MPEG-2/VP8 related.
- ✓ v7.0 `drivers/media/platform/rockchip/rkvdec/rkvdec.c::rkvdec_buf_queue` — confirmed mechanical opt-in site.
- ✓ Fleet manifest `fleet/fresnel.yaml` — current state captured, exclusion-of-vb2_dma_resv noted.
- ✓ Empirical re-trace of MPEG-2 decode on fresnel — confirms Bug 3 is B4 cosmetic noise.
For Phase 4 (deferred until boltzmann reconnects):
- v6.12 → v7.0 delta on `drivers/media/common/videobuf2/videobuf2-core.c` — rebase risk assessment.
- v6.12 → v7.0 delta on `drivers/media/platform/rockchip/rga/rga-buf.c` — confirm rebase trivial.
- Apply the 3 RFC v2 patches with `git apply --3way` onto v7.0 baseline and capture conflict-rate.
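Before the real `git apply --3way` run, a dry run can show which patches need attention. A minimal sketch, assuming the three RFC v2 patches have been exported to files (for example with `git format-patch`) and the v7.0 baseline is checked out in the target repo; `check_patches` is a hypothetical helper name:

```bash
# Sketch only: check_patches is a hypothetical helper.
# Dry-run each patch against the checked-out tree without modifying it.
# Usage: check_patches <repo> <absolute-patch-path>...
# Patch paths must be absolute because git -C changes directory first.
check_patches() {
    local repo=$1 p
    shift
    for p in "$@"; do
        if git -C "$repo" apply --check "$p" 2>/dev/null; then
            printf 'clean\t%s\n' "$p"
        else
            # Candidate for the git apply --3way pass and manual resolution.
            printf 'conflict\t%s\n' "$p"
        fi
    done
}
```

Each patch is checked independently against the unmodified tree. That is sufficient here because the three patches touch disjoint files; a series with overlapping files would have to be applied in order instead.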
## What "iteration 5 close" looks like
Per `feedback_dev_process.md` Phase 8:
- All 4 Phase 1 criteria green (Bug 2 closed for all 5 codecs · substrate ships from kernel-agent · no codec-contract regression · 5/5 direct verification).
- `phase8_iteration5_close.md` documenting the patches, build details, verification matrix.
- Campaign scoreboard updated from "5/5 (4 direct + 1 transitive)" to "5/5 direct."
- Memory entries distilled — likely 1 new entry on the contract: "vb2_dma_resv pattern: V4L2 producers must opt-in per driver, one line at end of buf_queue callback." Predicted name: `reference_vb2_dma_resv_opt_in_pattern.md` or fold update into existing `reference_dmabuf_resv_blocker.md`.
- Phase 5 sonnet-architect review pass signed off.
- Commits authored as `claude-noether` per `feedback_gitea_as_claude_noether.md`. Operator-authored RFC v2 patches preserve `Signed-off-by: Markus Fritsche <mfritsche@reauktion.de>`.
- Kernel-agent `fleet/fresnel.yaml` updated and committed.
Predicted iter5 difficulty vs iter1-4:
- **vs iter1-3 (~370 LOC libva backend per codec)**: iter5 is **much smaller in LOC** but **larger in scope** — touches kernel + build pipeline instead of single binary.
- **vs iter4 (single new codec, 4 commits + 1 fix-forward)**: iter5 has 4 patches (3 operator-existing + 1 claude-new) + 1 manifest update. Comparable patch count, simpler per-patch shape.
- **Predicted Phase 7 failure modes**:
1. RFC v2 rebase conflicts on videobuf2-core.c (medium risk — vb2 core is active code).
2. Helper patch causes silent regression on a non-opted-in driver (low risk — patch is opt-in by design).
3. fence-allocation under memory pressure fails and the fence-attach call returns -ENOMEM (low impact — best-effort by design).
4. cap_pool readback still fails after the fix (the userspace race window isn't what we thought it was). **This is the highest-impact failure mode** — would force Phase 7 → Phase 4 or even Phase 7 → Phase 0 loopback.