fresnel-fourier/phase2_iter5_situation.md
marfrit 9941523f1f iter5 Phase 2: situation analysis — 4-patch plan (3 RFC v2 + 1 new rkvdec consumer)
Source-read complete: 3 RFC v2 patches dissected, v7.0 rkvdec_buf_queue
site identified at line 954 of drivers/media/platform/rockchip/rkvdec/rkvdec.c,
empirical disproof of Bug 3 UAPI drift via byte-identical v6.12↔v7.0 struct
diff, hantro_v4l2.c confirmed unchanged across the same range.

Rebase risk concentrated in videobuf2-core.c (medium — vb2 core sees regular
activity); deferred to Phase 4 when boltzmann is reachable for the
git apply --3way verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 03:58:07 +00:00

Iteration 5 — Phase 2 (situation analysis)

Captured 2026-05-10 evening / 2026-05-11 in resume. Closes Phase 2 of iter5 per feedback_dev_process.md: source-read of the kernel paths Bug 2 touches, contract-before-patch citation of every site iter5 modifies. The major mid-Phase finding (Bug 3 collapses) was already folded back into Phase 0 (phase0_findings_iter5.md amendment, commit 31b9255).

Bug 2 — vb2_dma_resv blocker (real, RFC v2 ready)

Root-cause framing

Per memory reference_dmabuf_resv_blocker.md: V4L2 producers don't propagate VB2_BUF_STATE_DONE into the dmabuf's dma_resv exclusive fence. When userspace consumers (libva backend cap_pool readback path, Wayland compositors, etc.) import a V4L2-produced dmabuf and wait on the implicit-sync fence (poll(POLLIN) / DMA_BUF_IOCTL_EXPORT_SYNC_FILE), they see either no fences or the stub fence from dma_fence_get_stub().

For our cap_pool readback path, the practical result is: the libva backend reads the dmabuf-backed CAPTURE buffer before the kernel-side decoder IRQ has signalled completion, gets back the page state at QBUF time (the cap_pool init pattern RGB(0, 0x4c, 0)), and ships that to ffmpeg-vaapi-hwdownload. The kernel decoded the frame correctly — but the userspace consumer read the page out of order.
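The consumer-side wait described above can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the libva backend's actual code: `wait_fd_readable` is a hypothetical helper name, and the sketch relies only on the dma-buf behaviour that `poll(POLLIN)` on a dmabuf fd waits for the write fences in its dma_resv. If the producer never attached a fence, the wait has nothing to block on and returns at once, which matches the failure mode described here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/*
 * Block until fd reports POLLIN or timeout_ms elapses.
 * Returns 1 when readable, 0 on timeout, -1 on error (errno set).
 * Hypothetical helper: for a dmabuf fd, POLLIN waits on write fences.
 */
int wait_fd_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    assert(fd >= 0);

    /* Retry when a signal interrupts the wait (the timeout restarts). */
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret <= 0)
        return ret;             /* 0 = timeout, -1 = poll() failure */

    if (pfd.revents & (POLLERR | POLLNVAL)) {
        errno = EIO;
        return -1;
    }
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A consumer would call `wait_fd_readable(dmabuf_fd, timeout_ms)` before touching the mapped CAPTURE pages. The call only provides ordering when the producer has attached a real fence, which is what the patches below add.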

Why it surfaces on linux-fresnel-fourier 7.0-1 and not on linux-eos-arm 6.19.9-99: this is not a regression. The same bug existed on 6.19; iter3 hit it on hantro (memory reference_dmabuf_resv_blocker.md documents the symptom as all-zero pages on RK3399 hantro CAPTURE). The shift from "iter3 saw it on hantro only" to "iter4 sees it across all 4 codecs on the new kernel" is most likely a timing change: the new kernel's cap_pool allocation / buffer-handoff timing differs from 6.19's (direction not measured), so the userspace race window that was only intermittently open on rkvdec at 6.19 is now consistently open at 7.0. iter3 deferred this for hantro; iter4 surfaced it for rkvdec on the new substrate; iter5 fixes it for both blocks at the kernel layer.

RFC v2 patch series (source: ~/src/linux-rfc/ branch vb2-dma-resv-rfc)

Three operator-authored patches on top of v6.12:

Patch 1/3 — fbe8bf57a media: videobuf2: add dma_resv release-fence helper

drivers/media/common/videobuf2/videobuf2-core.c | +99
include/media/videobuf2-core.h                  | +19

Adds the opt-in API. Key surface:

  • int vb2_buffer_attach_release_fence(struct vb2_buffer *vb) — driver calls from buf_queue. Allocates a dma_fence on the queue's per-queue fence context (set up at vb2_core_queue_init), attaches it as DMA_RESV_USAGE_WRITE on each plane's dmabuf->resv, stashes in vb->release_fence. Skips planes whose vb2_plane.dbuf is NULL.
  • vb2_buffer_signal_release_fence(vb, state) — internal helper called from vb2_buffer_done() on state transition. Signals + puts the fence. No-op when vb->release_fence is NULL (drivers that didn't opt in).
  • New struct vb2_queue fields: u64 dma_resv_fence_context, atomic64_t dma_resv_fence_seqno, spinlock_t dma_resv_fence_lock.
  • New struct vb2_buffer field: struct dma_fence *release_fence.

This is the only non-trivial patch in the series: it adds ~120 lines of new code to vb2 core (+99 in videobuf2-core.c, +19 in videobuf2-core.h). Drivers that don't opt in pay zero cost beyond a few extra struct fields.
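The opt-in contract listed above (attach at buf_queue, signal and put at buffer-done, no-op when nothing was attached) can be modelled with a toy userspace sketch. Everything here is hypothetical: `toy_fence`, `toy_buffer` and both functions are stand-ins invented for illustration, they are not the RFC's code, and they leave out the per-plane dma_resv attachment entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; none of these are kernel types. */
struct toy_fence {
    bool signaled;
    int refcount;
};

struct toy_buffer {
    struct toy_fence *release_fence;  /* NULL unless the driver opted in */
};

/* Models the opt-in call a driver would make from its buf_queue callback. */
int toy_attach_release_fence(struct toy_buffer *vb)
{
    struct toy_fence *f;

    assert(vb && !vb->release_fence);

    f = calloc(1, sizeof(*f));
    if (!f)
        return -ENOMEM;         /* best-effort: the caller may ignore this */

    f->refcount = 1;            /* reference held by the buffer */
    vb->release_fence = f;
    return 0;
}

/*
 * Models the hook in the buffer-done path: signal the fence, then drop
 * the buffer's reference. Returns false when the driver never opted in.
 */
bool toy_signal_release_fence(struct toy_buffer *vb)
{
    struct toy_fence *f;

    assert(vb);
    f = vb->release_fence;
    if (!f)
        return false;           /* no fence attached: no-op */

    f->signaled = true;
    vb->release_fence = NULL;
    if (--f->refcount == 0)
        free(f);
    return true;
}
```

The point of the model is the asymmetry: a driver that never calls the attach function sees no behavioural change in the done path, which is why the helper can land in vb2 core without touching existing drivers.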

Patch 2/3 — 14a68fcf0 media: hantro: attach dma_resv release fence at buf_queue

drivers/media/platform/verisilicon/hantro_v4l2.c | +12

The driver-side opt-in is one line of code plus a 10-line comment block:

```diff
 static void hantro_buf_queue(struct vb2_buffer *vb)
 {
     ...
     v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);

+    /*
+     * Opt in to vb2's dma_resv release-fence path. [...]
+     */
+    (void)vb2_buffer_attach_release_fence(vb);
 }
```

The operator's commit message reports empirical validation on PineTab2 (RK3566 hantro) running mainline 6.19 with this series backported: KWin's Transaction::watchDmaBuf wait completes correctly the moment hantro's IRQ fires.

Patch 3/3 — 89b699508 media: rockchip-rga: attach dma_resv release fence at buf_queue

drivers/media/platform/rockchip/rga/rga-buf.c | +10

Same shape as the hantro patch. Out-of-scope for iter5's libva path (we don't use RGA), but kept in the kernel-agent local-carry as part of the cohesive series — RGA is referenced by GStreamer flows on Rockchip boards and the operator's intent (per RFC commit message) is to land all three v4l2 producers together.

Gap — no rkvdec consumer patch

The series ships hantro + rga but not rkvdec. iter4 Phase 7 verified Bug 2 hits rkvdec too on the new substrate (constant 0x4c for H.264 inter + HEVC + VP9 cap_pool reads). iter5 contributes the missing 4th patch.

Patch 4/4 — media: rkvdec: attach dma_resv release fence at buf_queue (NEW, iter5 work)

Target file: drivers/media/platform/rockchip/rkvdec/rkvdec.c at v7.0 (post-staging-promotion path; was drivers/staging/media/rkvdec/ in earlier kernels).

Target function: rkvdec_buf_queue at line 954 of 028ef9c96e96 Linux 7.0:

```c
static void rkvdec_buf_queue(struct vb2_buffer *vb)
{
    struct rkvdec_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);
    struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);

    v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
}
```

Patch shape (mechanical, same as hantro patch):

```diff
 static void rkvdec_buf_queue(struct vb2_buffer *vb)
 {
     struct rkvdec_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);
     struct vb2_v4l2_buffer *vbuf = to_vb2_v4l2_buffer(vb);

     v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
+
+    /*
+     * Opt in to vb2's dma_resv release-fence path. Userspace
+     * consumers that imported this buffer's dmabuf and wait on
+     * its implicit-sync fence get a real producer fence
+     * representing rkvdec's completion, instead of the stub
+     * fence dma_buf_export_sync_file substitutes when dma_resv
+     * is empty. Best-effort: a fence-allocation failure means we
+     * lose implicit-sync precision, no functional regression.
+     */
+    (void)vb2_buffer_attach_release_fence(vb);
 }
```

Author trailer must preserve attribution discipline per memory feedback_gitea_as_claude_noether.md: this is Claude-authored work, sign as claude-noether <sentinel-email>, with a Co-Authored-By: trailer for the operator if iter5 is reviewed via PR flow. Local-carry-only acceptable per Phase 0 lock.

Rebase risk — v6.12 base → v7.0 base

The 3 existing RFC v2 patches were authored against v6.12. The kernel-agent product baseline is v7.0 (per fleet/fresnel.yaml). Risk surface:

| File | v6.12 → v7.0 delta | Rebase risk |
|---|---|---|
| drivers/media/common/videobuf2/videobuf2-core.c | Not measured (boltzmann offline). Expect non-zero delta: vb2 core sees regular activity. | MEDIUM. The helper patch adds includes, extends vb2_buffer_done, and extends vb2_core_queue_init. Conflicts possible. Phase 4 task: run git apply --3way against v7.0 and resolve. |
| include/media/videobuf2-core.h | Not measured. | LOW. Header changes are typically less churn-prone. |
| drivers/media/platform/verisilicon/hantro_v4l2.c | Confirmed unchanged v6.12 → v7.0 (boltzmann diff stat showed 0 lines in hantro_v4l2.c). | LOW. Patch should apply cleanly. |
| drivers/media/platform/rockchip/rga/rga-buf.c | Not measured. | LOW. rga sees less churn than vb2 core. |
| drivers/media/platform/rockchip/rkvdec/rkvdec.c | Not applicable: iter5 is authoring this patch fresh against v7.0. | N/A |

Boltzmann reconnection needed for Phase 4 final rebase verification. Not blocking Phase 2 close.

v4l2_m2m / v4l2-mem2mem rebase note

The hantro + rga patches both insert their opt-in call after v4l2_m2m_buf_queue(). The rkvdec consumer follows the same shape. If any of these v4l2_m2m_* helpers shifted between v6.12 and v7.0 in a way that affects the buf_queue call signature, the patches need updating. Not measured; Phase 4 task.

Bug 3 — collapsed (UAPI drift hypothesis was wrong)

Empirical disproof of "UAPI drift" hypothesis

iter4 Phase 7 doc speculated:

Hantro Unable to set control(s) errors: a kernel-side rejection on hantro for MPEG-2/VP8. Substrate change appears to have shifted hantro's expected control structure or fields; iter1 (MPEG-2) and iter3 (VP8) were tested on 6.19.9 — UAPI likely drifted between 6.19.9 and 7.0 the same way VP9 did.

Empirical struct-by-struct check 2026-05-10:

```sh
ssh boltzmann 'cd ~/src/linux-rockchip &&
  for ref in v6.12 028ef9c96e96; do
    echo "===$ref==="
    # Print each MPEG-2 / VP8 control struct, from its opening line to the closing brace
    git show "$ref":include/uapi/linux/v4l2-controls.h | awk \
      "/^struct v4l2_ctrl_(mpeg2_(sequence|picture|quantisation)|vp8_frame) {/{f=1; print; next} f{print; if(\$0~/^};/) f=0}"
  done'
```

Result: byte-identical struct definitions across v6.12 and v7.0 for:

  • struct v4l2_ctrl_mpeg2_sequence (8 fields)
  • struct v4l2_ctrl_mpeg2_picture (8 fields)
  • struct v4l2_ctrl_mpeg2_quantisation (4 fields)
  • struct v4l2_ctrl_vp8_frame (30 fields)

In addition, the drivers/media/v4l2-core/v4l2-ctrls-defs.c delta over the same range was 15 lines, all additions for unrelated controls (FLASH duration, HEVC EXT_SPS_*_RPS, AV1).

So the iter4 hypothesis was wrong — there is no UAPI drift on MPEG-2 or VP8.

Actual cause of "Unable to set control(s)"

Re-traced MPEG-2 decode on fresnel 7.0-1 with explicit hantro-decoder env override (/dev/video2 + /dev/media0 on the 2026-05-10 boot):

```sh
LIBVA_DRIVER_NAME=v4l2_request \
  LIBVA_V4L2_REQUEST_NO_AUTODETECT=1 \
  LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video2 \
  LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
  ffmpeg -hwaccel vaapi -i bbb_720p10s_mpeg2.ts -frames:v 2 -f null -
```

Captured ioctl trace (strace -ff -v -x -s 999999). Sequence of VIDIOC_S_EXT_CTRLS submissions on hantro:

| # | ctrl_class | controls | result | meaning |
|---|---|---|---|---|
| 1 | 0 | 0xa40900 H264_DECODE_MODE, 0xa40901 H264_START_CODE | EINVAL error_idx=2 | Backend probe; fails because hantro doesn't expose H.264 |
| 2 | 0 | 0xa40a95 HEVC_DECODE_MODE, 0xa40a96 HEVC_START_CODE | EINVAL error_idx=2 | Backend probe; fails because hantro doesn't expose HEVC |
| 3 | 0xf010000 | 0xa409dc MPEG2_SEQUENCE(12B), 0xa409dd MPEG2_PICTURE(32B), 0xa409de MPEG2_QUANTISATION(256B) | 0 | Frame 1 controls accepted |
| 4 | 0xf010000 | same shape | 0 | Frame 2 controls accepted |
| 5..7 | 0xf010000 | same shape, varying timestamps | 0 | More frames |

The init-time H.264 + HEVC probes happen on every device the libva backend binds to. On rkvdec they succeed (rkvdec supports both); on hantro they fail with EINVAL because hantro is MPEG-2 + VP8 only. The EINVAL log lines are cosmetic: actual MPEG-2 frame submission returns 0 (rows 3 to 7 of the trace), and VP8 presumably behaves the same way (not traced here).
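The hex values in the trace can be cross-checked by hand. The sketch below rebuilds each control ID as the stateless codec base plus an offset, and separates the two kinds of submission by the first field of the call. The constants are local copies of the UAPI values (redefined so the sketch builds without kernel headers). The helper names are hypothetical. Reading the trace's ctrl_class column as the `which` field, which shares a union with the deprecated ctrl_class member, is an interpretation of the strace output rather than something the trace states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local copies of UAPI constants (include/uapi/linux/v4l2-controls.h and
 * videodev2.h), redefined here so the sketch builds without kernel headers.
 */
#define CTRL_CLASS_CODEC_STATELESS 0x00a40000u
#define CID_CODEC_STATELESS_BASE   (CTRL_CLASS_CODEC_STATELESS | 0x900u)
#define CTRL_WHICH_CUR_VAL         0x00000000u
#define CTRL_WHICH_REQUEST_VAL     0x0f010000u

static_assert(CID_CODEC_STATELESS_BASE == 0x00a40900u,
              "stateless base mismatch");

/* Offsets of the controls seen in the trace, relative to the base. */
enum stateless_offset {
    OFF_H264_DECODE_MODE   = 0,
    OFF_H264_START_CODE    = 1,
    OFF_MPEG2_SEQUENCE     = 220,
    OFF_MPEG2_PICTURE      = 221,
    OFF_MPEG2_QUANTISATION = 222,
    OFF_HEVC_DECODE_MODE   = 405,
    OFF_HEVC_START_CODE    = 406,
};

/* Control ID for a given offset from the stateless codec base. */
static inline uint32_t stateless_cid(enum stateless_offset off)
{
    return CID_CODEC_STATELESS_BASE + (uint32_t)off;
}

/* Same mask as V4L2_CTRL_ID2CLASS: the class a control ID belongs to. */
static inline uint32_t ctrl_class_of(uint32_t cid)
{
    return cid & 0x0fff0000u;
}

/* True when a call targets a media request rather than current values. */
static inline int is_request_submission(uint32_t which)
{
    return which == CTRL_WHICH_REQUEST_VAL;
}
```

Under that reading, rows 1 and 2 (value 0) are current-value probes made at init time, while rows 3 onward (0xf010000) are per-frame request submissions, which is consistent with the probes being independent of actual decode.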

Bug 3 → B4 backlog item, not iter5 scope

This is iter1+ backlog item B4 ("context.c log suppression for unsupported codec controls"). Cosmetic noise. Doesn't affect functional decode. The actual MPEG-2 + VP8 pixel-output FAIL at iter4 Phase 7 was caused by Bug 2 (cap_pool readback returning init pattern), identical in shape to the rkvdec case. Fixing Bug 2 fixes MPEG-2 + VP8 too.

B4 stays in backlog for a separate iteration; iter5 doesn't touch the backend.

Kernel-agent operational state

Per memory project_kernel_agent.md (2026-05-09):

  • Agent spec'd, not operational. ka-promote / ka-close / ka-install / ka-status CLI verbs designed but not implemented.
  • Fleet manifest exists at git.reauktion.de/marfrit/kernel-agent/fleet/fresnel.yaml and documents the canonical patch set + baseline.
  • Build host primary: boltzmann (kbuild-aarch64 surrogate, native).
  • Build host fallback: fermi (hertz LXD, ALARM aarch64).
  • No distcc for kernel-agent builds (per feedback_kernel_agent_no_distcc.md).
  • Package versioning: ${baseline_ref}.kafr${pkgrel}. iter5 produces 7.0.kafr2.
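The versioning rule in the last bullet is simple string composition. A minimal sketch, assuming the baseline is passed as the bare version string ("7.0") and that `format_ka_pkgver` is a hypothetical helper name, not part of any kernel-agent tooling:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Compose "${baseline_ref}.kafr${pkgrel}" into out.
 * Returns 0 on success, -1 on empty baseline or truncated output.
 */
int format_ka_pkgver(char *out, size_t out_len,
                     const char *baseline_ref, unsigned pkgrel)
{
    int n;

    assert(out && baseline_ref);

    if (strlen(baseline_ref) == 0)
        return -1;

    n = snprintf(out, out_len, "%s.kafr%u", baseline_ref, pkgrel);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With baseline "7.0" and pkgrel 2 this yields the 7.0.kafr2 string named above.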

The current manifest fleet/fresnel.yaml explicitly excludes vb2_dma_resv per a 2026-04-28 decision:

Explicitly NOT included (tracked elsewhere, decision logged):

  • subsystem/media/videobuf2/dma-resv-release-fence/ (RFC v1 rejected; v2 in design — see marfrit/dmabuf-modifier-triage#3. Skip until v2 lands or we explicitly accept v1-shape parity with ohm.)

iter5 work re-classifies vb2_dma_resv from "skip" to "include," updates the manifest, and lands the build. Manual build path (no ka-* CLI yet) is the fallback per Phase 0 lock.

Phase 4 plan preview

Phase 4 will detail the patch sequence + manifest update + build pipeline + verification matrix. Predicted shape:

  • 4 patches (3 RFC v2 rebased + 1 new rkvdec consumer).
  • 1 manifest update to fleet/fresnel.yaml: remove Explicitly NOT included block for vb2_dma_resv, add 4 includes under includes:, bump version comment.
  • 1 build cycle on boltzmann producing linux-fresnel-fourier 7.0.kafr2-*.pkg.tar.zst.
  • 1 install + reboot on fresnel via pacman.
  • 1 Phase 7 verification matrix running ffmpeg-vaapi-hwdownload on all 5 codecs, byte-identical YUV check vs SW reference, no transitive proof.
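The byte-identical check in the last item could be done with a small comparison routine like the one below. This is a sketch under assumptions: both outputs are regular files of raw YUV on disk, `files_byte_identical` is a hypothetical name, and the actual Phase 7 harness may well use existing tools instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Compare two regular files byte for byte.
 * Returns 1 if identical, 0 if content or length differs,
 * -1 if either file cannot be opened or read.
 */
int files_byte_identical(const char *path_a, const char *path_b)
{
    unsigned char buf_a[4096], buf_b[4096];
    FILE *fa, *fb;
    int result = 1;

    assert(path_a && path_b);

    fa = fopen(path_a, "rb");
    if (!fa)
        return -1;
    fb = fopen(path_b, "rb");
    if (!fb) {
        fclose(fa);
        return -1;
    }

    for (;;) {
        size_t na = fread(buf_a, 1, sizeof(buf_a), fa);
        size_t nb = fread(buf_b, 1, sizeof(buf_b), fb);

        if (ferror(fa) || ferror(fb)) {
            result = -1;
            break;
        }
        /* A short read on only one side means the lengths differ. */
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            result = 0;
            break;
        }
        if (na == 0)
            break;              /* both files reached EOF together */
    }

    fclose(fa);
    fclose(fb);
    return result;
}
```

A return of 0 for any codec would fail the matrix outright, since the plan above rules out transitive proof.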

Predicted LOC delta:

  • Patch 1 (vb2 helper): ~120 LOC kernel, operator-authored.
  • Patch 2 (hantro consumer): +12 LOC, operator-authored.
  • Patch 3 (rga consumer): +10 LOC, operator-authored.
  • Patch 4 (rkvdec consumer): +12 LOC, claude-noether-authored (iter5 contribution).
  • Manifest update: ~10 LOC YAML.

Total iter5 new code authorship: ~12 LOC of kernel C, ~10 LOC of YAML config.

Phase 4 source-read targets

Already complete in Phase 2 (above):

  • ✓ ~/src/linux-rfc/ branch vb2-dma-resv-rfc: 3 RFC v2 patches read end-to-end.
  • ✓ v6.12 + v7.0 include/uapi/linux/v4l2-controls.h MPEG-2 + VP8 struct diff — byte-identical.
  • ✓ v6.12 + v7.0 drivers/media/v4l2-core/v4l2-ctrls-defs.c diff — 15 lines, none MPEG-2/VP8 related.
  • ✓ v7.0 drivers/media/platform/rockchip/rkvdec/rkvdec.c::rkvdec_buf_queue — confirmed mechanical opt-in site.
  • ✓ Fleet manifest fleet/fresnel.yaml — current state captured, exclusion-of-vb2_dma_resv noted.
  • ✓ Empirical re-trace of MPEG-2 decode on fresnel — confirms Bug 3 is B4 cosmetic noise.

For Phase 4 (deferred until boltzmann reconnects):

  • v6.12 → v7.0 delta on drivers/media/common/videobuf2/videobuf2-core.c — rebase risk assessment.
  • v6.12 → v7.0 delta on drivers/media/platform/rockchip/rga/rga-buf.c — confirm rebase trivial.
  • Apply the 3 RFC v2 patches with git apply --3way onto v7.0 baseline and capture conflict-rate.

What "iteration 5 close" looks like

Per feedback_dev_process.md Phase 8:

  • All 4 Phase 1 criteria green (Bug 2 closed for all 5 codecs · substrate ships from kernel-agent · no codec-contract regression · 5/5 direct verification).
  • phase8_iteration5_close.md documenting the patches, build details, verification matrix.
  • Campaign scoreboard updated from "5/5 (4 direct + 1 transitive)" to "5/5 direct."
  • Memory entries distilled — likely 1 new entry on the contract: "vb2_dma_resv pattern: V4L2 producers must opt-in per driver, one line at end of buf_queue callback." Predicted name: reference_vb2_dma_resv_opt_in_pattern.md or fold update into existing reference_dmabuf_resv_blocker.md.
  • Phase 5 sonnet-architect review pass signed off.
  • Commits authored as claude-noether per feedback_gitea_as_claude_noether.md. Operator-authored RFC v2 patches preserve Signed-off-by: Markus Fritsche <mfritsche@reauktion.de>.
  • Kernel-agent fleet/fresnel.yaml updated and committed.

Predicted iter5 difficulty vs iter1-4:

  • vs iter1-3 (~370 LOC libva backend per codec): iter5 is much smaller in LOC but larger in scope: it touches the kernel and the build pipeline instead of a single binary.
  • vs iter4 (single new codec, 4 commits + 1 fix-forward): iter5 has 4 patches (3 operator-existing + 1 claude-new) + 1 manifest update. Comparable patch count, simpler per-patch shape.
  • Predicted Phase 7 failure modes:
    1. RFC v2 rebase conflicts on videobuf2-core.c (medium risk — vb2 core is active code).
    2. Helper patch causes silent regression on a non-opted-in driver (low risk — patch is opt-in by design).
    3. Fence allocation fails under memory pressure and the fence-attach call returns -ENOMEM (low impact: best-effort by design).
    4. cap_pool readback still fails after the fix (the userspace race window isn't what we thought it was). This is the highest-impact failure mode — would force Phase 7 → Phase 4 or even Phase 7 → Phase 0 loopback.