Iteration 5b — Phase 7 (verification)
Verdict: FAIL. Reverted. Phase 7 → Phase 4 loopback with option β.
Captured 2026-05-12 mid-Phase-7. Empirical sweep on post-iter5b backend triggered a SIGSEGV in copy_surface_to_image (image.c) during HEVC libva-vaapi-hwdownload. Pre-iter5b state was broken (all-zero HEVC) but non-crashing; iter5b's option α' regressed to crashing. Reverted 3 fork commits, restored fresnel to pre-iter5b SHA256, looping back to Phase 4 with option β as Phase 2 originally recommended.
Pre-Phase-7 expectation (per phase4_iter5b_plan.md C7)
Sweep matrix expected:
| Codec | Pre-fix libva | Expected post-fix libva |
|---|---|---|
| HEVC | 06b2c5a0… (all-zero) | 9340b832… (matches kdirect) |
| VP9 | 06b2c5a0… | 4f1565e8… |
| VP8 | 06b2c5a0… | 136ce5cb… |
| MPEG-2 | 19eefbf4… (worked) | unchanged |
| H.264 | 71ac099b… (keyframe partial) | unchanged (Bug 4) |
Empirical Phase 7 result
Phase 3 sweep script re-run on fresnel at post-iter5b backend SHA d7722da742bfcb86a9136b07e6d9a5de23668f37fcad328258966c5338265e82. Device numbering re-mapped for today's boot (iter4-B1: rkvdec at /dev/video1+/dev/media0, hantro at /dev/video5+/dev/media2).
Sweep terminated on the HEVC step with SIGSEGV.
Backtrace from /var/lib/systemd/coredump/core.ffmpeg.1000.4d1e78de91ce4af0b55dcd31f823033b.5959.…:
#0 memcpy (libc.so.6 + 0x9ef58)
#1 copy_surface_to_image (v4l2_request_drv_video.so + 0x6908)
#2 RequestGetImage (v4l2_request_drv_video.so + 0x6c4c)
#3 vaGetImage (libva.so.2 + 0x7390)
#4 hwcontext_vaapi-internal (libavutil.so.60 + 0x460f0)
#5 av_hwframe_transfer_data (libavutil.so.60 + 0x3e308)
…
memcpy in copy_surface_to_image at image.c:168:
memcpy(buffer_object->data + image->offsets[i],
surface_object->destination_data[i],
surface_object->destination_sizes[i]);
One of: surface_object->destination_data[i] is NULL or freed; buffer_object->data is NULL; sizes are corrupt.
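The three failure candidates above can be turned into explicit checks. The following is a minimal sketch, not the backend's actual code: the struct layouts, the status enum, and the name copy_planes_checked are hypothetical stand-ins for the real surface and buffer objects.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PLANES 3

/* Hypothetical, trimmed-down stand-ins for the backend's surface and image buffer objects. */
struct surface_planes {
    void *destination_data[MAX_PLANES];
    size_t destination_sizes[MAX_PLANES];
    unsigned int planes_count;
};

struct image_buffer {
    unsigned char *data;
    size_t size;
    size_t offsets[MAX_PLANES];
};

enum copy_status {
    COPY_OK = 0,
    COPY_NO_BUFFER,
    COPY_NO_PLANE,
    COPY_OUT_OF_BOUNDS
};

enum copy_status copy_planes_checked(struct image_buffer *dst,
                                     const struct surface_planes *src)
{
    unsigned int i;

    if (dst == NULL || dst->data == NULL)
        return COPY_NO_BUFFER;
    if (src == NULL || src->planes_count > MAX_PLANES)
        return COPY_NO_PLANE;

    /* Validate every plane before copying anything, so a failure
     * never leaves a half-written image behind. */
    for (i = 0; i < src->planes_count; i++) {
        if (src->destination_data[i] == NULL)
            return COPY_NO_PLANE;
        if (dst->offsets[i] > dst->size ||
            src->destination_sizes[i] > dst->size - dst->offsets[i])
            return COPY_OUT_OF_BOUNDS;
    }

    for (i = 0; i < src->planes_count; i++)
        memcpy(dst->data + dst->offsets[i],
               src->destination_data[i],
               src->destination_sizes[i]);

    return COPY_OK;
}
```

A NULL check cannot detect a pointer into memory that has already been unmapped, so checks like these only help if the teardown path also clears destination_data[i] when it destroys the pool.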
Mechanism (empirically traced)
Strace of HEVC libva run captured the call sequence. Decisive observations:
- Two VIDIOC_S_FMT(OUTPUT_MPLANE) calls from a single ffmpeg-vaapi process.
- First call: pixelformat = V4L2_PIX_FMT_H264_SLICE, success.
- Second call: pixelformat = V4L2_PIX_FMT_HEVC_SLICE, EBUSY × 6 (retried six times by some inner ffmpeg loop or by my gate retry; both end with EBUSY).
- Between the two S_FMT calls: device-wide V4L2_CID_STATELESS_H264_DECODE_MODE/START_CODE + _HEVC_DECODE_MODE/START_CODE controls (CreateContext-time) AND HEVC frame controls (decode-time S_EXT_CTRLS at size=40/64/280/1000/328) → kernel decoder is already streaming when the second S_FMT fires.
What's happening
ffmpeg-vaapi's hwcontext_vaapi pattern calls vaCreateSurfaces2 more than once per session, with vaCreateContext (and subsequent STREAMON + frame decode) interleaving between calls. The flow:
- ffmpeg-vaapi: probe surfaces, first vaCreateSurfaces2. At this point, config_heap is empty or has multiple configs (mismatched ffmpeg-internal flow), so my find_sole_active_pixelformat returns 0 → fallback to H264_SLICE.
- CreateContext, STREAMON, frame decode submission (decode fails silently because controls are HEVC but OUTPUT is H264_SLICE; pre-iter5b behavior preserved at this point).
- ffmpeg-vaapi: probably some surface-reallocation pattern internal to libavutil; second vaCreateSurfaces2.
- My code: find_sole_active_pixelformat now sees the active HEVC config → returns HEVC_SLICE. My gate detects pixelformat changed (H264_SLICE → HEVC_SLICE), enters teardown branch.
- Teardown calls cap_pool_destroy(), which frees the CAPTURE mmaps. Then v4l2_request_buffers(OUTPUT, 0), which silently fails because the OUTPUT queue is streaming (REQBUFS(0) doesn't tear down a streaming queue).
- v4l2_set_format(HEVC_SLICE) returns EBUSY.
- My code returns VA_STATUS_ERROR_OPERATION_FAILED from CreateSurfaces2.
- But cap_pool is already destroyed. Surface objects in flight still hold destination_data[i] pointers into the freed CAPTURE mmaps.
- Decoder finishes the in-flight frame (or didn't, since we destroyed its target). ffmpeg-vaapi calls vaGetImage → backend RequestGetImage → copy_surface_to_image → memcpy on stale destination_data[i] → SIGSEGV.
Why option α' is fundamentally wrong
The surface.c:164-171 TODO comment from iter1 era:
TODO: this is still not a clean architecture — v4l2_set_format after CREATE_BUFS requires REQBUFS(0) first (kernel returns EBUSY otherwise). For mpv's pattern (probe with small, then allocate big) the small probe surfaces have not been streamed yet, so REQBUFS(0) on them works. For consumers that legitimately stream multiple resolutions in sequence, we'd need to STREAMOFF + REQBUFS(0) + new S_FMT + new CREATE_BUFS — that's a context-level redesign for the next iteration.
Phase 4 of iter5b read this comment as targeting "multi-resolution mid-stream" only and dismissed it as not applicable to iter5b's "wrong-format-at-initial-create-surfaces" case. That dismissal was wrong. ffmpeg-vaapi's multi-CreateSurfaces2 pattern is exactly the case the TODO anticipates: streams are active when the second CreateSurfaces2 fires, REQBUFS(0) silently fails, my gate fires destructive teardown anyway, and state corruption follows.
The Phase 5 reviewer's original Phase 2 recommendation, option β (defer OUTPUT-side S_FMT/CREATE_BUFS lifecycle to CreateContext), would have avoided this. CreateContext fires AFTER the binding config_id is known; there's exactly one CreateContext call per decode session in every consumer; STREAMON happens at the same site so the ordering is unambiguous. iter5b should have done β.
Action taken: revert + restore
Three reverts on the shared fork master, authored as claude-noether:
6bc29ec Revert "fresnel-fourier iter5b Phase 6 commit A: NEW src/codec.{h,c} — pixelformat_for_profile helper"
9a7f888 Revert "fresnel-fourier iter5b Phase 6 commit B: state-tracking — request.h field + config.c wire-up"
709ab34 Revert "fresnel-fourier iter5b Phase 6 commit C: surface.c — profile-derived OUTPUT pixel format"
The iter5b commits (4b2288f, f8256e6, ce304ef) stay in history for reference. Fork tip after reverts: 6bc29ec. Backend rebuilt + reinstalled on fresnel. SHA256 confirms restoration:
$ sha256sum /usr/lib/dri/v4l2_request_drv_video.so
6e90b7a9b2c33480dd3ffc2da8423ab0bcef14f23c68cf18dc2ae2ff66ac808c
Identical to iter4-close binary. Pre-iter5b state restored.
Empirical verification of restored state
Post-revert HEVC libva run:
ssh fresnel 'env LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_NO_AUTODETECT=1 \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
ffmpeg -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi \
-hwaccel_device /dev/dri/renderD128 \
-i ~/fourier-test/bbb_720p10s_hevc.mp4 -frames:v 2 \
-vf hwdownload,format=nv12,format=yuv420p \
-f rawvideo /tmp/post_revert_hevc.yuv'
Exit code 0 (no SIGSEGV). Hash 822e3a311bc34185394aeb709bb83310f1243089b95cdadc218eee179b1d6f78 = SHA256 of 2 frames × 720p NV12-to-yuv420p of all-zero bytes (verified: python3 -c "import hashlib; print(hashlib.sha256(b'\x00'*2764800).hexdigest())" produces the same hash). Pre-iter5b broken-but-noncrashing state restored.
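The 2764800-byte figure follows from the frame geometry. A small sketch of the arithmetic, assuming 720p means 1280×720 and 8-bit 4:2:0 sampling (NV12 and yuv420p have the same total size); the function names are illustrative only.

```c
#include <stddef.h>

/* Bytes in one 8-bit 4:2:0 frame: a full-resolution luma plane plus
 * two chroma planes subsampled by 2 in each direction. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    size_t luma = width * height;
    size_t chroma = 2 * ((width + 1) / 2) * ((height + 1) / 2);
    return luma + chroma;
}

/* Bytes in a raw dump of `frames` consecutive frames. */
size_t yuv420_stream_bytes(size_t width, size_t height, size_t frames)
{
    return frames * yuv420_frame_bytes(width, height);
}
```

For 1280×720 this gives 921600 + 460800 = 1382400 bytes per frame, and 2764800 bytes for the two frames requested with -frames:v 2.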
Per-criterion verdict against Phase 1 lock
| # | Criterion | Verdict |
|---|---|---|
| 1 | HEVC + VP9 + VP8 libva == kdirect == sw | FAIL: HEVC + VP9 + VP8 still all-zero; HEVC additionally crashed on the iter5b build (now reverted) |
| 2 | MPEG-2 unchanged | PASS (regression-free post-revert; the sweep on the iter5b build aborted at HEVC and never reached MPEG-2) |
| 3 | H.264 keyframe still decodes | PASS (unchanged post-revert) |
| 4 | Control-payload anchors hold | PASS (backend code is bit-identical to iter4 tip; no control-handling changes anywhere in the reverted commits) |
iter5b verdict: 3 of 4 criteria PASS, 1 of 4 FAIL. Criterion 1 was the goal of iter5b, so the iteration does not close.
Cost of iter5b
- Phase 0 loopback work: salvageable (the empirical Bug 2 diagnosis at surface.c:173 was correct; OUTPUT format IS the right thing to fix; the discovery survives the failed Phase 4 plan).
- Phase 2 situation: salvageable (lifecycle trace is reusable for Phase 4 v2 with option β).
- Phase 4 plan + Phase 5 review + Phase 6 implementation: failed work. ~200 LOC written + reverted. Real cost = wallclock of Phase 4-Phase 7.
- Memory rules pinned: still valid (feedback_trace_fix_mechanism_to_consumer.md, the pin was correct independent of which option Phase 4 chose).
Net: half a day of wallclock burned for the empirical evidence that option α' is wrong. The Phase 5 reviewer's original Phase 2 recommendation of β was correct; the Phase 4 author (me) overrode it on insufficient grounds.
Phase 7 → Phase 4 loopback decision
Per feedback_dev_process.md Phase 7 → Phase 4 edge: a failing Phase 7 loops back to Phase 4 with the new evidence. User pick 2026-05-12: "Revert 3 fork commits, re-plan as option β."
Next: rewrite Phase 4 plan as iter5b-β.
iter5b-β skeleton
- Move OUTPUT-side S_FMT + CREATE_BUFS lifecycle from RequestCreateSurfaces2 to RequestCreateContext (the existing request_pool_init site at context.c:103+).
- RequestCreateSurfaces2 keeps only CAPTURE-side allocation work AND ID-allocation work. But: hantro derives CAPTURE format from OUTPUT format at S_FMT time. If we defer OUTPUT S_FMT to CreateContext, then CAPTURE format isn't queryable at CreateSurfaces2 time either. Solution: also defer cap_pool_init AND destination_* field setup from CreateSurfaces2 to CreateContext.
- RequestCreateSurfaces2 becomes "allocate surface object IDs and initialize per-surface bookkeeping; defer all V4L2-device state until CreateContext binds them to a profile."
- RequestCreateContext: look up config_object->profile (already done at context.c:71), derive pixelformat via pixelformat_for_profile(), do S_FMT(OUTPUT, pixelformat), then existing request_pool_init flow, then cap_pool_init, then walk surface_ids and fill destination_*.
- Predicted LOC delta: ~200 LOC across surface.c, context.c, request.h.
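The split of responsibilities in the skeleton above can be sketched as a toy model. Everything here is hypothetical: the toy_ names, the profile enum, and the struct layout are placeholders rather than the backend's real types, and the fourcc values are written out locally on the assumption that they match linux/videodev2.h (worth verifying against the kernel headers).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))
#define TOY_MAX_SURFACES 16

/* Hypothetical stand-in for the VA profile carried by the config object. */
enum toy_profile {
    TOY_PROFILE_MPEG2, TOY_PROFILE_H264, TOY_PROFILE_HEVC,
    TOY_PROFILE_VP8, TOY_PROFILE_VP9, TOY_PROFILE_UNKNOWN
};

/* Illustrative version of a profile-to-OUTPUT-format helper; 0 means unsupported. */
uint32_t toy_pixelformat_for_profile(enum toy_profile profile)
{
    switch (profile) {
    case TOY_PROFILE_MPEG2: return TOY_FOURCC('M', 'G', '2', 'S');
    case TOY_PROFILE_H264:  return TOY_FOURCC('S', '2', '6', '4');
    case TOY_PROFILE_HEVC:  return TOY_FOURCC('S', '2', '6', '5');
    case TOY_PROFILE_VP8:   return TOY_FOURCC('V', 'P', '8', 'F');
    case TOY_PROFILE_VP9:   return TOY_FOURCC('V', 'P', '9', 'F');
    default:                return 0;
    }
}

struct toy_driver {
    uint32_t output_pixelformat;  /* 0 until CreateContext sets it */
    bool cap_pool_ready;
    bool has_destination[TOY_MAX_SURFACES];
    unsigned int surface_count;
};

/* CreateSurfaces2 under β: IDs and bookkeeping only, no device state. */
int toy_create_surfaces(struct toy_driver *d, unsigned int count)
{
    unsigned int i;

    if (count > TOY_MAX_SURFACES - d->surface_count)
        return -1;
    for (i = 0; i < count; i++)
        d->has_destination[d->surface_count + i] = false;
    d->surface_count += count;
    return 0;
}

/* CreateContext under β: the profile is known, so device state is set up here. */
int toy_create_context(struct toy_driver *d, enum toy_profile profile)
{
    uint32_t pixelformat = toy_pixelformat_for_profile(profile);
    unsigned int i;

    if (pixelformat == 0)
        return -1;
    d->output_pixelformat = pixelformat;    /* S_FMT(OUTPUT, pixelformat) */
    /* the existing request_pool_init flow would run at this point */
    d->cap_pool_ready = true;               /* cap_pool_init */
    for (i = 0; i < d->surface_count; i++)
        d->has_destination[i] = true;       /* fill destination_* */
    return 0;
}
```

In this model any number of toy_create_surfaces calls before the context leave the device untouched, so there is no format to guess and nothing to tear down. The sketch does not cover a CreateSurfaces2 that arrives after CreateContext, which the trace shows ffmpeg-vaapi doing, so the real plan would still need to handle late surfaces.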
Reusable from iter5b
- src/codec.{h,c} (the pixelformat_for_profile helper): useful for both α' and β. Re-introduce in iter5b-β commit A.
- object_config->pixelformat wire-up at CreateConfig: useful for both. Re-introduce in iter5b-β commit B.
- The Phase 2 situation lifecycle analysis at phase2_iter5b_situation.md: reusable.
- Phase 3 baseline (iter5_phase3_baseline.tgz): still the relevant pre-fix anchor.
- Phase 5 review's CRIT-1 fix (int iter not struct, cast-on-return pattern): applies to any heap walk in β too if β still does one (probably doesn't; CreateContext has config_id directly).
Memory updates
The feedback_trace_fix_mechanism_to_consumer.md rule pinned earlier in this iteration was the right idea but didn't catch this specific failure mode. The relevant lesson now: trace the consumer's call pattern, not just its read sites. The Phase 5 reviewer read the read sites in image.c empirically but didn't simulate ffmpeg-vaapi's multi-CreateSurfaces2 call pattern. The miss was on the "when does the consumer call my entry point, and how many times?" axis, not the "what does the consumer do with the data?" axis.
Consider amending or adding a sibling rule. Defer the memory update to iter5b-β Phase 8 close (or skip if the lesson is already covered).
Substrate at Phase 7 close
- Fork tip: 6bc29ec (revert of iter5b commit A).
- Backend installed SHA256: 6e90b7a9b2c33480dd3ffc2da8423ab0bcef14f23c68cf18dc2ae2ff66ac808c (identical to iter4-close).
- Kernel: linux-fresnel-fourier 7.0-1 (unchanged).
- Test fixtures: unchanged.
- Phase 3 baseline: still valid, still relevant.
- Phase 4 v1 (α'): rejected.
- Phase 4 v2 (β): to-be-written.