fresnel-fourier/phase7_iter5b_verification.md
marfrit 864af258e9 iter5b Phase 7: FAIL — HEVC SIGSEGV, option α' rejected, revert + loopback to β
Empirical sweep on iter5b backend (SHA d7722da...) crashed in
copy_surface_to_image during HEVC libva-vaapi-hwdownload. Coredump
backtrace shows memcpy on stale surface_object->destination_data[i]
pointer — cap_pool_destroy ran during my pixfmt-change teardown
branch, but the subsequent S_FMT got EBUSY because the OUTPUT
queue was already streaming. State corruption mid-decode.

Root cause: ffmpeg-vaapi calls vaCreateSurfaces2 *twice*, with
CreateContext+STREAMON between them. My CreateSurfaces2 gate
destructively tears down cap_pool on pixelformat change but can't
recover when REQBUFS(0) silently fails on a streaming queue.

surface.c:164-171 TODO comment from iter1 anticipated exactly this:
"STREAMOFF + REQBUFS(0) + new S_FMT + new CREATE_BUFS — that's a
context-level redesign for the next iteration." Phase 4 dismissed
the comment as targeting multi-resolution mid-stream. That
dismissal was wrong; ffmpeg-vaapi triggers the same code path.

3 reverts on fork master: 4b2288f, f8256e6, ce304ef reverted by
709ab34, 9a7f888, 6bc29ec. Backend rebuilt + reinstalled on fresnel
at iter4-tip SHA 6e90b7a9.... Post-revert HEVC libva returns the
pre-iter5b broken-but-non-crashing all-zero pattern.

Per Phase 1 lock: criteria 1 FAIL (HEVC/VP9/VP8 still all-zero);
criteria 2-4 PASS (no regression on MPEG-2/H.264 keyframe/control
payloads). iter5b does not close.

Phase 7 → Phase 4 loopback: re-plan as option β (defer OUTPUT-side
S_FMT+CREATE_BUFS to CreateContext where config_id is known and
streams haven't started). User pick: revert + re-plan with β.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:46:16 +00:00


Iteration 5b — Phase 7 (verification)

Verdict: FAIL. Reverted. Phase 7 → Phase 4 loopback with option β.

Captured 2026-05-12 mid-Phase-7. The empirical sweep on the post-iter5b backend triggered a SIGSEGV in copy_surface_to_image (image.c) during the HEVC libva-vaapi-hwdownload step. The pre-iter5b state was broken (all-zero HEVC) but non-crashing; iter5b's option α' regressed it to crashing. The 3 fork commits were reverted, fresnel was restored to the pre-iter5b backend binary (confirmed by SHA256), and the work loops back to Phase 4 with option β, as Phase 2 originally recommended.

Pre-Phase-7 expectation (per phase4_iter5b_plan.md C7)

Sweep matrix expected:

| Codec | Pre-fix libva | Expected post-fix libva |
|---|---|---|
| HEVC | 06b2c5a0… (all-zero) | 9340b832… (matches kdirect) |
| VP9 | 06b2c5a0… | 4f1565e8… |
| VP8 | 06b2c5a0… | 136ce5cb… |
| MPEG-2 | 19eefbf4… (worked) | unchanged |
| H.264 | 71ac099b… (keyframe partial) | unchanged (Bug 4) |
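
The matrix above can be checked mechanically. A minimal sketch, assuming the sweep output is compared by the 8-hex-digit prefixes shown in the table (the full digests are elided in this document); the names PRE_FIX, POST_FIX and classify are hypothetical and not part of the sweep script:

```python
# Hash prefixes copied from the sweep matrix; full digests are elided in
# this document, so comparison is by prefix only.
PRE_FIX = {"HEVC": "06b2c5a0", "VP9": "06b2c5a0", "VP8": "06b2c5a0",
           "MPEG-2": "19eefbf4", "H.264": "71ac099b"}
# None means "unchanged": the post-fix hash must still equal the pre-fix one.
POST_FIX = {"HEVC": "9340b832", "VP9": "4f1565e8", "VP8": "136ce5cb",
            "MPEG-2": None, "H.264": None}


def classify(codec: str, observed_sha256: str) -> str:
    """Return 'fixed', 'unchanged', or 'unexpected' for one sweep row."""
    post = POST_FIX[codec]
    if post is not None and observed_sha256.startswith(post):
        return "fixed"
    if observed_sha256.startswith(PRE_FIX[codec]):
        return "unchanged"
    return "unexpected"
```

For HEVC, VP9 and VP8 the expected outcome is "fixed"; for MPEG-2 and H.264 it is "unchanged". Anything classified "unexpected" would need a manual look.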

Empirical Phase 7 result

Phase 3 sweep script re-run on fresnel at post-iter5b backend SHA d7722da742bfcb86a9136b07e6d9a5de23668f37fcad328258966c5338265e82. Device numbering re-mapped for today's boot (iter4-B1: rkvdec at /dev/video1+/dev/media0, hantro at /dev/video5+/dev/media2).

Sweep terminated on the HEVC step with SIGSEGV.

Backtrace from /var/lib/systemd/coredump/core.ffmpeg.1000.4d1e78de91ce4af0b55dcd31f823033b.5959.…:

#0  memcpy                       (libc.so.6 + 0x9ef58)
#1  copy_surface_to_image        (v4l2_request_drv_video.so + 0x6908)
#2  RequestGetImage              (v4l2_request_drv_video.so + 0x6c4c)
#3  vaGetImage                   (libva.so.2 + 0x7390)
#4  hwcontext_vaapi-internal     (libavutil.so.60 + 0x460f0)
#5  av_hwframe_transfer_data     (libavutil.so.60 + 0x3e308)
…

memcpy in copy_surface_to_image at image.c:168:

memcpy(buffer_object->data + image->offsets[i],
       surface_object->destination_data[i],
       surface_object->destination_sizes[i]);

So at least one of the following holds: surface_object->destination_data[i] is NULL or points at freed memory; buffer_object->data is NULL; or the sizes are corrupt.

Mechanism (empirically traced)

Strace of HEVC libva run captured the call sequence. Decisive observations:

  • Two VIDIOC_S_FMT(OUTPUT_MPLANE) calls from a single ffmpeg-vaapi process.
  • First call: pixelformat = V4L2_PIX_FMT_H264_SLICE, success.
  • Second call: pixelformat = V4L2_PIX_FMT_HEVC_SLICE, EBUSY × 6 (retried six times by some inner ffmpeg loop or by my gate retry — both end with EBUSY).
  • Between the two S_FMT calls: device-wide V4L2_CID_STATELESS_H264_DECODE_MODE/START_CODE + _HEVC_DECODE_MODE/START_CODE controls (CreateContext-time) AND HEVC frame controls (decode-time S_EXT_CTRLS at size=40/64/280/1000/328) → kernel decoder is already streaming when the second S_FMT fires.

What's happening

ffmpeg-vaapi's hwcontext_vaapi pattern calls vaCreateSurfaces2 more than once per session, with vaCreateContext (and subsequent STREAMON + frame decode) interleaving between calls. The flow:

  1. ffmpeg-vaapi: probe surfaces — first vaCreateSurfaces2. At this point, config_heap is empty or has multiple configs (mismatched ffmpeg-internal flow), so my find_sole_active_pixelformat returns 0 → fallback to H264_SLICE.
  2. CreateContext, STREAMON, frame decode submission (decode fails silently because controls are HEVC but OUTPUT is H264_SLICE — pre-iter5b behavior preserved at this point).
  3. ffmpeg-vaapi: probably some surface-reallocation pattern internal to libavutil; second vaCreateSurfaces2.
  4. My code: find_sole_active_pixelformat now sees the active HEVC config → returns HEVC_SLICE. My gate detects pixelformat changed (H264_SLICE → HEVC_SLICE), enters teardown branch.
  5. Teardown calls cap_pool_destroy() → frees the CAPTURE mmaps. Then v4l2_request_buffers(OUTPUT, 0), which silently fails because the OUTPUT queue is streaming (REQBUFS(0) doesn't tear down a streaming queue).
  6. v4l2_set_format(HEVC_SLICE) returns EBUSY.
  7. My code returns VA_STATUS_ERROR_OPERATION_FAILED from CreateSurfaces2.
  8. But cap_pool is already destroyed. Surface objects in flight still hold destination_data[i] pointers into the freed CAPTURE mmaps.
  9. Decoder finishes the in-flight frame (or doesn't, since we destroyed its target). ffmpeg-vaapi calls vaGetImage → backend RequestGetImage → copy_surface_to_image → memcpy on stale destination_data[i] → SIGSEGV.
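
The nine steps above can be replayed against a toy model. This is a sketch, not the backend: the Device class, its fields and replay_ffmpeg_pattern are hypothetical, and the model encodes only the two kernel behaviours this report relies on (REQBUFS(0) fails on a streaming queue; S_FMT returns EBUSY while buffers exist). A freed mmap is modelled as None and the SIGSEGV as a MemoryError.

```python
import errno

H264_SLICE, HEVC_SLICE = "H264_SLICE", "HEVC_SLICE"


class Device:
    """Toy stand-in for the V4L2 state the backend manipulates."""

    def __init__(self):
        self.output_fmt = None
        self.output_buffers = 0
        self.streaming = False
        self.capture_mmaps = {}  # index -> bytearray, or None once unmapped

    def reqbufs_output(self, count):
        if self.streaming:
            return -errno.EBUSY  # a streaming queue keeps its buffers
        self.output_buffers = count
        return 0

    def s_fmt_output(self, pixelformat):
        if self.output_buffers:
            return -errno.EBUSY  # format is locked while buffers exist
        self.output_fmt = pixelformat
        return 0


def find_sole_active_pixelformat(configs):
    """Return the pixelformat if exactly one config is active, else 0."""
    return configs[0] if len(configs) == 1 else 0


def create_surfaces2(dev, configs, count):
    """Option α' gate: destructive teardown on pixelformat change."""
    wanted = find_sole_active_pixelformat(configs) or H264_SLICE
    if dev.output_fmt not in (None, wanted):
        dev.capture_mmaps = {i: None for i in dev.capture_mmaps}  # cap_pool_destroy()
        dev.reqbufs_output(0)  # result ignored: the "silent" failure
    if dev.output_fmt != wanted and dev.s_fmt_output(wanted) != 0:
        return None  # VA_STATUS_ERROR_OPERATION_FAILED, pool already gone
    dev.output_buffers = count  # CREATE_BUFS
    dev.capture_mmaps = {i: bytearray(16) for i in range(count)}
    return list(range(count))  # surface IDs double as CAPTURE indices


def get_image(dev, surface_id):
    data = dev.capture_mmaps[surface_id]
    if data is None:
        raise MemoryError("memcpy from freed CAPTURE mapping")
    return bytes(data)


def replay_ffmpeg_pattern():
    dev, configs = Device(), []
    surfaces = create_surfaces2(dev, configs, 2)  # step 1: falls back to H264_SLICE
    configs.append(HEVC_SLICE)                    # HEVC config becomes the sole active one
    dev.streaming = True                          # step 2: CreateContext + STREAMON
    second = create_surfaces2(dev, configs, 2)    # steps 3-7: teardown, then EBUSY
    try:
        get_image(dev, surfaces[0])               # step 9
        crashed = False
    except MemoryError:
        crashed = True
    return second, crashed
```

In this model the second create_surfaces2 call fails, and the surface handed out by the first call is left pointing at a destroyed mapping. The ordering is the problem: the teardown runs before anything has confirmed that the queue can actually be reconfigured.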

Why option α' is fundamentally wrong

The surface.c:164-171 TODO comment from iter1 era:

TODO: this is still not a clean architecture — v4l2_set_format after CREATE_BUFS requires REQBUFS(0) first (kernel returns EBUSY otherwise). For mpv's pattern (probe with small, then allocate big) the small probe surfaces have not been streamed yet, so REQBUFS(0) on them works. For consumers that legitimately stream multiple resolutions in sequence, we'd need to STREAMOFF + REQBUFS(0) + new S_FMT + new CREATE_BUFS — that's a context-level redesign for the next iteration.

Phase 4 of iter5b read this comment as targeting "multi-resolution mid-stream" only and dismissed it as not applicable to iter5b's "wrong-format-at-initial-create-surfaces" case. That dismissal was wrong. ffmpeg-vaapi's multi-CreateSurfaces2 pattern is exactly the case the TODO anticipates: streams are active when the second CreateSurfaces2 fires, REQBUFS(0) silently fails, my gate fires destructive teardown anyway, and state corruption follows.

The Phase 5 reviewer's original Phase 2 recommendation, option β (defer the OUTPUT-side S_FMT/CREATE_BUFS lifecycle to CreateContext), would have avoided this. CreateContext fires AFTER the binding config_id is known; there's exactly one CreateContext call per decode session in every consumer; and STREAMON happens at the same site, so the ordering is unambiguous. iter5b should have done β.

Action taken: revert + restore

Three reverts on the shared fork master, authored as claude-noether:

6bc29ec Revert "fresnel-fourier iter5b Phase 6 commit A: NEW src/codec.{h,c} — pixelformat_for_profile helper"
9a7f888 Revert "fresnel-fourier iter5b Phase 6 commit B: state-tracking — request.h field + config.c wire-up"
709ab34 Revert "fresnel-fourier iter5b Phase 6 commit C: surface.c — profile-derived OUTPUT pixel format"

The iter5b commits (4b2288f, f8256e6, ce304ef) stay in history for reference. Fork tip after reverts: 6bc29ec. Backend rebuilt + reinstalled on fresnel. SHA256 confirms restoration:

$ sha256sum /usr/lib/dri/v4l2_request_drv_video.so
6e90b7a9b2c33480dd3ffc2da8423ab0bcef14f23c68cf18dc2ae2ff66ac808c

Identical to iter4-close binary. Pre-iter5b state restored.

Empirical verification of restored state

Post-revert HEVC libva run:

ssh fresnel 'env LIBVA_DRIVER_NAME=v4l2_request \
  LIBVA_V4L2_REQUEST_NO_AUTODETECT=1 \
  LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
  LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
  ffmpeg -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi \
    -hwaccel_device /dev/dri/renderD128 \
    -i ~/fourier-test/bbb_720p10s_hevc.mp4 -frames:v 2 \
    -vf hwdownload,format=nv12,format=yuv420p \
    -f rawvideo /tmp/post_revert_hevc.yuv'

Exit code 0 (no SIGSEGV). The output hash 822e3a311bc34185394aeb709bb83310f1243089b95cdadc218eee179b1d6f78 is the SHA256 of 2764800 all-zero bytes, i.e. 2 frames of 720p output after the NV12-to-yuv420p conversion (verified: python3 -c "import hashlib; print(hashlib.sha256(b'\x00'*2764800).hexdigest())" produces the same hash). Pre-iter5b broken-but-non-crashing state restored.
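
The 2764800 byte count in that check follows from the frame geometry. A small sketch of the arithmetic, assuming "720p" means 1280×720 at 8 bits per sample; the helper name all_zero_yuv420p_sha256 is hypothetical:

```python
import hashlib


def all_zero_yuv420p_sha256(width: int, height: int, frames: int):
    """Return (byte_count, sha256 hex digest) of all-zero yuv420p output."""
    # yuv420p: one full-size luma plane plus two quarter-size chroma planes,
    # i.e. 1.5 bytes per pixel at 8 bits per sample.
    frame_bytes = width * height * 3 // 2
    total = frame_bytes * frames
    return total, hashlib.sha256(bytes(total)).hexdigest()


size, digest = all_zero_yuv420p_sha256(1280, 720, 2)
print(size)    # byte count of the rawvideo output
print(digest)  # compare against sha256sum of /tmp/post_revert_hevc.yuv
```

Each 1280×720 frame is 1382400 bytes, so two frames give the 2764800 bytes used in the one-liner above.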

Per-criterion verdict against Phase 1 lock

| # | Criterion | Verdict |
|---|---|---|
| 1 | HEVC + VP9 + VP8 libva == kdirect == sw | FAIL: HEVC + VP9 + VP8 still all-zero; HEVC additionally crashed during the build cycle (now reverted) |
| 2 | MPEG-2 unchanged | PASS (regression-free post-revert; during the failed build's sweep, MPEG-2 was never reached) |
| 3 | H.264 keyframe still decodes | PASS (unchanged post-revert) |
| 4 | Control-payload anchors hold | PASS (backend code is bit-identical to iter4 tip; no control-handling changes anywhere in the reverted commits) |

iter5b verdict: 3 of 4 criteria PASS, 1 of 4 FAIL. Criterion 1 was the goal of iter5b, so the FAIL there decides the outcome: the iteration does not close.

Cost of iter5b

  • Phase 0 loopback work: salvageable (the empirical Bug 2 diagnosis at surface.c:173 was correct; OUTPUT format IS the right thing to fix; the discovery survives the failed Phase 4 plan).
  • Phase 2 situation: salvageable (lifecycle trace is reusable for Phase 4 v2 with option β).
  • Phase 4 plan + Phase 5 review + Phase 6 implementation: failed work. ~200 LOC written + reverted. Real cost = wallclock of Phase 4-Phase 7.
  • Memory rules pinned: still valid (feedback_trace_fix_mechanism_to_consumer.md, the pin was correct independent of which option Phase 4 chose).

Net: half-day's wallclock burned for the empirical evidence that option α' is wrong. Phase 5 reviewer's original Phase 2 recommendation of β was correct; the Phase 4 author (me) overrode it on insufficient grounds.

Phase 7 → Phase 4 loopback decision

Per feedback_dev_process.md Phase 7 → Phase 4 edge: a failing Phase 7 loops back to Phase 4 with the new evidence. User pick 2026-05-12: "Revert 3 fork commits, re-plan as option β."

Next: rewrite Phase 4 plan as iter5b-β.

iter5b-β skeleton

  • Move OUTPUT-side S_FMT + CREATE_BUFS lifecycle from RequestCreateSurfaces2 to RequestCreateContext (the existing request_pool_init site at context.c:103+).
  • RequestCreateSurfaces2 keeps only CAPTURE-side allocation work AND ID-allocation work. But: hantro derives CAPTURE format from OUTPUT format at S_FMT time. If we defer OUTPUT S_FMT to CreateContext, then CAPTURE format isn't queryable at CreateSurfaces2 time either. Solution: also defer cap_pool_init AND destination_* field setup from CreateSurfaces2 to CreateContext.
  • RequestCreateSurfaces2 becomes "allocate surface object IDs and initialize per-surface bookkeeping; defer all V4L2-device state until CreateContext binds them to a profile."
  • RequestCreateContext: look up config_object->profile (already done at context.c:71), derive pixelformat via pixelformat_for_profile(), do S_FMT(OUTPUT, pixelformat), then existing request_pool_init flow, then cap_pool_init, then walk surface_ids and fill destination_*.
  • Predicted LOC delta: ~200 LOC across surface.c, context.c, request.h.
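
The intended ordering of the skeleton above, as a toy model. This is only a sketch of the plan, not the Phase 4 v2 design: the Backend class, its fields and the log list are hypothetical, the profile table is an illustrative two-entry subset, and pixelformat_for_profile here stands in for the helper from src/codec.{h,c}.

```python
# Illustrative subset; the real mapping lives in the codec helper.
PIXELFORMAT_FOR_PROFILE = {
    "VAProfileHEVCMain": "HEVC_SLICE",
    "VAProfileH264Main": "H264_SLICE",
}


def pixelformat_for_profile(profile):
    return PIXELFORMAT_FOR_PROFILE[profile]


class Backend:
    def __init__(self):
        self.configs = {}    # config_id -> profile
        self.surfaces = {}   # surface_id -> destination buffer, or None
        self.output_fmt = None
        self.streaming = False
        self.log = []        # device-touching steps, in order

    def create_config(self, profile):
        config_id = len(self.configs) + 1
        self.configs[config_id] = profile
        return config_id

    def create_surfaces2(self, count):
        # IDs and per-surface bookkeeping only; no V4L2 device state here.
        first = len(self.surfaces)
        ids = list(range(first, first + count))
        for sid in ids:
            self.surfaces[sid] = None
        return ids

    def create_context(self, config_id, surface_ids):
        # The profile is known here, so the OUTPUT format is unambiguous.
        self.output_fmt = pixelformat_for_profile(self.configs[config_id])
        self.log += ["S_FMT", "request_pool_init", "cap_pool_init"]
        for sid in surface_ids:  # fill destination_* per surface
            self.surfaces[sid] = bytearray(16)
        self.streaming = True
        self.log.append("STREAMON")


backend = Backend()
probe_ids = backend.create_surfaces2(2)          # before any config is bound
config_id = backend.create_config("VAProfileHEVCMain")
backend.create_context(config_id, probe_ids)
late_ids = backend.create_surfaces2(2)           # second call, queue streaming
```

In this model a second create_surfaces2 call while streaming cannot disturb the existing surfaces, because it never touches device state. Surfaces created after CreateContext get no destination buffers in the sketch; how β should handle them is left for the Phase 4 v2 plan.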

Reusable from iter5b

  • src/codec.{h,c} (the pixelformat_for_profile helper) — useful for both α' and β. Re-introduce in iter5b-β commit A.
  • object_config->pixelformat wire-up at CreateConfig — useful for both. Re-introduce in iter5b-β commit B.
  • The Phase 2 situation lifecycle analysis at phase2_iter5b_situation.md — reusable.
  • Phase 3 baseline (iter5_phase3_baseline.tgz) — still the relevant pre-fix anchor.
  • Phase 5 review's CRIT-1 fix (int iter not struct, cast-on-return pattern) — applies to any heap walk in β too if β still does one (probably doesn't; CreateContext has config_id directly).

Memory updates

The feedback_trace_fix_mechanism_to_consumer.md rule pinned earlier in this iteration was the right idea but didn't catch this specific failure mode. The relevant lesson now: trace the consumer's call pattern, not just the consumer's read sites. The Phase 5 reviewer checked the read sites in image.c empirically but didn't simulate ffmpeg-vaapi's multi-CreateSurfaces2 call pattern. The miss was on the "when does the consumer call my entry point, and how many times?" axis, not the "what does the consumer do with the data?" axis.

Consider amending or adding a sibling rule. Defer the memory update to iter5b-β Phase 8 close (or skip if the lesson is already covered).

Substrate at Phase 7 close

  • Fork tip: 6bc29ec (revert of iter5b commit A).
  • Backend installed SHA256: 6e90b7a9b2c33480dd3ffc2da8423ab0bcef14f23c68cf18dc2ae2ff66ac808c (identical to iter4-close).
  • Kernel: linux-fresnel-fourier 7.0-1 (unchanged).
  • Test fixtures: unchanged.
  • Phase 3 baseline: still valid, still relevant.
  • Phase 4 v1 (α'): rejected.
  • Phase 4 v2 (β): to-be-written.