diff --git a/phase0_findings_iter6.md b/phase0_findings_iter6.md
index be64e06..cd432c2 100644
--- a/phase0_findings_iter6.md
+++ b/phase0_findings_iter6.md
@@ -140,13 +140,17 @@ Same as iter5. Plus for new candidates:
 - For E (perf): `pidstat -u` for CPU%, Mali-G52 freq via `/sys/class/devfreq/fde60000.gpu`.
 - For F (MPEG-2): need an MPEG-2 fixture (`mpv --dump-stream` from a public DVD or transcode bbb to MPEG-2).
 
-## In-scope (LOCKED 2026-05-05 for iteration 6) — I (Firefox VIDIOC_S_EXT_CTRLS / VIDIOC_QBUF EINVAL)
+## In-scope (LOCKED 2026-05-05 for iteration 6) — A∪I (cap_pool / OUTPUT-buffer-recycle lifecycle)
 
-Operator locked candidate **I** (Firefox VIDIOC_QBUF EINVAL on first frame, enriched post-iter5-amend telemetry to also include S_EXT_CTRLS EINVAL).
+Operator locked candidate **I**, then **merged with candidate A** after Phase 2 telemetry (2026-05-05) showed they are facets of the same underlying bug:
 
-Why I: deterministic repro on two consumers (bbb fixture + YouTube avc1), narrowest scope of any candidate, gates the only known consumer hard-failure under the post-amendment binary. Also: iter5-amend just unblocked Firefox's path; closing this completes the Firefox HW-decode story end-to-end.
+- **A (iter5 sonnet C4 carryover)**: cap_pool resolution-change race — REQBUFS-EBUSY when CAPTURE pool isn't drained before re-allocation. mpv libplacebo `--vo=gpu` triggers it via Vulkan-fallback resolution change.
 
-Other candidates (A, B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) remains separate top-level campaign.
+- **I (iter6 candidate)**: Firefox `VIDIOC_QBUF` EINVAL on OUTPUT after ~19 successful S_EXT_CTRLS calls. The original "S_EXT_CTRLS EINVAL on frame 1" framing was transient state; the reproducible failure is at OUTPUT-buffer requeue with rotating `source_index`. mpv-vaapi-copy (single-surface recycle) doesn't hit this; Firefox (multi-surface rotation through libva) does.
+
+Why merge: both are buffer-pool / surface-recycle lifecycle issues at OUTPUT (and CAPTURE) drain ordering. Partial fixes risk just shifting the symptom. The intended Phase 4 fix is a single coherent rework of cap_pool + DQBUF/QBUF sequencing in the surface lifecycle.
+
+Other candidates (B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) remains separate top-level campaign.
 
 ## Out-of-scope (LOCKED 2026-05-05 for iteration 6)
 
@@ -156,16 +160,17 @@ Other candidates (A, B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) r
 - New codecs OUTSIDE H.264 / MPEG-2 (VP8/VP9/AV1/HEVC out per iter1 lock).
 - New target hardware (fresnel, ampere) — separate campaign (H above).
 
-## Phase 1 success criterion (LOCKED 2026-05-05 for iteration 6)
+## Phase 1 success criterion (LOCKED 2026-05-05 for iteration 6, AMENDED 2026-05-05 for A∪I merge)
 
-> Firefox 150 (iter5-amend, sandbox enabled, `LIBVA_DRIVER_NAME=v4l2_request`) plays a known-h264 fixture (`bbb_1080p30_h264.mp4`) for ≥30 seconds with HW decode actually engaged: `cap_pool_init` succeeds, **zero `Unable to set control(s)` and zero `Unable to queue buffer`** in driver stderr, `lsof /dev/video1` shows the Firefox Utility process holding the device throughout playback, frames advance, no SW fallback in `about:support`'s "Decoder Backend" fields.
+> All three consumer paths must be GREEN on the iter6 driver:
 >
-> Acceptance evidence (capture all three):
-> 1. Driver stderr lines: only the single per-process `cap_pool_init: 24 slots ready` log, no per-frame errors.
-> 2. `lsof /dev/video1` snapshot at t=15s into playback shows a Firefox process (PID parent or descendant of the launcher) with the device open.
-> 3. about:support's media decoder section names `vaapi`-or-equivalent for video/h264, not `ffvpx`.
+> 1. **Firefox** (iter5-amend binary, sandbox enabled, `LIBVA_DRIVER_NAME=v4l2_request`) plays `bbb_1080p30_h264.mp4` for ≥30 seconds with HW decode engaged: zero `Unable to queue buffer` / `Unable to set control(s)` per-frame, `lsof /dev/video1` shows the Firefox Utility process holding the device, frames advance.
 >
-> Phase 5 sonnet review must explicitly confirm that the fix is on the libva-side (or jointly libva + a Firefox-side patch), and that mpv-vaapi-copy 2000-frame test still GREEN (no regression introduced).
+> 2. **mpv libplacebo `--vo=gpu`** runs ≥30 seconds on the same fixture without segfault and without REQBUFS-EBUSY events at init or resolution-change boundaries (carries iter5 sonnet C4 caveat to closure).
+>
+> 3. **mpv `--hwdec=vaapi-copy`** (regression check) decodes 2000 frames clean, identical pattern to iter5-end driver baseline (sha `4bed52ec5d44b389…`).
+>
+> Phase 5 sonnet review must confirm: (a) fix is libva-side (not consumer-specific kludge), (b) all three consumer paths verified, (c) no new mutable global state introduced (Track E discipline).
 
 ## Phase 1 LOCKED. Iteration 6 proceeds.
 
diff --git a/phase2_iter6_situation.md b/phase2_iter6_situation.md
new file mode 100644
index 0000000..bb60a71
--- /dev/null
+++ b/phase2_iter6_situation.md
@@ -0,0 +1,174 @@
+# Phase 2 (iter6) — situation analysis: Firefox H.264 EINVAL
+
+Opened 2026-05-05 immediately after iter6 Phase 1 lock on candidate I.
+
+## Working hypothesis
+
+The iter5-amend Firefox 150 binary (Utility seccomp fix shipped) loads the v4l2_request driver, sets up surfaces (`cap_pool_init: 24 slots ready`), then fails on either:
+- `VIDIOC_S_EXT_CTRLS` EINVAL — observed on YouTube avc1 (Enhancer for YouTube forcing h264, multiple decode attempts)
+- `VIDIOC_QBUF` EINVAL — observed on `bbb_1080p30_h264.mp4` direct file URL (single decode attempt)
+
+mpv `--hwdec=vaapi-copy` decoded 2000 frames clean on the same iter5-end driver build. So this is consumer-specific (Firefox vs mpv) and stream-specific (S_EXT_CTRLS path vs QBUF path).
+
+## Diagnostic build (iter6-DX)
+
+Source tarball'd from the campaign fork at HEAD `c8b6ede` and pushed to ohm at `/home/mfritsche/iter6-fork-dx/`. Two diagnostics added (local-only, not committed):
+
+1. `v4l2_set_controls`: per-control `VIDIOC_TRY_EXT_CTRLS` isolation when S_EXT_CTRLS EINVAL fires. Iterates each compound control alone, escapes the kernel cluster-commit obfuscation per `feedback_kernel_obfuscation_compound.md`. Reinstates the iter4 diagnostic (`f21bdf0`) that iter5 sweep removed (`848fc0c`).
+
+2. `v4l2_queue_buffer`: full `v4l2_buffer` struct dump on QBUF EINVAL — type, memory, index, length, bytesused, timestamp, flags, request_fd. Untangles which field the kernel objects to.
+
+Built on ohm with meson/ninja. Output sha `e01e2cc05b925cfdb2ad8e79a59ab796c90b9b20cdc4863a7c7f9c64624ed5a2`. Installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Iter5-end driver backed up at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`.
+
+## Telemetry runs
+
+### Run 1 — bbb_1080p30_h264.mp4 (with iter6-DX driver, S_EXT_CTRLS diagnostic only)
+
+```
+v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
+v4l2-request: Unable to queue buffer: Invalid argument
+```
+
+**Observation:** No ITER6_DX TRY-isolation lines fired → S_EXT_CTRLS is NOT failing on bbb. The QBUF EINVAL is the sole error. Single attempt, then Firefox falls back to SW.
+
+This means the diagnostic placement was incomplete for this stream. The QBUF-side diagnostic (just authored in this Phase 2) is needed. After rebuild + re-test, we'll have the full `v4l2_buffer` payload at the failing call.
+
+### Run 2 — YouTube avc1 (planned)
+
+YouTube URL: `https://www.youtube.com/watch?v=7DAPd5MGodY` with Enhancer for YouTube extension forcing h264 codec.
+
+Prior iter5-amend run (without diagnostic): emitted both `Unable to set control(s)` and `Unable to queue buffer` errors across 4 cap_pool_init events (multiple decode attempts on one tab as the player retried).
+
+Run 2 with iter6-DX driver will print the failing compound H.264 control's id + size (and which others passed). Will identify whether the issue is in SPS, PPS, SLICE_PARAMS, DECODE_PARAMS, or SCALING_MATRIX.
+
+## Why-but-not-mpv reasoning
+
+mpv `--hwdec=vaapi-copy` uses its own libva client code, inherited from FFmpeg's reference implementation. Firefox's Linux media stack uses the libva wrapper through the FFmpeg path too, but the surface management is different: Firefox exports surfaces as DMA-buf via `vaExportSurfaceHandle` for zero-copy, while mpv-vaapi-copy uses VAImage readback (per the `-copy` semantics).
+
+Possible divergences that lead to S_EXT_CTRLS / QBUF EINVAL:
+- DMA-buf attach changes the buffer's exported state — kernel may reject QBUF on a buffer that has been EXPBUF'd
+- VABufferType ordering — Firefox may submit slice data buffers before SPS/PPS for some frames
+- Firefox's SPS payload builder might include a field mpv doesn't (e.g. profile_idc edge case, separate_colour_plane_flag)
+- request_fd lifecycle: iter4 fixed mpv's per-frame request_fd allocation; Firefox may share request_fd across frames in a way the kernel rejects
+
+The diagnostic output should narrow this in a single retest.
+
+## Next steps (Phase 2 → Phase 3 → Phase 4)
+
+1. Add QBUF-side diagnostic to ohm's `/home/mfritsche/iter6-fork-dx/src/v4l2.c` (waiting on ohm-tools MCP reconnection).
+2. Rebuild + reinstall iter6-DX driver on ohm.
+3. Re-run bbb fixture: capture full v4l2_buffer payload at QBUF failure.
+4. Re-run YouTube avc1: capture per-control TRY isolation output identifying failing compound H.264 control.
+5. Document findings as Phase 2 evidence.
+6. Phase 3: anchor pre-fix baseline (iter5-end driver behavior + iter5-amend Firefox + iter6-DX driver behavior) for before/after comparison.
+7. Phase 4: design + implement the actual semantic fix in libva-side struct setup (per `feedback_no_fixture_hardcoding.md`: must be a general-case fix, not a stream-specific kludge).
+
+## Stop point
+
+Waiting on ohm-tools MCP reconnect. After that, autonomous through Phase 4 plan authoring (sonnet review at Phase 5 is mandatory per CLAUDE.md).
+
+## Phase 2 finding update (post-payload-comparison telemetry, 2026-05-05)
+
+### Working symptom shifted
+
+Initial assumed bug: "Firefox VIDIOC_S_EXT_CTRLS EINVAL on frame 1, all controls fail individually with cluster-validation error_idx=count=1, looks like iter4 request_fd-staleness pattern reappearing in a different consumer."
+
+Actual current bug (after payload comparison + multiple test cycles):
+
+| Phase | mpv-vaapi-copy | Firefox |
+|---|---|---|
+| Device-init `S_EXT_CTRLS` (DECODE_MODE/START_CODE, request_fd=-1) | rc=0 | rc=0 |
+| Per-frame `S_EXT_CTRLS` (4 controls, request_fd≥0) | rc=0 sustained 2000+ frames | rc=0 for 19 frames |
+| Per-frame `QBUF` (OUTPUT, request_fd) | rc=0 sustained 2000+ frames | **EINVAL on frame 20** |
+
+Failing QBUF detail (steady-state): `video_fd=32 request_fd=30 type=10 memory=1 index=1 length=1 bytesused(plane0)=66661 flags=0x800000`. type=10=`V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE`, flags `V4L2_BUF_FLAG_REQUEST_FD`. Original "first-frame fail" pattern observed before this Phase 2 work seems to have been transient kernel/Firefox state — does not reproduce on fresh runs with the iter6-DX diagnostic build.
+
+### Payload comparison (mpv vs Firefox, frame 1)
+
+First 64 bytes of every per-frame ext_control are **byte-identical** between mpv (works) and Firefox (works through frame 19, fails at QBUF on frame 20):
+```
+SPS: 4d 00 29 00 01 00 00 01 00 03 02 00 00 00 ... (Main profile, level 4.1)
+PPS: 00 00 00 00 00 00 02 00 00 00 8a 00
+DECODE_PARAMS: 00 00 00 00 ... (zeroed DPB at first frame)
+SCALING_MATRIX: 10 10 10 10 ... (default flat 0x10 = 16)
+```
+So the *control payloads* aren't where mpv and Firefox diverge.
+
+### New hypothesis (cap_pool / source-index recycling)
+
+mpv-vaapi-copy uses single-surface recycling — always one OUTPUT buffer index recycled in place. Firefox rotates through multiple surfaces → driver issues OUTPUT QBUF on `source_index=0, 1, 2, ...` over consecutive frames. By frame 20, some `source_index` is being requeued before the prior submission has been DQBUF'd by RequestSyncSurface, OR the buffer pool is exhausted, OR the kernel-side OUTPUT buffer state machine is rejecting recycle.
+
+This is structurally similar to iter5 Phase 5 sonnet caveat C4 (cap_pool resolution-change race) — buffer-pool lifecycle issue. Possibly a different facet of the same root weakness: cap_pool / OUTPUT buffer drain isn't sequenced cleanly when the consumer issues fast-rotating surface IDs.
+
+### Diagnostic build status (iter6-DX)
+
+- Source tree: `/home/mfritsche/iter6-fork-dx/` on ohm
+- Local-only patches in `src/v4l2.c`:
+  - per-control TRY isolation in `v4l2_set_controls` failure path
+  - per-call S_EXT_CTRLS rc + first-control id/value log
+  - first-perframe-call payload hex dump (one-shot)
+  - QBUF failure full v4l2_buffer struct dump
+- Built sha (current): `63ea0e630748cc81e98c3d3108fa5f593f60f880fa6a18a28fbafb6864740648`
+- Installed at `/usr/lib/dri/v4l2_request_drv_video.so`
+- Iter5-end backup at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`
+
+These patches are diagnostic-only (no semantic change). Will not be committed to campaign repo.
+
+### Next steps
+
+The OUTPUT-QBUF-on-frame-20 finding warrants:
+1. Cross-check: does our driver allocate enough OUTPUT buffers? `REQBUFS` count vs surface rotation pattern.
+2. Do we DQBUF OUTPUT buffers correctly between submissions? Look at picture.c QBUF path + surface.c RequestSyncSurface DQBUF path.
+3. Strace at frame ~19-20 boundary to see EXACT kernel ioctl sequence preceding the EINVAL.
+4. Whether iter5 Phase 5 caveat C4 (cap_pool race) and this iter6-I finding share root cause — they might be the same bug under different names.
+
+User check-in needed: this is structurally similar to iter5 carryover candidate A (cap_pool resolution-change race). Probably worth merging iter6-I scope with candidate A or treating them as one investigation.
+
+Operator decision (2026-05-05): merge accepted. Scope is now A∪I — the merged buffer-pool / surface-recycle lifecycle.
+
+## Phase 2 wrap (2026-05-05) — bug is intermittent, request_fd / OUTPUT-buffer race
+
+After three test runs with the iter6-DX diagnostic build:
+
+| Run | Successful frames | Failure point |
+|---|---|---|
+| 1 | 1 | OUTPUT QBUF EINVAL index=3 |
+| 2 | 19 | OUTPUT QBUF EINVAL index=1 |
+| 3 | 53 | OUTPUT QBUF EINVAL index=0 |
+
+Failure is **intermittent**. Different frame numbers. Different OUTPUT pool slot indices. Always: request_fd=30, type=OUTPUT_MPLANE, REQUEST_FD flag set. Always: bytesused varies with the frame's slice payload size. mpv-vaapi-copy never hits it (verified clean ≥3 frames in this Phase 2 testing; iter5 verified 2000 frames clean).
+
+### Theories ruled out
+
+- **DQBUF returns wrong index** — diagnostic logs `expected vs kernel-returned` and fired zero mismatches across 53 successful + 1 failed cycle. Kernel returns the same index our code passes.
+- **Control payload divergence** — first 64 bytes of every per-frame ext_control byte-identical between mpv (works) and Firefox (works frame N, fails frame N+1).
+- **Frame-1 setup failure** — original symptom does not reproduce on iter6-DX. S_EXT_CTRLS device-init call (DECODE_MODE/START_CODE) succeeds for both consumers.
+
+### Surviving hypothesis
+
+**request_fd lifecycle race.** request_fd=30 is reused on every frame (Linux fd-allocation reuses lowest-free after close). The iter4 fix (`385dee1`) closes request_fd in RequestSyncSurface after DQBUF. But:
+
+1. If close() does not synchronously release the kernel-side request object (it may defer cleanup until the request is fully drained from any in-flight queue), the next `MEDIA_IOC_REQUEST_ALLOC` may return fd=30 pointing at a *new* request object while the OLD object is still mid-cleanup with its OUTPUT buffer attached.
+2. Then frame N+1's QBUF(OUTPUT, request_fd=30) attempts to attach a new buffer to a request that already has one — kernel rejects with EINVAL.
+
+Why mpv doesn't hit this: single-surface recycle means each frame's `RequestSyncSurface → close(request_fd) → DQBUF → next BeginPicture` cycle fully completes before the next frame's BeginPicture. Firefox's MediaSource pipeline issues multiple BeginPicture in flight (separate libva surface IDs), so a stale request_fd from a not-yet-fully-closed previous frame can collide with a new frame's allocation.
+
+### Phase 4 fix sketch
+
+Three angles, leading candidate is **C**:
+
+A. **Use unique request fds, not reused-30.** After `media_request_alloc`, `dup()` the fd to push it past low-fd-reuse range, OR maintain a pool of pre-allocated request fds reset via REINIT. iter4 chose close+alloc over REINIT because REINIT failed for the iter1-3 frame-11 case — but that may have been the DPB / control-payload bug, not this lifecycle race.
+
+B. **Serialize close+alloc with explicit barrier.** After `close(request_fd)`, sleep / probe before next `MEDIA_IOC_REQUEST_ALLOC`. Brittle.
+
+C. **Drain OUTPUT pool slot before surface re-bind (mirror cap_pool discipline)**. picture.c:245 unbinds CAPTURE slot if `surface_object->current_slot != NULL`. Add equivalent for OUTPUT: at start of BeginPicture, if the surface still owns an OUTPUT slot from a previous decode that wasn't sync'd yet, drive that slot to drain (close request_fd, DQBUF its buffer, request_pool_release) before acquiring a new slot. Mirrors iter4's "drain before reuse" discipline at the right layer.
+
+C is the cleanest. The OUTPUT-side gap mirrors the iter4 work for request_fd; this extends the same discipline to the OUTPUT pool slot.
+
+### Stop point (revised)
+
+Phase 2 has identified the bug class: **OUTPUT-pool / request_fd race, surfaced by Firefox's multi-surface concurrent-BeginPicture pattern that mpv-vaapi-copy doesn't hit**.
+
+Phase 4 leading approach: **C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot**, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.
+
+Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).
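
Reviewer sketch (not part of the patch): the "drain before reuse" ordering proposed in option C can be modelled in a few lines of C. Every name here (`output_slot`, `surface`, `pool_acquire`, `drain_slot`, `begin_picture`, `POOL_SIZE`) is hypothetical and does not come from the driver source. `drain_slot` only stands in for the real sequence of closing the request fd and issuing `VIDIOC_DQBUF` on the OUTPUT queue; no ioctl is made. This is an illustration of the intended ordering under those assumptions, not the Phase 4 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4 /* hypothetical; the real OUTPUT pool size is not stated */

/* Hypothetical model of one OUTPUT pool slot (not the driver's struct). */
struct output_slot {
    int index;      /* V4L2 buffer index this slot maps to */
    bool queued;    /* true between QBUF and the matching DQBUF */
    int request_fd; /* -1 when no request is attached */
};

/* Hypothetical surface: remembers the slot it still owns, if any. */
struct surface {
    struct output_slot *output_slot;
};

static struct output_slot pool[POOL_SIZE];
static bool in_use[POOL_SIZE];

void pool_init(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i].index = i;
        pool[i].queued = false;
        pool[i].request_fd = -1;
        in_use[i] = false;
    }
}

/* Hand out the lowest free slot, or NULL when the pool is exhausted. */
static struct output_slot *pool_acquire(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            return &pool[i];
        }
    }
    return NULL;
}

static void pool_release(struct output_slot *slot)
{
    in_use[slot->index] = false;
}

/* Stand-in for close(request_fd) followed by DQBUF of the OUTPUT buffer. */
static void drain_slot(struct output_slot *slot)
{
    slot->request_fd = -1;
    slot->queued = false;
}

/* Returns the OUTPUT buffer index used for this frame, or -1 on failure. */
int begin_picture(struct surface *surf, int request_fd)
{
    /* Option C: a slot left over from an un-synced decode is drained
     * and released before a new slot is acquired. */
    if (surf->output_slot != NULL) {
        drain_slot(surf->output_slot);
        pool_release(surf->output_slot);
        surf->output_slot = NULL;
    }

    struct output_slot *slot = pool_acquire();
    if (slot == NULL || slot->queued)
        return -1; /* exhausted pool, or slot still in flight */

    slot->request_fd = request_fd;
    slot->queued = true; /* models QBUF with the request fd attached */
    surf->output_slot = slot;
    return slot->index;
}
```

In this model, calling `begin_picture` twice on the same surface without an intervening sync returns index 0 both times, because the first slot is drained and released before re-acquisition. Without the drain branch, the second call would take slot 1 and leave slot 0 marked in flight, so the pool would run dry after `POOL_SIZE` un-synced frames.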