libva-multiplanar/phase0_evidence/2026-05-04-firefox-live/findings.md
marfrit e892cea858 Phase 0 deliverable #3 (Firefox live session): inverted verdict
Re-tested Firefox 150.0.1 inside operator's active Plasma 6 Wayland
session (not Xvfb). Two-layer finding:

1. Firefox engages libva in real Plasma session: full V4L2-stateless
   contract lifecycle completes, no EINVAL on the request-API path,
   v4l2_request_drv_video.so successfully loaded, /dev/video1 +
   /dev/media0 opened by RDD utility process 146420.

2. Kernel produces no decoded pixel output: CAPTURE buffer returns
   from DQBUF with the patch-0011 sentinel pattern 0xab unchanged.
   Hantro never wrote the buffer despite the contract trace looking
   clean. Firefox detected the failed first frame and silently fell
   back to SW decode in RDD's FFmpeg-OS-library PDM. User-visible
   playback continues normally for 5+ minutes (operator confirmed
   t=337s playback time in live inspection).

Cross-checked against the prior 2026-05-04 mpv vaapi-copy run: 68 of
68 mpv CAPTURE buffers show the same sentinel-survives pattern.
mpv's --vo=null consumed all 68 sentinel buffers as if they were
valid NV12 frames; the failure was invisible. OUTPUT bytes are
byte-for-byte identical between mpv and Firefox (same IDR slice via
libavcodec, both consumers feed hantro the same data, hantro
silently drops both).

Implication: the prior Phase 0 in-session re-verification verdict
(commit f15ba8b: "the 2026-04-26 picture holds at boolean-correctness
level") was wrong at the kernel-decode layer. The patch-0011 sentinel
test in the deployed Step 1 build was authored specifically to detect
this failure mode; the predecessor close-out didn't grep for it, and
contract-trace cleanliness was mistaken for end-to-end success.

Phase 1 lock should be deferred until: (a) boolean-correctness
criterion is sharpened to require pixel-content verification,
(b) Phase 0 acquires kernel-side observability (ftrace, dmesg) to
characterize WHY hantro is silent. Step 1 engages libva but doesn't
make hantro decode -- Phase 6 has substantive work beyond the
18-patch series.

Likely failure-mode candidates flagged in findings_live.md priority
order: reference_ts not propagated; DECODE_PARAMS slice_header
bit_size zero; POC sentinel may still leak past patch-0015 strip;
level_idc over-allocation; SOURCE_CHANGE event handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 10:38:57 +00:00


Phase 0 deliverable #3 — Firefox VAAPI engagement, LIVE Plasma session (2026-05-04)

Re-test of phase0_evidence/2026-05-04-firefox/findings.md inside an active Plasma 6 Wayland session on ohm (XDG_SESSION_TYPE=wayland, XDG_RUNTIME_DIR=/run/user/1001, kwin_wayland 144533, plasmashell 144667, Xwayland :0). Same Firefox profile + LIBVA env vars; only the gfx environment changed.

Verdict

Two-layer finding that inverts the prior Phase 0 re-verification verdict.

| Layer | Result |
| --- | --- |
| Firefox loads libva-v4l2-request driver | dlopen v4l2_request_drv_video.so succeeds; no sandbox/gating issue under real Plasma session |
| Firefox completes the V4L2-stateless contract lifecycle on hantro | REQUEST_ALLOC → S_FMT → CREATE_BUFS → STREAMON → S_EXT_CTRLS → QBUF + REQUEST_QUEUE → DQBUF + EXPBUF, no EINVAL on the request-API path |
| Kernel produces decoded pixel output | CAPTURE buffer returns with patch-0011 sentinel 0xab unchanged. Hantro never wrote the buffer. |
| Consumer reaction to bad output | Firefox detects the failed first frame and silently falls back to software decode in RDD's FFmpeg-OS-library PDM. User-visible playback continues normally; observed t=337s (5+ minutes) of stable bunny playback via SW. |

This means the prior Phase 0 re-verification verdict (commit f15ba8b, "the 2026-04-26 picture holds at boolean-correctness level") was wrong: the prior test captured a clean contract trace but never inspected the decoded pixel content.

What the live-session strace shows

Decode happened on PID 146420 (Firefox utility process for hwaccel), /dev/video1 fd 7, /dev/media0 fd 8.

Single-frame attempt (full ioctl summary, exhaustive — not a sample):

22 VIDIOC_ENUM_FMT       (probe, including expected MPLANE-fallback EINVALs)
 5 VIDIOC_QUERYBUF
 4 VIDIOC_G_FMT
 2 VIDIOC_STREAMON       (OUTPUT_MPLANE + CAPTURE_MPLANE)
 2 VIDIOC_STREAMOFF      (the bail-out after frame 0)
 2 VIDIOC_S_EXT_CTRLS    (1 device-wide DECODE_MODE+START_CODE per patch 0002, 1 per-request)
 2 VIDIOC_REQBUFS
 2 VIDIOC_QBUF           (1 OUTPUT, 1 CAPTURE)
 2 VIDIOC_DQBUF          (1 OUTPUT, 1 CAPTURE)
 2 VIDIOC_CREATE_BUFS
 1 VIDIOC_S_FMT          (OUTPUT_MPLANE H264_SLICE 1920x1088)
 1 VIDIOC_EXPBUF         (DMA-BUF export of CAPTURE buffer)
 1 MEDIA_REQUEST_IOC_QUEUE
 1 MEDIA_REQUEST_IOC_REINIT
 1 MEDIA_IOC_REQUEST_ALLOC

Single QBUF/DQBUF pair, single MEDIA_REQUEST_IOC_QUEUE = exactly one frame attempted. No EINVAL on any request-API ioctl. Two STREAMOFF = clean shutdown of both queues after the failed frame.
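
The tally above can be reproduced mechanically from a strace file. The following is a minimal sketch, assuming the trace uses strace's usual `ioctl(fd, REQUEST, ...)` rendering; the helper names and the sample lines are illustrative and are not taken from strace_146420.

```python
import re
from collections import Counter

# Matches the request name in a strace line such as
#   ioctl(7, VIDIOC_QBUF, {...}) = 0
# Requests that strace cannot decode (printed as _IOC(...)) are skipped.
IOCTL_RE = re.compile(r"ioctl\(\d+,\s*([A-Z][A-Z0-9_]+)")


def tally_ioctls(strace_text):
    """Return a Counter of ioctl request names found in strace output."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = IOCTL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def frames_attempted(counts):
    """Each submitted decode request shows up as one MEDIA_REQUEST_IOC_QUEUE."""
    return counts.get("MEDIA_REQUEST_IOC_QUEUE", 0)


# Hypothetical sample lines (argument rendering is illustrative only).
sample = """\
ioctl(8, MEDIA_IOC_REQUEST_ALLOC, 0x7ffc2a10) = 0
ioctl(7, VIDIOC_QBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, index=0}) = 0
ioctl(7, VIDIOC_QBUF, {type=V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, index=0}) = 0
ioctl(9, MEDIA_REQUEST_IOC_QUEUE) = 0
ioctl(7, VIDIOC_DQBUF, {type=V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE}) = 0
ioctl(7, VIDIOC_DQBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE}) = 0
"""

counts = tally_ioctls(sample)
for name, n in counts.most_common():
    print(f"{n:3d} {name}")
print("frames attempted:", frames_attempted(counts))
```

Counting MEDIA_REQUEST_IOC_QUEUE rather than QBUF avoids double-counting, since every frame queues both an OUTPUT and a CAPTURE buffer.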

After the libva STREAMOFF, no further V4L2 ioctls anywhere in the strace. Bunny continued playing for 5+ minutes via SW decode — confirmed by user inspection (t=337s playback time visible in window title).

What the .so DEBUG instrumentation shows

stderr_live (4 lines, 553 B, the entire output of patches 0010+0011+0014):

v4l2-request: VAPictureH264 sizeof=36 CurrPic[0..31]: 00 00 00 04 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
v4l2-request: VAPictureH264 CurrPic field reads: picture_id=0x04000000 frame_idx=0 flags=0x0 TopFOC=65536 BottomFOC=65536 frame_num=0
v4l2-request: OUTPUT[idx=0, len=6272]: 00 00 01 25 b8 20 20 21 44 c5 00 01 57 9b ef be fb ef be fb ef be fb ef be fb ef be fb ef be fb
v4l2-request: CAPTURE[idx=0, plane0]: ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  • VAPictureH264 dump (patch 0014): TopFOC=65536 and BottomFOC=65536 are the ffmpeg-vaapi POC sentinel values that patch 0015 is supposed to strip. The dump runs before patch 0015's strip in the code path, so it captures the raw VAAPI input. After the strip, the values handed to V4L2 should be 0/0; this is worth verifying with a second dump placed after the strip.
  • OUTPUT buffer (patch 0010): 00 00 01 25 b8 20 ... is correct ANNEX_B start code + IDR slice NAL header (0x25 = forbidden_zero=0, nal_ref_idc=01, nal_unit_type=00101 = IDR slice). The ef be fb repeating pattern that initially looked like poison is real H.264 RBSP slice data (CABAC bin sequences are random-looking). 6272 bytes is plausible for a 1080p IDR slice.
  • CAPTURE buffer (patch 0011): All 32 bytes of plane 0 still hold 0xab — the sentinel pattern patch 0011 wrote before QBUF. The kernel returned the buffer via DQBUF without writing to it. Hantro did not decode this frame.
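
The field reads and the NAL header interpretation above can be cross-checked by decoding the dumped bytes by hand. This is a small sketch, assuming the little-endian VAPictureH264 layout from libva's va.h (picture_id, frame_idx, flags, TopFieldOrderCnt, BottomFieldOrderCnt, then padding, 36 bytes in total); the function names are illustrative and not part of the driver.

```python
import struct


def parse_va_picture_h264(raw):
    """Decode the first 20 bytes of a VAPictureH264 dump.

    Assumed layout: three uint32 fields followed by two int32 field order
    counts, little-endian. The trailing reserved words are ignored.
    """
    picture_id, frame_idx, flags, top_foc, bottom_foc = struct.unpack_from(
        "<IIIii", raw
    )
    return {
        "picture_id": picture_id,
        "frame_idx": frame_idx,
        "flags": flags,
        "TopFOC": top_foc,
        "BottomFOC": bottom_foc,
    }


def parse_nal_header(byte):
    """Split the one-byte H.264 NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (byte >> 7) & 0x1,
        "nal_ref_idc": (byte >> 5) & 0x3,
        "nal_unit_type": byte & 0x1F,  # 5 = coded slice of an IDR picture
    }


# First 20 bytes of the CurrPic dump and first 4 bytes of the OUTPUT dump.
curr_pic = bytes.fromhex(
    "00 00 00 04 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 00"
)
output_prefix = bytes.fromhex("00 00 01 25")

pic = parse_va_picture_h264(curr_pic)
nal = parse_nal_header(output_prefix[3])  # byte right after the 00 00 01 start code
print(hex(pic["picture_id"]), pic["TopFOC"], pic["BottomFOC"])
print(nal)
```

Both results agree with the driver's own field reads: the FOC fields decode to 65536 and the NAL header decodes to an IDR slice with nal_ref_idc=1.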

Cross-check against the 2026-05-04 mpv vaapi-copy run

Re-examined phase0_evidence/2026-05-04/mpv_vaapi_copy_2026-05-04.stderr (which we previously called "68 frames decoded cleanly"):

$ grep "CAPTURE\[" mpv_vaapi_copy_2026-05-04.stderr | wc -l
68
$ grep "CAPTURE\[" mpv_vaapi_copy_2026-05-04.stderr | grep -c "ab ab ab ab"
68

68 of 68 mpv CAPTURE buffers show the same sentinel-survives pattern. mpv --vo=null consumed all 68 sentinel buffers as if they were valid NV12 frames; with no real VO to render to, the failure was invisible.
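
The grep above matches any CAPTURE line containing four consecutive ab bytes. A stricter check requires every dumped byte to equal the sentinel. The following is a minimal sketch of that check, assuming the `CAPTURE[idx=N, plane0]: ..` line format shown in the stderr excerpt; the function name is illustrative, and the second sample line is a hypothetical "decoded" buffer.

```python
import re

# Matches lines like:
#   v4l2-request: CAPTURE[idx=0, plane0]: ab ab ab ...
CAPTURE_RE = re.compile(r"CAPTURE\[idx=(\d+), plane0\]:\s*((?:[0-9a-f]{2}\s*)+)$")


def sentinel_survivors(log_text, sentinel=0xAB):
    """Return (survivors, total) over all CAPTURE dump lines in the log.

    A buffer counts as a survivor only if every dumped byte still equals
    the sentinel, which is stricter than grepping for a short ab run.
    """
    survivors = 0
    total = 0
    for line in log_text.splitlines():
        match = CAPTURE_RE.search(line)
        if not match:
            continue
        total += 1
        data = bytes.fromhex(match.group(2))
        if data and all(b == sentinel for b in data):
            survivors += 1
    return survivors, total


sample_log = "\n".join([
    "v4l2-request: CAPTURE[idx=0, plane0]: " + " ".join(["ab"] * 32),
    # Hypothetical line showing what a written buffer might look like.
    "v4l2-request: CAPTURE[idx=1, plane0]: 10 10 11 12 ab ab ab ab",
])

survivors, total = sentinel_survivors(sample_log)
print(f"{survivors} of {total} CAPTURE buffers still hold the sentinel")
```

A close-out script could fail the run whenever `survivors` is non-zero, which is the check the predecessor campaign was missing.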

OUTPUT bytes for frame 0 are byte-for-byte identical between mpv and Firefox (same IDR slice from same source clip, both via libavcodec). Both consumers feed hantro the same data; hantro silently drops both.

Why the 2026-04-26 STUDY claim survived as long as it did

The claim was "vainfo + mpv probes work end-to-end." This is true at the libva-engagement layer: vainfo enumerates profiles, mpv completes the contract lifecycle, no errors on the request-API path. The check that was missing was pixel-content verification:

  • vainfo doesn't decode — it only enumerates capabilities. Always green.
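  • mpv with --vo=gpu or --vo=vaapi would have shown an obviously wrong picture (if chroma also holds 0xab, an all-0xab NV12 buffer renders as a flat pink-magenta field, not gray and not video), but the test rig in the predecessor campaign was probably the same as ours: SSH + --vo=null.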
  • fourier_attribution cell A (chromium-fourier 149 with Step 1 patches, browser_cpu_median = 54 %) was visually inspected by the operator on a real screen during that campaign. Cell A's bunny WAS visible and playing — but on Brave/Chromium, not on mpv or Firefox. Chromium's V4L2 path may differ (uses its own backend in addition to libva, depending on flags); the cell A success could be a different code path than the one we just traced.
  • The patch-0011 sentinel test was apparently authored to detect this exact failure mode but its output wasn't being grepped in the close-out. The patch series was held to be working based on the contract-trace cleanliness — which we now know is necessary but not sufficient.

Implication: Phase 0 substrate result is "kernel decode broken"

The Phase 0 in-session re-verification (campaign repo commit f15ba8b) overstated the case. The corrected verdict:

  • libva engagement: ✓ on both mpv and Firefox in their respective rigs
  • V4L2-stateless contract trace: ✓ no EINVAL on the request-API path
  • Hantro produces decoded pixel output: ✗ on every frame attempted, by either consumer

Phase 1 boolean-correctness criterion as currently locked says "boolean correctness — libva accepted + providing access to hardware decoder." We reached "libva accepted" but not "providing access to hardware decoder" in any meaningful sense. The criterion should be sharpened to require pixel-content verification, e.g.: "the CAPTURE buffer returned from DQBUF must contain decoded pixel data (the sentinel must have been overwritten); a smoke test of the NV12 luma min/max range across at least one frame must reject the all-0xab pattern."
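
One way to express the proposed smoke test is sketched below. It assumes a raw NV12 buffer with the luma plane first and an optional row stride; the function names and the tiny 4×2 test frames are illustrative, not part of the campaign tooling.

```python
def luma_range(nv12, width, height, stride=None):
    """Return (min, max) of the luma samples of an NV12 buffer.

    Assumes the luma plane comes first, one byte per sample, with rows
    `stride` bytes apart (defaults to `width` when there is no padding).
    """
    stride = stride or width
    needed = (height - 1) * stride + width
    if len(nv12) < needed:
        raise ValueError(f"buffer too small: {len(nv12)} < {needed}")
    lo, hi = 255, 0
    for y in range(height):
        row = nv12[y * stride : y * stride + width]
        lo = min(lo, min(row))
        hi = max(hi, max(row))
    return lo, hi


def passes_pixel_smoke_test(nv12, width, height, stride=None, sentinel=0xAB):
    """Reject only the case where every luma sample still equals the sentinel.

    A flat but legitimately decoded frame (e.g. all-black luma 0x10) passes;
    a genuine flat frame at exactly 0xab would be a false reject.
    """
    lo, hi = luma_range(nv12, width, height, stride)
    return not (lo == hi == sentinel)


# Tiny 4x2 frames: 8 luma bytes followed by 4 interleaved chroma bytes.
untouched = bytes([0xAB] * 12)
black = bytes([0x10] * 8 + [0x80] * 4)

print(passes_pixel_smoke_test(untouched, 4, 2))  # sentinel survived
print(passes_pixel_smoke_test(black, 4, 2))      # flat, but written
```

The test deliberately checks for the sentinel rather than for a minimum luma spread, so that a uniformly dark opening frame is not mistaken for a decode failure.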

Phase 1 lock now needs amendment.

What this means for Phase 6 / Step 1

The deployed Step 1 18-patch series engages the libva path correctly but doesn't make hantro decode. The bug surface is in one of these areas (rough priority order, based on patch-comment hints):

  1. reference_ts not propagated. Patch 0017's commit message: "hantro doesn't read pic_num (uses reference_ts)." Implication: hantro depends on reference_ts being populated correctly to find DPB references for inter prediction. For an IDR (frame 0), reference_ts is irrelevant — but if reference_ts is malformed in a way that breaks SPS parsing, hantro might bail before decode.
  2. DECODE_PARAMS missing slice_header bit_size fields. Patch 0008's open question was explicitly "does hantro tolerate the bit_size fields being zero, or do we need a slice_header() bit-level parser?" The Step 1 series did NOT add a slice_header parser — those fields are zero. Maybe hantro doesn't tolerate that and silently skips decode.
  3. POC sentinel still leaking. Patch 0015's strip happens at the right call sites, but the DEBUG dump (patch 0014) runs before the strip, so the dump shows the raw 65536 values. Verify that the values handed to V4L2 are actually stripped, either by adding a post-strip dump or by reading the V4L2_CID_STATELESS_H264_DECODE_PARAMS control back via VIDIOC_G_EXT_CTRLS just before QBUF.
  4. Level_idc over-allocation interaction. Patch 0013 hardcodes level_idc=51; patch 0018 derives it from Annex A.3 (so for 1080p we'd get level_idc=41). hantro uses level_idc to size DPB/MV buffers. Wrong sizing might allocate too small and drop the decode silently.
  5. CAPTURE format derivation. Patch 0009 removed VIDIOC_S_FMT on CAPTURE per "Hantro derives CAPTURE format from per-request SPS." The G_FMT shows NV12 1920×1088, which looks right, but if the SPS isn't being submitted, the kernel might decode with a different layout (or not target this buffer at all), leaving the CAPTURE bytes we dump untouched.
  6. Other hantro silent-failure paths: V4L2_EVENT_SOURCE_CHANGE (open Q #5 in phase0_findings.md), per-frame timestamp / VIDIOC_S_PARM, missing frame_num / IDR-bit setup in DECODE_PARAMS flags.

The correct Phase 6 starting point is to instrument the kernel side: ftrace events/v4l2/ and events/hantro/ (if exposed), or dmesg for any silent-decode-error messages, while the userspace contract trace runs. That's the actual Phase 3 baseline we need.
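
As a starting point for that instrumentation, the sketch below switches on whole ftrace event groups by writing to their `enable` files. It assumes tracefs is mounted at /sys/kernel/tracing and that the script runs as root on the target. The `hantro` group is an assumption and is skipped if the driver exposes no events; the function name is illustrative.

```python
import tempfile
from pathlib import Path


def enable_trace_groups(tracefs_root, groups=("v4l2", "hantro")):
    """Write "1" to events/<group>/enable for every group that exists.

    Returns the list of groups actually enabled, so the caller can see
    whether a driver-specific group (e.g. hantro) is exposed at all.
    """
    root = Path(tracefs_root)
    enabled = []
    for group in groups:
        knob = root / "events" / group / "enable"
        if knob.is_file():
            knob.write_text("1")
            enabled.append(group)
    return enabled


# Dry run against a fake tree, so the logic can be checked without root.
with tempfile.TemporaryDirectory() as fake_root:
    v4l2_dir = Path(fake_root) / "events" / "v4l2"
    v4l2_dir.mkdir(parents=True)
    (v4l2_dir / "enable").write_text("0")
    dry_run = enable_trace_groups(fake_root)
    dry_run_value = (v4l2_dir / "enable").read_text()

print("enabled in dry run:", dry_run)

# On the target (as root), the real call would be:
#   enable_trace_groups("/sys/kernel/tracing")
# followed by reading /sys/kernel/tracing/trace_pipe while the decode runs.
```

Returning the enabled groups makes the "if exposed" question explicit: if `hantro` is missing from the result, dmesg and the generic v4l2 events are the only kernel-side signals available.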

Artifacts

  • firefox_live.log.{moz_log,child-1.moz_log} — MOZ_LOG output from the live-session run
  • firefox_stderr_live — the .so DEBUG output (only 4 lines because only 1 frame was attempted)
  • firefox_stdout_live — empty
  • strace_146420 — the decode utility process: full V4L2-stateless lifecycle
  • strace_146198, strace_146199, strace_146200, strace_146201, strace_146203 — RDD + content + GPU process traces
  • strace_146147 — Firefox parent
  • strace_146164 — fork-server / GMP-related child

Phase 0 deliverables status (updated)

  • #1 Re-verify failure-mode finding in-session — ✗ AMENDED: contract trace lands cleanly, but kernel produces no decoded pixels. Prior commit f15ba8b overstated the verdict.
  • #2 Step 1 reconciliation — ✓ done in commit 74b3793 on fork master.
  • #3 Firefox configuration end-to-end — ✓ engagement confirmed in live Plasma session; pixel-content failure mode identical to mpv (libva ✓, hantro ✗).
  • #4 Phase 0 baseline anchor — ✗ AMENDED: the captured contract trace describes Step 1's userspace behaviour, not what Phase 6 must reproduce. Phase 6's spec needs to include kernel-side observability (ftrace / dmesg) so we can actually characterize hantro's silent failure.

Phase 1 lock should be deferred until we have a sharpened boolean-correctness criterion (pixel-content verification) and at least a hypothesis about why hantro is silent. Phase 0 isn't done yet.