libva-multiplanar/phase4_iter5_plan.md
marfrit f36c6b040d Iteration 5 Track G complete + Phase 7G verified
firefox-fourier 150.0.1-1.1 rebuilt without --enable-profile-generate=cross
on boltzmann firefox-fourier container (single-pass, ~2h27m). Pkg
68.7 MB, libxul.so 169 MB stripped — 21× smaller than iter3
PGO-instrumented 3.6 GB binary. Installed on ohm via pacman -U.

Phase 7G: 35s autonomous run (no MOZ_DISABLE_RDD_SANDBOX=1):
  ENETDOWN: 0  (sandbox patch holds)
  EINVAL: 0    (iter4 fix holds)
  RDD ProcessDecode: 538 events
  Stream mTime reached: 22.3s
  Decode rate: 0.64× realtime (~2.7× speedup vs PGO-instrumented)

All four iter5 tracks (A+G+B+E) GREEN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 17:37:34 +00:00


Iteration 5 — Phase 4 (plan + execution across 4 tracks)

iter5 locked four tracks at Phase 1: A (DEBUG sweep) + G (PGO-disabled Firefox rebuild) + B (mpv libplacebo segfault) + E (multi-context libva safety). Phase 4 splits into 4A / 4G / 4E / 4B sub-phases.

Track A — DEBUG instrumentation sweep ✓ COMPLETE

Sweep landed in 5 commits (in apply order):

  1. 848fc0c — remove iter3 Y2 v1 + iter4 Y2 v3 + per-control TRY iso from v4l2.c::v4l2_ioctl_controls (-54 lines)
  2. 39498f0 — remove iter4 DPB census + per-entry dump from h264.c::h264_set_controls (-31 lines)
  3. 951233a — remove iter1 patch-0014 ENTER traces from buffer.c, image.c, picture.c, surface.c (-17 lines, 13 call sites)
  4. d3a299b — remove iter1 patch-0010 hex-dumps + patch-0011 sentinel write from picture.c + surface.c (-81 lines)
  5. 843febc — remove iter1 slice_header parse echo + VAPicture byte-dump in h264.c, RequestSyncSurface RETURN/early-exit traces in surface.c, suppress per-frame "Unable to get control(s)" when errno==EACCES (-49 lines net)

Total: ~232 lines of instrumentation removed. Per-frame v4l2-request log noise dropped from ~30+ lines/frame to 0 (only init-time + once-per-resolution-change). Driver source builds clean; 2000-frame stress test (timeout 120s) shows 0 EINVAL, 0 "Unable to" lines, 9 v4l2-request log lines total (all init).
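The log-noise figures above can be re-checked with a small counter. This is a minimal sketch in Python, assuming driver messages carry the literal tag `v4l2-request` and that failures contain the substrings `EINVAL` and `Unable to`. The function name and the sample lines are hypothetical, not taken from a real run.

```python
# Substrings to tally. The tag and error strings are assumptions based on
# the wording of this plan, not a verified log format.
MARKERS = ("v4l2-request", "EINVAL", "Unable to")


def count_markers(lines, markers=MARKERS):
    """Return a dict mapping each marker to the number of lines containing it."""
    counts = {m: 0 for m in markers}
    for line in lines:
        for m in markers:
            if m in line:
                counts[m] += 1
    return counts


# Hypothetical sample: two init-time driver lines, no per-frame noise.
sample = [
    "v4l2-request: init",
    "v4l2-request: OUTPUT format set",
    "AV: frame 1 decoded",
]
print(count_markers(sample))
```

Feeding it the captured stderr of a stress run (one string per line) gives the three counts quoted in the summary in one pass.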

KEPT (justified):

  • POC sentinel strip (h264_strip_ffmpeg_poc_sentinel) — load-bearing for ffmpeg-vaapi consumers
  • slice_header bit-precise parser — load-bearing for hantro hw decode (DECODE_PARAMS bit_size fields)
  • EACCES retry-skip in v4l2_get_controls — load-bearing reflective behavior; one-time announcement message stays
  • "slice_header parse FAILED" log — fires only on decode-blocking errors, not per-frame noise
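The logging policy kept for EACCES (announce once, then stay silent on later frames, while other errors are still reported) can be modelled as below. This is a Python sketch of the policy only. The driver itself is C, and the class name, method name, and message text are hypothetical. It does not model the retry-skip itself or where the real driver stores its "already announced" flag.

```python
import errno


class ControlReadLogger:
    """Models the logging policy only: announce EACCES once, then stay quiet."""

    def __init__(self):
        self.eacces_announced = False
        self.messages = []  # stands in for the driver's log output

    def on_get_controls_error(self, err):
        if err == errno.EACCES:
            if not self.eacces_announced:
                self.eacces_announced = True
                # Hypothetical wording for the one-time announcement.
                self.messages.append("control read denied (EACCES); further reports suppressed")
            return  # every later EACCES is suppressed
        # Any other errno is still reported each time it happens.
        name = errno.errorcode.get(err, str(err))
        self.messages.append(f"Unable to get control(s): {name}")
```

With this policy a 2000-frame run that hits EACCES on every frame produces one log line instead of 2000, while an unexpected EINVAL remains visible.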

Track E — Multi-context libva safety ✓ COMPLETE

Commit b993355 moves LAST_OUTPUT_WIDTH/HEIGHT from process-global static in surface.c to struct request_data.last_output_width/height. The V4L2 device fd is per-driver_data, so this is the correct binding unit (one fd, one current OUTPUT format).

surface_reset_format_cache() signature changed to take a struct request_data *driver_data parameter; one callsite in context.c updated.

Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other statics (formats[], formats_count) are constant lookup tables — no race.
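The effect of the move can be illustrated with a small model. This is a Python sketch, not the driver's C code: `RequestData` stands in for `struct request_data`, `output_format_changed` is a hypothetical helper, and resetting the cache to zero is an assumption about what `surface_reset_format_cache()` does.

```python
from dataclasses import dataclass


@dataclass
class RequestData:
    """Stand-in for struct request_data: one instance per V4L2 device fd."""
    last_output_width: int = 0
    last_output_height: int = 0


def output_format_changed(driver_data, width, height):
    """Return True and update the cache when the OUTPUT size differs (hypothetical helper)."""
    cached = (driver_data.last_output_width, driver_data.last_output_height)
    changed = cached != (width, height)
    if changed:
        driver_data.last_output_width = width
        driver_data.last_output_height = height
    return changed


def surface_reset_format_cache(driver_data):
    """Mirrors the new signature: the cache to clear is passed in explicitly."""
    driver_data.last_output_width = 0
    driver_data.last_output_height = 0


# Two contexts with different resolutions no longer overwrite each other.
ctx_a, ctx_b = RequestData(), RequestData()
output_format_changed(ctx_a, 1920, 1088)
output_format_changed(ctx_b, 1280, 720)
print(output_format_changed(ctx_a, 1920, 1088))  # ctx_a's cache is untouched by ctx_b
```

With a single process-global pair, the second context's 1280x720 would have replaced the first context's cached size, and the final call would wrongly report a format change.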

Verification: two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Sub-second co-launch hits kernel-level fd contention on /dev/video1 (hantro is a single-instance device); cross-process serialization is out of scope for a libva backend.
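A staggered co-launch like the one used for this verification can be scripted as follows. This is a minimal sketch with a hypothetical helper name. It only starts the processes and collects exit codes, and does not inspect decoder output.

```python
import subprocess
import time


def run_staggered(commands, stagger_s):
    """Start each command stagger_s seconds after the previous one; return exit codes in order."""
    procs = []
    for i, cmd in enumerate(commands):
        if i > 0:
            time.sleep(stagger_s)  # delay between consecutive launches
        procs.append(subprocess.Popen(cmd))
    # Wait for all processes; they overlap in time once started.
    return [p.wait() for p in procs]
```

For the Track E check, each command would be an mpv invocation such as `["mpv", "--hwdec=vaapi-copy", "--frames=300", path]` with `stagger_s=2.0`. That command line is an assumption, since the exact one used is not recorded here. Two zero exit codes correspond to both players finishing cleanly.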

Track B — mpv libplacebo --vo=gpu: doesn't reproduce on this consumer ✓

iter3 substrate documented the segfault: Vulkan init fails → mpv falls through to GPU non-vulkan path → 4 frames decode → REQBUFS EBUSY → bizarre CreateSurfaces2 with sizes[1]=1050626 (uninitialized memory) → SIGSEGV.

Empirical re-test on the iter5-end driver (post-A + post-E): mpv --hwdec=vaapi --vo=gpu ran for 32 seconds of stream content (all of --frames=200 and sustained beyond), with 98 dropped frames out of ~768 and zero segfaults, VK_ERROR_DEVICE_LOST, or abort(). The Vulkan-init-failed warnings still appear ("EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED"); that is steady-state on Mali-G52 / Bifrost, which has no PanVk support yet (see the reference_pinetab_no_vulkan.md memory). mpv falls through to GLES via Panfrost.

Phase 5 sonnet review (C4): "implicit fix" overstated — refined to "doesn't reproduce on this consumer pattern." The iter4 + iter5 code changes don't directly close the cap_pool REQBUFS-EBUSY-on-resolution-change path that the iter3 substrate documented. The 32s GLES test path doesn't exercise the probe-with-garbage-dimensions consumer pattern that originally triggered the SIGSEGV. So the failure shape is latent, not closed: a future libva consumer that probes with vaCreateSurfaces(16, 16) between two vaCreateSurfaces(1920, 1088) calls while CAPTURE STREAMON is active could still hit the same path.

The cap_pool drain ordering concern survives as iter6+ candidate. iter5's Track B success criterion ("≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with workaround") is satisfied by the 32s clean run; the named caveat (cap_pool race window still latent under untested consumer patterns) is documented here.

No iter5 code change required for Track B beyond what A + E landed. Phase 5 review C4 framing applied.

Track G — PGO-disabled Firefox rebuild ✓ COMPLETE

PKGBUILD overlay edit replaced the 3-tier PGO sequence with a single-pass optimized build. The PGO profile-collection step needed xvfb-run + display server, which the boltzmann LXC container can't provide.

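The single-pass build was kicked off at the start of iter5 Phase 4G on the boltzmann firefox-fourier container. Status when this section was first written: ~36 minutes in, mid C++ compile phase, with an estimated 30-60 more minutes of compilation, then the mach package step (5-10 min), then transfer to ohm and extraction.

The plan at that point was to deploy to /opt/firefox-fourier/, replacing the iter3 PGO-instrumented binary, with libxul.so expected to shrink from 3.6 GB (PGO-instrumented) to ~150-300 MB (release) and Phase 7G verifying on-ohm playback. The build has since finished; the G entry in the Phase 7 verification list at the end of this document records the actual result.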

Phase 5 sonnet review caveats addressed (in commit c8b6ede)

Phase 5 review came back YELLOW with four caveats. Two were resolved in code (C1, C2) and two in documentation (C3, C4):

  • C1 (Track A incomplete): sweep missed three surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) and a "3F observability" V4L2 readback block in h264.c. Resolved in c8b6ede — additional 107-line removal.
  • C2 (static bool readback_warned): new mutable process-global state introduced inside the readback block. Resolved by removing the readback block entirely (point above).
  • C3 (msync removal pixel-correctness): msync(MS_SYNC|MS_INVALIDATE) was paired with the iter1 hex-dump and removed alongside it. The CAPTURE buffer is read post-DQBUF via copy_surface_to_image (image.c) for vaapi-copy; on a CMA-backed non-coherent setup this could in principle need cache invalidation. Empirical: 2000-frame stress with 0 errors, no visible decode failure. Likely the kernel does DMA sync at DQBUF level. Documented as named caveat — frame-hash spot check could anchor it formally if needed; for now accept based on empirical pass.
  • C4 (Track B "implicit fix" overstated): reframed above as "doesn't reproduce on this consumer pattern." Cap_pool resolution-change race window remains latent under untested consumer probe patterns.
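The frame-hash spot check mentioned under C3 could look like the sketch below. It is a Python outline under stated assumptions: each frame is available as raw bytes, and the hardware (vaapi-copy) and reference decodes use the same pixel format and layout. Function names are hypothetical, and how the frames are dumped is out of scope here.

```python
import hashlib


def frame_digests(frames):
    """SHA-256 hex digest of each frame's raw bytes."""
    return [hashlib.sha256(f).hexdigest() for f in frames]


def first_mismatch(candidate, reference):
    """Index of the first differing frame, or None if both sequences match entirely."""
    pairs = zip(frame_digests(candidate), frame_digests(reference))
    for i, (a, b) in enumerate(pairs):
        if a != b:
            return i
    if len(candidate) != len(reference):
        # One sequence is a strict prefix of the other: report where it ends.
        return min(len(candidate), len(reference))
    return None
```

A stale-cache read after the msync removal would show up as a mismatch index on an otherwise matching stream, which is a stronger signal than "no visible decode failure".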

Phase 4 → Phase 6/7/8 transition

Phase 4 + Phase 5 done for A + E + B. G finished afterwards; its result is recorded in the verification list below.

Phase 7 verification anchored (driver sha256 4bed52ec5d44b389..., post-cleanup):

  • A: 2000-frame mpv vaapi-copy stress, 0 EINVAL, 1 v4l2-request log line, 3.0 KB log (down from 9 lines / 4.4 KB pre-Phase-5-cleanup). Phase 5 C1 caveat resolved.
  • E: 2-process concurrent mpv post-cleanup (300 frames each, 2s stagger), both clean ("Exiting (End of file)").
  • B: 35s mpv --vo=gpu post-cleanup: 31s stream pos, 29 dropped, 0 segfaults. Same shape as pre-cleanup (cap_pool race fires at init, mpv falls back to SW gracefully). Sonnet C4 caveat stands as iter6+ candidate.
  • G: ✓ COMPLETE. firefox 150.0.1-1.1 built (single-pass non-PGO, ~2h27m on boltzmann), 68.7 MB pkg, libxul.so stripped to 169 MB (21× smaller than iter3 PGO-instrumented 3.6 GB). Installed via pacman -U on ohm replacing stock firefox 150.0.1-1. Phase 7G test (35s autonomous run, no MOZ_DISABLE_RDD_SANDBOX=1): ENETDOWN=0, EINVAL=0, 538 RDD ProcessDecode events, 22.3s stream content (0.64× realtime — ~2.7× speedup over PGO-instrumented).
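The ratios quoted for Track G follow from the raw figures. This is a short arithmetic check in Python, assuming 3.6 GB means 3600 MB (decimal units) and that the 35 s run length is the wall-clock denominator. The helper names are hypothetical.

```python
def size_ratio(old_mb, new_mb):
    """How many times smaller the new binary is."""
    return old_mb / new_mb


def realtime_factor(stream_s, wall_s):
    """Seconds of stream content decoded per wall-clock second."""
    return stream_s / wall_s


libxul = size_ratio(3600, 169)        # 3.6 GB taken as 3600 MB (assumption)
rate = realtime_factor(22.3, 35)      # 22.3 s of stream in a 35 s run
implied_old_rate = rate / 2.7         # what a ~2.7x speedup implies for the iter3 build

print(f"{libxul:.1f}x smaller, {rate:.2f}x realtime, implied iter3 rate {implied_old_rate:.2f}x")
```

This gives about 21.3x for the size reduction and 0.64x realtime, and implies the PGO-instrumented build was decoding at roughly 0.24x realtime.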