Files
libva-multiplanar/phase7_iter8_perf_anchor.md
claude-noether 4536dd3283 iter8 + campaign close: Track E GREEN, libva-multiplanar campaign closes
Eight iterations (2026-05-04 → 2026-05-06) close. Operator's primary
goal — Firefox + mpv hardware-decode H.264 on PineTab2 (RK3566 silicon
via hantro/rk3568-vpu DT compatible) end-to-end with sandboxes
enabled — was met at iter6 and is anchored with measured numbers
this iteration.

iter8 perf binding cell (30s per consumer, bbb_1080p30_h264.mp4):
- Firefox-fourier RDD process: 8% CPU during HW decode
- mpv vaapi-copy: 66% CPU vs SW baseline 97% (-31pp, ~32% relative)
- mpv vaapi-dmabuf: silent SW fallback in --vo=null (documented
  limitation; needs a working VO that this hardware doesn't have)
- mpv SW baseline: 97% CPU
- All four configs: zero drops in 30s, decode keeps up with realtime

Phase 5 sonnet review caught 3 issues pre-commit, all fixed:
- pidstat $8 column heuristic broken — replaced with header-driven
  %CPU field detection
- GPU freq median's nested-subshell /dev/stdin pipeline unreliable
  — replaced with temp-file path
- --frames=$((DURATION*30)) hardcoded 30fps — replaced with
  --length=$DURATION (framerate-agnostic)

Phase 1 success criterion: 5/5 gates met.

Tracks dropped (recorded for honest accounting):
- D (upstreaming) — philosophical, AI-slop-buster review climate
- F (DMABUF on OUTPUT) — technical, no consumer exercises it
- MPEG-2 — CPU handles it fine, no user need

Residual carries documented for any future operator:
- STREAMON-on-context-recreate corner case
- Pool-size parameterization
- Fault-inject build for slot-leak recovery
- DMABUF zero-copy mpv perf measurement (needs different harness)
- Firefox-with-HW-disabled SW baseline measurement

Follow-on campaigns chartered separately:
- fourier-fresnel (RK3399 / Pinebook Pro port)
- panvk-bifrost (Vulkan-on-Mali for Bifrost)

Twelve fork commits, three test harnesses, one Firefox sandbox
patch, eight iterations of campaign documentation. All on
git.reauktion.de under claude-noether <claude@reauktion.de> from
iter5 onward.

Campaign closes. Done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:45:04 +00:00

6.1 KiB

Phase 7 — iter8 perf binding cell anchor

Captured 2026-05-06 14:39 CEST on ohm. iter7-end driver sha256 7aa59a1b15e127e0b617e24281b2c9d67c4dfde53e7c0d145e9c55d8bca12444 (the iter7 Phase 7 OUTPUT-pool teardown fix included). Fixture: bbb_1080p30_h264.mp4 (725 MB). Duration per consumer: 30s.

Numbers (raw, no spin)

Consumer CPU% p50 CPU% p90 Drops in 30s p50 frame ms GPU MHz median VmRSS Δ MiB
mpv-vaapi-dmabuf 90.00 146.00 0 200 0.0
mpv-vaapi-copy 66.00 68.00 0 200 0.0
firefox-fourier-hw 8.00 9.00 400 9.7
mpv-sw-baseline 97.00 145.00 0 200 0.0

CPU% measured via pidstat per-second over the 30s window, using the %CPU column (parsed by header lookup). Numbers are per-PID; aarch64 systems express multi-core saturation as values > 100% (one CPU saturated per 100%, four-core max ≈ 400%).

What each row actually measures

mpv-vaapi-dmabuf — silent SW fallback (DOCUMENTED LIMITATION)

The --hwdec=vaapi setting requested DMABUF zero-copy through libva to the configured video output. With --vo=null, mpv has no GPU surface to render to and silently falls through to libavcodec SW decode. Confirmed by absence of "Using hardware decoding" in the consumer log (mpv emits that line when it actually engages a HW decoder).

The CPU numbers (90% p50, 146% p90) are therefore equivalent to the mpv-sw-baseline row (97% p50, 145% p90) within statistical noise — both are doing the same work.

To meaningfully measure the DMABUF zero-copy path, a real VO is required: --vo=gpu (libplacebo) or --vo=drm (kmsro). On PineTab2, --vo=gpu fails because Mali-G52 / Bifrost Vulkan is unsupported on the current Mesa/kernel stack (per memory/reference_pinetab_no_vulkan.md); --vo=drm requires KMS access not available from a sudo'd shell session. Either path would be a separate measurement effort beyond the campaign's scope.

This row is informational, not a useful HW-decode datapoint. The vaapi-copy row is the meaningful HW-decode mpv comparison.

mpv-vaapi-copy — actual HW decode through libva (66% p50)

Confirmed HW decode: consumer log shows "Using hardware decoding (vaapi-copy)". The path is hantro VPU decodes H.264 → libva returns CAPTURE buffer → mpv vaDeriveImage/vaGetImage reads the YUV plane to userspace → mpv-internal converts/buffers/discards (vo=null).

CPU saving vs mpv-sw-baseline (97% → 66%): 31 percentage points, 32% relative reduction. Frame drops: 0 in 30s. Decode is keeping up with realtime.

The remaining ~66% is the userspace readback (vaGetImage) plus mpv's own demuxer, parser, A-V scheduling, and bookkeeping. The HW decode itself contributes negligibly to mpv's CPU — that work happens in the hantro VPU silicon.

firefox-fourier-hw — RDD process during decode (8% p50)

Firefox 150 (iter5-amend with the combined sandbox patch) playing the same fixture via file:// URL with LIBVA_DRIVER_NAME=v4l2_request. Tracked the RDD child process (which holds /dev/video1).

8% RDD CPU during sustained decode. The decode itself runs in the hantro VPU; the RDD process orchestrates ioctls and shuttles surfaces. Browser-tab content process and parent process are not counted in this row; cumulative Firefox CPU is higher (multi-process).

VmRSS delta of 9.7 MiB is the RDD process's growth over 30s — buffer allocations and decode pipeline state.

GPU MHz median = 400 MHz: this is the Mali-G52 GPU, not the hantro VPU (the /sys/class/devfreq/fde60000.gpu/ path is the Mali freq governor). 400 MHz reflects compositor activity (Firefox painting decoded frames into the page); it is NOT a measure of decode work. The actual VPU is on a separate clock not exposed via this devfreq path.

Comparison to a Firefox-without-HW baseline is not in this measurement. A separate run with media.hardware-video-decoding.enabled=false (or equivalent) would give the SW Firefox number. The mpv SW baseline (~97% CPU on the player process) provides a rough lower bound: Firefox's SW path would be at least that demanding, and probably more given Firefox's overall stack overhead.

mpv-sw-baseline — pure SW H.264 decode (97% p50)

mpv --hwdec=no. libavcodec H.264 SW decoder, four cores available. CPU% > 100% means multiple cores are active; p90 = 145% means at peak the decode uses ~1.5 cores worth. Frame drops: 0 — the A55 cluster keeps up with 1080p30 H.264 SW.

GPU MHz median = 200 MHz: the Mali idle/baseline. SW decode doesn't use the GPU.

Honest takeaways

  1. The campaign's primary deliverable is empirically anchored. Firefox HW decode works (RDD at 8% during sustained 30s playback). mpv vaapi-copy works (66% CPU vs 97% SW, -31pp).

  2. The DMABUF zero-copy path is unmeasured here. Not because it doesn't work — but because the only mpv VO that engages it (vo=gpu/vo=drm) is independently broken on this hardware (Mali Vulkan + KMS-from-sudo concerns). A future measurement effort outside this campaign would need a different harness.

  3. CPU savings are real but modest in absolute terms. mpv vaapi-copy saves 31pp out of a 97pp budget — ~32% relative reduction. Firefox saving is much larger (8pp RDD vs an estimated 70-80pp+ SW), driven by Firefox's process model where decode cost concentrates in one process that goes near-zero with HW.

  4. The GPU MHz column tracks Mali, not hantro VPU. It's a misleading datapoint for decode-cost reasoning. If a future measurement effort wants VPU utilization, it must find a separate path (debugfs entries, or /proc/interrupts correlation, or perf counter on the VPU power domain).

  5. All four configurations completed 30s without drops or errors. The iter7-end driver sustains realtime decode. Iter6's REINIT discipline + iter7's slot-leak + cap_pool fixes hold under measurement load.

Reproducibility

tests/run_perf_binding_cell.sh in the fork repo. Re-runnable from any operator-OS PineTab2 with the iter7-end driver installed. Override fixture path as $1, duration as DURATION=N env. The script captures driver sha, fixture sha (implicit via path+size), kernel, hostname, run timestamp.