diff --git a/phase7_iter8_perf_anchor.md b/phase7_iter8_perf_anchor.md
new file mode 100644
index 0000000..cd60c2f
--- /dev/null
+++ b/phase7_iter8_perf_anchor.md
@@ -0,0 +1,68 @@
+# Phase 7 — iter8 perf binding cell anchor
+
+Captured 2026-05-06 14:39 CEST on ohm. iter7-end driver sha256 `7aa59a1b15e127e0b617e24281b2c9d67c4dfde53e7c0d145e9c55d8bca12444` (the iter7 Phase 7 OUTPUT-pool teardown fix included). Fixture: `bbb_1080p30_h264.mp4` (725 MB). Duration per consumer: 30s.
+
+## Numbers (raw, no spin)
+
+| Consumer | CPU% p50 | CPU% p90 | Drops in 30s | p50 frame ms | GPU MHz median | VmRSS Δ MiB |
+|---|---|---|---|---|---|---|
+| mpv-vaapi-dmabuf | 90.00 | 146.00 | 0 | — | 200 | 0.0 |
+| mpv-vaapi-copy | 66.00 | 68.00 | 0 | — | 200 | 0.0 |
+| firefox-fourier-hw | 8.00 | 9.00 | — | — | 400 | 9.7 |
+| mpv-sw-baseline | 97.00 | 145.00 | 0 | — | 200 | 0.0 |
+
+CPU% measured via `pidstat` per-second over the 30s window, using the `%CPU` column (parsed by header lookup). Numbers are per-PID; `pidstat` reports multi-core use as values above 100% (one saturated core per 100%, so the four-core maximum is ≈ 400%).
+
+## What each row actually measures
+
+### mpv-vaapi-dmabuf — silent SW fallback (DOCUMENTED LIMITATION)
+
+The `--hwdec=vaapi` setting requested DMABUF zero-copy through libva to the configured video output. With `--vo=null`, mpv has no GPU surface to render to and silently falls back to libavcodec SW decode. Confirmed by absence of "Using hardware decoding" in the consumer log (mpv emits that line when it actually engages a HW decoder).
+
+The CPU numbers (90% p50, 146% p90) are therefore equivalent to the mpv-sw-baseline row (97% p50, 145% p90) within measurement noise, since both rows are doing the same work.
+
+To meaningfully measure the DMABUF zero-copy path, a real VO is required: `--vo=gpu` (libplacebo) or `--vo=drm` (kmsro). On PineTab2, `--vo=gpu` fails because Mali-G52 / Bifrost Vulkan is unsupported on the current Mesa/kernel stack (per `memory/reference_pinetab_no_vulkan.md`); `--vo=drm` requires KMS access not available from a sudo'd shell session. Either path would be a separate measurement effort beyond the campaign's scope.
+
+**This row is informational, not a useful HW-decode datapoint.** The vaapi-copy row is the meaningful HW-decode mpv comparison.
+
+### mpv-vaapi-copy — actual HW decode through libva (66% p50)
+
+Confirmed HW decode: consumer log shows "Using hardware decoding (vaapi-copy)". The path: the hantro VPU decodes H.264 → libva returns the CAPTURE buffer → mpv reads the YUV planes back to userspace via `vaDeriveImage`/`vaGetImage` → mpv converts, buffers, and discards the frame internally (`--vo=null`).
+
+CPU saving vs mpv-sw-baseline (97% → 66%): **31 percentage points, 32% relative reduction.** Frame drops: 0 in 30s. Decode is keeping up with realtime.
+
+The remaining ~66% is the userspace readback (`vaGetImage`) plus mpv's own demuxer, parser, A-V scheduling, and bookkeeping. The HW decode itself contributes negligibly to mpv's CPU — that work happens in the hantro VPU silicon.
+
+### firefox-fourier-hw — RDD process during decode (8% p50)
+
+Firefox 150 (iter5-amend with the combined sandbox patch) playing the same fixture via `file://` URL with `LIBVA_DRIVER_NAME=v4l2_request`. The tracked process is the RDD child (which holds `/dev/video1`).
+
+8% RDD CPU during sustained decode. The decode itself runs in the hantro VPU; the RDD process orchestrates ioctls and shuttles surfaces. Browser-tab content process and parent process are not counted in this row; cumulative Firefox CPU is higher (multi-process).
+
+VmRSS delta of 9.7 MiB is the RDD process's growth over 30s — buffer allocations and decode pipeline state.
+
+GPU MHz median = 400 MHz: this is the **Mali-G52 GPU**, not the hantro VPU (the `/sys/class/devfreq/fde60000.gpu/` path is the Mali freq governor). 400 MHz reflects compositor activity (Firefox painting decoded frames into the page); it is NOT a measure of decode work. The actual VPU is on a separate clock not exposed via this devfreq path.
+
+**Comparison to a Firefox-without-HW baseline is not in this measurement.** A separate run with `media.hardware-video-decoding.enabled=false` (or equivalent) would give the SW Firefox number. The mpv SW baseline (~97% CPU on the player process) provides a rough lower bound: Firefox's SW path would be at least that demanding, and probably more given Firefox's overall stack overhead.
+
+### mpv-sw-baseline — pure SW H.264 decode (97% p50)
+
+mpv `--hwdec=no`. libavcodec H.264 SW decoder, four cores available. CPU% > 100% means multiple cores are active; p90 = 145% means that at peak the decode uses ~1.5 cores' worth. Frame drops: 0 — the A55 cluster keeps up with 1080p30 H.264 SW.
+
+GPU MHz median = 200 MHz: the Mali idle/baseline. SW decode doesn't use the GPU.
+
+## Honest takeaways
+
+1. **The campaign's primary deliverable is empirically anchored.** Firefox HW decode works (RDD at 8% during sustained 30s playback). mpv vaapi-copy works (66% CPU vs 97% SW, -31pp).
+
+2. **The DMABUF zero-copy path is unmeasured here.** That is not because it doesn't work, but because the only mpv VOs that engage it (`--vo=gpu`, `--vo=drm`) are independently unusable in this setup (Mali Vulkan unsupported; no KMS access from a sudo'd shell). A future measurement effort outside this campaign would need a different harness.
+
+3. **CPU savings are real but modest in absolute terms.** mpv vaapi-copy saves 31pp out of a 97pp budget (~32% relative reduction). The Firefox saving is much larger (8pp RDD vs an estimated 70-80pp+ SW), driven by Firefox's process model, where decode cost concentrates in one process that goes near-zero with HW.
+
+4. **The GPU MHz column tracks Mali, not hantro VPU.** It's a misleading datapoint for decode-cost reasoning. If a future measurement effort wants VPU utilization, it must find a separate path (debugfs entries, or `/proc/interrupts` correlation, or a perf counter on the VPU power domain).
+
+5. **All four configurations completed 30s without errors, and the three mpv rows recorded 0 drops** (Firefox drops were not captured). The iter7-end driver sustains realtime decode. Iter6's REINIT discipline + iter7's slot-leak + cap_pool fixes hold under measurement load.
+
+## Reproducibility
+
+`tests/run_perf_binding_cell.sh` in the fork repo. Re-runnable from any operator-OS PineTab2 with the iter7-end driver installed. Pass the fixture path as `$1` and set the duration with the `DURATION=N` environment variable. The script captures driver sha, fixture identity (implicit via path+size), kernel, hostname, run timestamp.
diff --git a/phase8_iteration8_close.md b/phase8_iteration8_close.md
new file mode 100644
index 0000000..fab0a89
--- /dev/null
+++ b/phase8_iteration8_close.md
@@ -0,0 +1,136 @@
+# Iteration 8 close (Phase 8) — Track E GREEN; campaign closes
+
+Opened 2026-05-06 immediately after iter7 close + post-close research. Locked candidate **E** (performance binding cell) as the sole iter8 track. iter8 was operator-declared as the **campaign-closing iteration**: it anchors the deliverables to measured numbers, then formally closes the campaign.
+
+## Verdict
+
+GREEN with one documented limitation (mpv vaapi-dmabuf path was unmeasured in this run; see Phase 7 anchor for the reason). Campaign formally closes after this iteration.
+
+## What landed
+
+### Fork commit (libva-v4l2-request-fourier)
+
+- `65969da` — `tests/run_perf_binding_cell.sh` (297-line shell harness). Runs four consumer configurations against a fixture for `$DURATION` seconds and captures CPU% (median + p90 from pidstat by-header parsing), GPU freq median (devfreq sysfs polled at 100ms cadence), drops in window (mpv `--term-status-msg`), p50 frame interval (mpv only), VmRSS delta (`/proc/PID/status`). Emits a markdown table with raw numbers per consumer — no aggregation, no improvement ratios, no curated framing.
+
+### Campaign artifacts (libva-multiplanar)
+
+- `phase0_findings_iter8.md` — substrate + Phase 1 lock (E only)
+- `phase7_iter8_perf_anchor.md` — measured-numbers anchor (this iteration's data)
+- `phase8_iteration8_close.md` — this file (iteration close + campaign close)
+
+## Phase 5 sonnet review
+
+APPROVE-WITH-CHANGES. Three findings, all addressed before commit:
+
+1. **pidstat `$8` column heuristic was broken** — the original parser scanned right-to-left for a numeric field, ignored the result, and unconditionally printed `$8` (which is `%usr`, not `%CPU`, on sysstat 12.7). Fixed: header-driven `%CPU` field detection. Robust across sysstat point releases.
+
+2. **GPU freq median used unreliable `/dev/stdin` in nested subshell-over-pipe** — implementation-defined behavior, often returns empty. Fixed: temp-file path.
+
+3. **`--frames=$((DURATION * 30))` hardcoded 30fps** — fixture-hardcoding violation per `feedback_no_fixture_hardcoding.md`. Fixed: `--length=$DURATION` (wall-time bounded, framerate-agnostic).
+
+Plus minor: empty `cpu_pct.log` now emits `ERR` rather than silent `0`, distinguishing measurement failure from "process used no CPU."
+
+## Phase 7 results (raw numbers, 30s per consumer)
+
+| Consumer | CPU% p50 | CPU% p90 | Drops | p50 frame ms | GPU MHz median | VmRSS Δ MiB |
+|---|---|---|---|---|---|---|
+| mpv-vaapi-dmabuf | 90.00 | 146.00 | 0 | — | 200 | 0.0 |
+| mpv-vaapi-copy | 66.00 | 68.00 | 0 | — | 200 | 0.0 |
+| firefox-fourier-hw | 8.00 | 9.00 | — | — | 400 | 9.7 |
+| mpv-sw-baseline | 97.00 | 145.00 | 0 | — | 200 | 0.0 |
+
+Full interpretation in `phase7_iter8_perf_anchor.md`. Headlines:
+
+- **Firefox HW decode**: RDD process at 8% CPU during sustained 30s 1080p30 decode. The work is in the hantro VPU; RDD orchestrates.
+- **mpv vaapi-copy**: 66% CPU vs SW baseline 97% — **31 percentage points, ~32% relative CPU reduction.** The remaining 66% is mpv's userspace readback (`vaGetImage`) + demux/parse/scheduling overhead, not decode.
+- **mpv vaapi-dmabuf**: silent SW fallback in `--vo=null` configuration. The DMABUF zero-copy path requires `--vo=gpu` (libplacebo, unusable on this hardware because Mali-G52 Vulkan is unsupported) or `--vo=drm` (KMS access not available from sudo'd shell). This is a measurement-harness limitation, not a campaign deliverable failure. Documented in the anchor doc.
+- **GPU MHz median column tracks Mali (compositor freq), not hantro VPU.** Misleading for decode-cost reasoning. Future measurement efforts wanting VPU utilization need a separate path.
+
+## Phase 1 success criterion — final per gate
+
+| Criterion | Result |
+|---|---|
+| Reproducible measurement script committed to `tests/run_perf_binding_cell.sh` | ✓ HIT — `65969da` in fork |
+| Anchored numbers captured into a campaign artifact | ✓ HIT — `phase7_iter8_perf_anchor.md` |
+| Honest qualitative interpretation in close doc | ✓ HIT — limitations of the dmabuf measurement path AND of the GPU MHz column documented above |
+| Phase 5 sonnet review confirms script is fixture-agnostic, no fixture-hardcoding, results presented honestly | ✓ HIT — APPROVE-WITH-CHANGES, all 3 findings addressed |
+| Campaign close doc explicitly states "campaign closes" | ✓ HIT — see "Campaign close" section below |
+
+**Joint success: 5/5 gates met.** iter8 closes GREEN.
+
+---
+
+# Campaign close (libva-multiplanar)
+
+After eight iterations spanning 2026-05-04 through 2026-05-06, the libva-multiplanar campaign formally **closes**.
+
+## Operator's primary goal — MET
+
+**Goal**: make Firefox + mpv hardware-decode H.264 video on PineTab2 (RK3566 silicon, hantro driver via the `rockchip,rk3568-vpu` DT compatible) end-to-end, with sandboxes enabled, on the v4l2_request libva backend.
+
+**Met at iter6** (Firefox sustained 50s+ on bbb fixture without errors, RDD process holds `/dev/video1` + `/dev/media0` throughout, lsof verified). iter5-amendment closed the Firefox sandbox gap; iter6 closed the per-OUTPUT-slot REINIT race that broke Firefox's MediaSource pipeline; iter7 hardened the carry items (msync verify, slot-leak recovery, cap_pool race harness).
+
+iter8 anchors the empirical claim with measured numbers (this iteration).
+
+## Iteration outcomes
+
+| Iter | Locked tracks | Outcome | Date closed |
+|---|---|---|---|
+| 1 | initial multi-planar bring-up | iter1 known bugs identified + 3-fix list | 2026-05-04 |
+| 2 | A+B+E (the three iter1 known bugs) | mpv vaapi DMA-BUF "smooth" per operator inspection | 2026-05-04 |
+| 3 | F+A (Firefox sandbox + frame-11 EINVAL diagnosis) | F GREEN with patch; A diagnosed (fix deferred) | 2026-05-05 |
+| 4 | A solo (frame-11 EINVAL fix) | GREEN — fresh request_fd per frame, DPB FFmpeg-semantics matching | 2026-05-05 |
+| 5 | A+G+B+E (sweep + PGO Firefox + libplacebo + multi-context) | GREEN, all four | 2026-05-05 |
+| 5-amend | iter5-G Firefox seccomp gap surfaced in real use | iter3 patch extended to UtilitySandboxPolicy; sandbox closes | 2026-05-05 |
+| 6 | I→A∪I (Firefox QBUF EINVAL → cap_pool race merger) | GREEN — per-OUTPUT-slot REINIT discipline replaces close+alloc | 2026-05-05 |
+| 7 | A+B+C (msync verify + slot-leak recovery + cap_pool harness) | GREEN — all three; bonus OUTPUT-pool teardown fix | 2026-05-06 |
+| 8 | E (perf binding cell) | GREEN — numbers anchored | 2026-05-06 |
+
+Eight iterations. Twelve fork commits since the campaign opened (against the bootlin baseline). Three test harnesses in `tests/`. One firefox-fourier patch (combined broker + RDD seccomp + Utility seccomp, three gates closed).
+
+## Tracks dropped + reasons
+
+- **Track D** (Bootlin / Mozilla upstreaming) — dropped 2026-05-06 on philosophical grounds. The AI-slop-buster review climate in 2026 maintainership makes the social cost of submission exceed the benefit when personal requirements are met. See `memory/project_no_upstreaming_philosophical.md` for the operator-verbatim rationale.
+
+- **Track F** (V4L2_MEMORY_DMABUF on OUTPUT) — dropped 2026-05-06 on technical merit. Sonnet architect research found: every production V4L2 stateless H.264 consumer (FFmpeg, GStreamer, Chromium) uses MMAP on OUTPUT. Kernel-side DMABUF capability is advertised by hantro/rkvdec but unexercised for H.264. The original cap_pool race justification was closed organically by iter5/iter6/iter7 fixes. See `track_F_research_2026-05-06.md`.
+
+- **Multi-codec (MPEG-2)** — dropped 2026-05-06 at iter6 close. CPU handles MPEG-2 trivially on the A55 cluster; the campaign's user audience doesn't need MPEG-2 HW path.
+
+## Residual carries (low-priority, listed for any future operator picking this up)
+
+1. **STREAMON-on-context-recreate after resolution change** — corner case surfaced by the iter7 cap_pool harness when sonnet's pre-commit suggestion to add `vaCreateContext` was tested. Real consumers (Firefox, mpv) don't trigger this — they create one context per decoder lifetime. iter7 reverted the test to the no-context iter5 sonnet C4 specification; the new bug stays latent.
+
+2. **Pool-size parameterization** — iter6 sonnet review suggested `max(surfaces_count, DPB_SIZE_H264_MAX)` instead of the hardcoded 16. Empirically 16 is fine; not a current bottleneck.
+
+3. **Fault-injection build for slot-leak Track B recovery** — Phase 1 success criterion B is only partially met: sonnet code-reviewed the semantics, and the happy-path regression was confirmed clean. A debug build with `-DITER7_FAULT_INJECT_REINIT` would close the gap empirically. Deferred unless concretely needed.
+
+4. **DMABUF zero-copy mpv perf measurement** — iter8 harness couldn't measure this path (`--vo=null` falls back; `--vo=gpu` blocked by Mali-G52 Vulkan unavailability; `--vo=drm` blocked by sudo-shell KMS access). A dedicated harness running as a desktop-session user with a working VO would close this.
+
+5. **Firefox-with-HW-disabled SW baseline** — iter8 only measured Firefox's HW path. A complementary Firefox-SW row would frame the saving precisely (estimated 60-80pp+ saving extrapolated from mpv-SW vs mpv-HW).
+
+## Memory inheritance for future campaigns
+
+The campaign-specific memory at `~/.claude/projects/-home-mfritsche-src-libva-multiplanar/memory/` contains 12 entries. Most relevant for follow-on work:
+
+- `feedback_request_fd_lifecycle.md` — REINIT-vs-close+alloc lessons (iter4 → iter6 evolution)
+- `feedback_kernel_obfuscation_compound.md` — V4L2 cluster validation pattern (per-control TRY isolation as the diagnostic escape)
+- `feedback_seccomp_returns_enosys.md` — Firefox sandbox debugging signature
+- `reference_ffmpeg_v4l2_request_is_authority.md` — FFmpeg's H.264 control semantics as the empirical authority
+- `feedback_no_fixture_hardcoding.md` — test-harness honesty principle
+- `project_no_upstreaming_philosophical.md` — campaign-defining stance recorded 2026-05-06
+
+## Follow-on campaigns (chartered iter5, separate top-level)
+
+- **`fourier-fresnel`** — port the fork from PineTab2 (RK3566 via hantro/rk3568-vpu) to fresnel RK3399 (Pinebook Pro). Validates iter1-iter8 fixes on a second hardware target. Charter at `~/src/libva-multiplanar/firefox-fourier/...` or as a fresh repo when opened.
+- **`panvk-bifrost`** — Vulkan-on-Mali for Bifrost-gen GPUs (Mali-G52 on RK3566/RK3568, etc.). Document-only at `~/src/panvk-bifrost/`. Sequenced after fourier-fresnel.
+
+Per `project_followon_campaigns.md`, neither opens without explicit operator instruction.
+
+## Final state
+
+- **Driver**: `libva-v4l2-request-fourier` HEAD `65969da`, installed `.so` sha256 `7aa59a1b...`. iter1..iter8 substantive work landed across 12 commits.
+- **Firefox**: `firefox-150.0.1-1.1` with the iter5-amendment combined sandbox patch (broker + RDD seccomp + Utility seccomp). Build infrastructure: `firefox-fourier` LXD on boltzmann, persistent.
+- **Test harnesses**: 3 in `tests/` — `cap_pool_probe_pattern.c` (race regression), `run_msync_pixel_verify.sh` (pixel-correctness), `run_perf_binding_cell.sh` (perf anchor).
+- **Campaign documentation**: `phase0_findings_iter[1..8].md`, `phase8_iteration[1..8]_close.md`, plus per-iteration situation/plan/review docs as needed. All committed to `git.reauktion.de:marfrit/libva-multiplanar`, under `claude-noether` from iter5 onward.
+
+The campaign closes with a working deliverable, anchored numbers, and an honest accounting. **Done.**
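The header-driven `%CPU` lookup and the `ERR`-on-empty behaviour described in the Phase 5 review can be sketched in Python as follows. This is a minimal illustration, not the code in `tests/run_perf_binding_cell.sh` (which is a shell harness): the function names, the synthetic `SAMPLE` text, and the nearest-rank definition of p90 are all assumptions made for the example.

```python
import math
import statistics

# Synthetic pidstat-like text (hypothetical values, not campaign data).
SAMPLE = """\
Linux 0.0.0 (example-host)  01/01/70  _aarch64_  (4 CPU)

14:39:01  UID  PID  %usr  %system  %guest  %wait  %CPU  CPU  Command
14:39:02  1000  4242  60.00  6.00  0.00  0.00  66.00  2  mpv
14:39:03  1000  4242  61.00  7.00  0.00  0.00  68.00  1  mpv
14:39:04  1000  4242  59.00  5.00  0.00  0.00  64.00  3  mpv
Average:  1000  4242  60.00  6.00  0.00  0.00  66.00  -  mpv
"""


def parse_cpu_samples(lines):
    """Collect %CPU samples, locating the column from the header row."""
    cpu_idx = None
    samples = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "Average:":
            continue  # skip blank lines and the trailing summary rows
        if "%CPU" in fields:
            cpu_idx = fields.index("%CPU")  # header row: remember the column
            continue
        if cpu_idx is None or len(fields) <= cpu_idx:
            continue  # banner line, or a row too short to carry the column
        try:
            samples.append(float(fields[cpu_idx]))
        except ValueError:
            continue  # non-numeric cell: ignore rather than guess
    return samples


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile (an assumed definition for this sketch)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Return p50/p90 as table strings, or ERR when nothing was measured."""
    if not samples:
        return {"p50": "ERR", "p90": "ERR"}
    return {
        "p50": f"{statistics.median(samples):.2f}",
        "p90": f"{percentile_nearest_rank(samples, 90):.2f}",
    }


if __name__ == "__main__":
    print(summarize(parse_cpu_samples(SAMPLE.splitlines())))
```

Looking the column up by name rather than by position is what keeps the parse stable if a sysstat release adds or reorders columns, and returning `ERR` for an empty sample list keeps a failed measurement from being reported as zero CPU.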