diff --git a/phase2_situation.md b/phase2_situation.md
new file mode 100644
index 0000000..cfc126f
--- /dev/null
+++ b/phase2_situation.md
@@ -0,0 +1,72 @@
+# Phase 2: Situation analysis (iter1)
+
+Context reset 2026-05-16 09:20. ampere uptime 2 min (post-reboot to clear the HEVC-OOPS-induced v4l2_mem2mem wedge captured in Phase 0). This phase re-verifies all of Phase 0's substrate against the live system and identifies the constraints, dependencies, and known failure modes that the Phase 4 plan must respect.
+
+## Re-verified substrate (against live system)
+
+- **ampere reachable** (`ssh ampere` returns within 1s; uptime 2 min, kernel `7.0.0-rc3-devices+`).
+- **Backend binary intact** as the hand-built copy: `/usr/lib/dri/v4l2_request_drv_video.so` md5 `0c9a7efaab6a31b74536ac1abff18dfa`, 484 800 B. Same as Phase 0: the reboot didn't bring the broken CI build back, because the file on disk is still the hand-built copy, although `pacman -Qo` still reports the package as its owner.
+- **All 5 source clips intact** in `~/measurements/encoded/`. Iter1 uses 3 of them (h264, vp8, mpeg2). The hevc clip stays untouched: opening it through the libva backend on this kernel triggers the OOPS cascade.
+- **Memory list unchanged** since fresnel iter38 close. No new entries that would invalidate iter1 assumptions.
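The md5 and size check on the backend binary can be scripted so that it is repeatable before each phase. This is a minimal sketch, assuming Python 3 is available on ampere (not stated above). The helper name `backend_matches` is hypothetical; the path, md5, and byte count are the values recorded in the bullet above.

```python
import hashlib
import os

# Values recorded in this phase for the hand-built backend.
BACKEND_PATH = "/usr/lib/dri/v4l2_request_drv_video.so"
EXPECTED_MD5 = "0c9a7efaab6a31b74536ac1abff18dfa"
EXPECTED_SIZE = 484800  # bytes


def backend_matches(path=BACKEND_PATH, md5=EXPECTED_MD5, size=EXPECTED_SIZE):
    """Return True only if the file at `path` has the expected size and md5.

    Hypothetical helper: it checks file content, not pacman ownership records.
    """
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) != size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so the whole file is never held in memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5
```

Calling `backend_matches()` with no arguments checks the installed file against the recorded values; a `False` result would be the cue to redeploy the hand-build before running any decode.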
+
+## Instrument prerequisites
+
+| Tool | Present | Purpose |
+|------|---------|---------|
+| `ffmpeg` (from `ffmpeg-v4l2-request-fourier`) | yes | C1, C3-C5 driver |
+| `mpv` (from `mpv-fourier 1:0.41.0-10`) | yes | secondary HW path check (`--hwdec=v4l2request-copy`) |
+| `vainfo` | yes | C2 sanity (profile list) |
+| `lsof` | yes | C2: show open `/dev/video*` fds during decode |
+| **`strace`** | **MISSING** | C2: confirm `VIDIOC_S_EXT_CTRLS` / `MEDIA_REQUEST_IOC_QUEUE` ioctls on the right fds |
+| `coredumpctl` | yes | residual-crash audit |
+| `dmesg` | yes | C6: clean-dmesg check |
+| **`firefox-fourier`** | **MISSING** | C7: vendor-default pref engagement test |
+
+Two installs are needed before Phase 3 can fully run. Both are pacman-installable: `firefox-fourier` from the marfrit aarch64 repo, `strace` from Arch core.
+
+## Constraints
+
+1. **Backend file is pacman-owned but content-unmanaged.** `pacman -Qo /usr/lib/dri/v4l2_request_drv_video.so` reports it owned by `libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1`, but the file is the hand-built `0c9a7efa…` (485 KB), not the broken CI build `ae611d80…` (133 KB) the package shipped. Any `pacman -Syu` or `pacman -S libva-v4l2-request-fourier` would silently overwrite it with the broken binary. Treat this as fragile state: avoid full-system upgrades for the duration of iter1, and re-`md5sum` the file in every phase that's about to use it. Same constraint as fresnel (per the `marfrit-packages#17` thread).
+2. **Booted kernel hand-managed in `/boot/firmware/` extlinux entries.** The `linux-ampere-fourier 7.0rc3.kafr1-1` package shipped `Image-7.0.0-rc3-ARCH+` plus the matching DTB and initramfs into `/boot/firmware/`, but the boot path itself is the existing extlinux config with `default arch_mainline`. A pacman update of the kernel would re-deposit files, but the boot record stays as-is until something flips the default. Same operational shape as fresnel.
+3. **HEVC OOPS cascades through `v4l2_mem2mem`.** Once `rkvdec_hevc_prepare_hw_st_rps` faults on `__pi_memcmp`, *every* subsequent V4L2 m2m queue request from any device (rkvdec H.264, hantro VP8, hantro MPEG-2) blocks in futex wait inside libva. The blocking is recoverable only by reboot. Iter1 must not include HEVC in any decode batch; even one HEVC call wedges the whole rig and invalidates the rest of the sweep.
+4. **`--hwdec=vaapi` (zero-copy) needs a GL/Vulkan surface.** Headless ssh sessions have no surface, so plain `--hwdec=vaapi` falls back to SW silently. Iter1 mpv invocations use `--hwdec=vaapi-copy` or `--hwdec=v4l2request-copy`; both copy decoded frames to CPU memory and don't need a display. (Established on fresnel; recorded as a config rule for the campaign.)
+5. **mpv default `--hwdec-codecs` excludes MPEG-2.** Must pass `--hwdec-codecs=all` for the MPEG-2 path to even attempt HW. (Same fresnel finding.)
+6. **Reboot authorization in scope.** Per the Phase 0 operator decision, reboot is authorized for ampere when the v4l2 stack wedges. No fresh prompt needed inside iter1.
+7. **fresnel may go offline.** Pinebook Pro lid-close suspends, so iter1 can't depend on fresnel being reachable mid-iteration. Iter1 doesn't need fresnel at all (everything happens on ampere), so this is just a guard against accidentally depending on cross-host fetches.
+
+## Dependencies (external-to-this-iteration)
+
+| Dependency | Status | Risk if it changes |
+|------------|--------|--------------------|
+| `linux-ampere-fourier 7.0rc3.kafr1-1` (kernel) | installed, booted | A pacman update could replace `/boot/firmware/Image-7.0.0-rc3-ARCH+` while the OOPS-prone HEVC code path is still present, which would invalidate Phase 3 measurements if they are rerun on a different kernel. Don't `pacman -Syu` during iter1. |
+| `libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1` (backend) | hand-built `0c9a7efa…` installed over the package | Same; see Constraint 1. |
+| `ffmpeg-v4l2-request-fourier 2:8.1.r123329.b57fbbe-4` | installed | If updated mid-iter1, hwaccel behavior could shift. Pin for the duration. |
+| `mpv-fourier 1:0.41.0-10` | installed | Pin. |
+| `libva 2.23.0-1` | installed | Pin. |
+| `linux-api-headers 6.19-1` | installed | Pin (V4L2 control struct layouts come from here). |
+| Source clips in `~/measurements/encoded/` | present | Re-encoding would change file contents; treat as immutable for iter1. |
+
+## Known failure modes (from prior fleet experience + Phase 0)
+
+1. **HEVC kernel OOPS + v4l2_mem2mem wedge** (Phase 0 finding). Avoidance: skip HEVC.
+2. **libva backend HEVC silent failure when the broken CI package binary is in use** (`marfrit-packages#17`). Avoidance: re-`md5sum` before Phase 3; redeploy hand-build if md5 doesn't match.
+3. **mpv `--hwdec=vaapi` fallback to SW with no GL surface** (fresnel finding). Avoidance: use `-copy` variants headless.
+4. **mpv MPEG-2 not on default hwdec allow-list** (fresnel finding). Avoidance: pass `--hwdec-codecs=all`.
+5. **Firefox empty-profile vendor-default activation depends on `widget.dmabuf.force-enabled` being shipped in `rockchip-fourier-defaults.js`** (fresnel finding, marfrit-packages#8). Confirmed shipped in `firefox-fourier 150.0.1-5`. Install that exact version on ampere; do not regress to an earlier rev.
+6. **vaDeriveImage / cached-mmap returns all-zero on RK3399 hantro CAPTURE** (memory `reference_dmabuf_resv_blocker`). It is an open question whether the same applies to RK3588 hantro CAPTURE. If Phase 3 sees zero-byte HW output on VP8/MPEG-2, the same workaround chain (DMA-BUF GL via `mpv --vo=image`, or ffmpeg-v4l2request DRM_PRIME) applies, and the byte-compare anchor C3 must be done via DMA-BUF, not cached-mmap. Surface this if it shows up; don't pre-compensate.
+7. **`pacman -Qo` lies about content** (Constraint 1 generalized). Trust `md5sum`, not pacman ownership records.
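Failure modes 1, 3, and 4 all come down to how a decode is invoked, so their avoidances can be folded into one guard. This is a minimal sketch, assuming a Python wrapper drives mpv. `build_mpv_argv` and the clip name in the usage note are hypothetical, and `--vo=null` / `--no-audio` are added assumptions for headless runs, not flags recorded in this phase.

```python
ALLOWED_HWDEC = ("vaapi-copy", "v4l2request-copy")  # headless-safe variants
ITER1_CODECS = ("h264", "vp8", "mpeg2")             # HEVC deliberately absent


def build_mpv_argv(clip, codec, hwdec="vaapi-copy"):
    """Build an mpv command line that respects the iter1 avoidances."""
    if codec not in ITER1_CODECS:
        # Covers HEVC: a single HEVC decode wedges v4l2_mem2mem until reboot.
        raise ValueError(f"codec {codec!r} is out of scope for iter1")
    if hwdec not in ALLOWED_HWDEC:
        # Plain --hwdec=vaapi falls back to SW silently without a GL surface.
        raise ValueError(f"hwdec {hwdec!r} is not a headless-safe -copy variant")
    return [
        "mpv",
        f"--hwdec={hwdec}",
        "--hwdec-codecs=all",  # MPEG-2 is not on mpv's default allow-list
        "--vo=null",           # assumption: discard video output when headless
        "--no-audio",          # assumption: audio is irrelevant to the sweep
        clip,
    ]
```

A call such as `build_mpv_argv("clip.h264.mkv", "h264", hwdec="v4l2request-copy")` returns a list that can be handed to `subprocess.run`. Passing the codec explicitly, rather than guessing it from the filename, keeps the HEVC refusal independent of how the clips happen to be named.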
+
+## Things this iteration does NOT depend on
+
+To preempt "should we do X first" doubts during Phase 3-7:
+
+- HEVC working: explicitly out of scope per Phase 1.
+- VP9 working: out of scope.
+- AV1 working: out of scope.
+- `kdirect` rig on RK3588: Phase 1 picked SW-compare as the bit-exact anchor instead.
+- Cross-host byte-compare against fresnel: Phase 0 carryover rule is that numbers don't carry across hosts. Codec-byte-identity on RK3399 doesn't predict identity on RK3588 because the HW IDCT / deblock implementations differ between chip generations.
+- DokuWiki for Phase 5 review: for iter1, Phase 5 uses an in-session `Plan` subagent with `model: sonnet` instead, per the campaign README. DokuWiki is the operator's archival channel, not a hard dependency.
+
+## Phase 2 close
+
+Substrate re-verified against live system; two missing tools (`strace`, `firefox-fourier`) identified as Phase 4 plan items; seven constraints and seven known failure modes catalogued. Ready for Phase 3 baseline instrumentation.