diff --git a/phase3_close.md b/phase3_close.md
new file mode 100644
index 0000000..615c93f
--- /dev/null
+++ b/phase3_close.md
@@ -0,0 +1,73 @@
+# ampere-av1-enablement Phase 3 close — AV1 PASS on all-intra; film_grain+show_existing edge case localized
+
+Closed 2026-05-17 evening. Phase 2.1 + Phase 3 substantial iteration. Substrate state on ampere: backend tip `c839b94` on `av1-iter1` branch (matches `av1-iter1-imported` on noether). Module installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Kernel `7.0.0-rc3-devices+` unchanged.
+
+## Test matrix
+
+| Fixture | Source | Frames | Resolution | Profile | Result |
+|---|---|---|---|---|---|
+| `test_av1.ivf` | AOM `av1-1-b8-01-size-208x208` (also `/tmp/test_av1.ivf`) | 2 | 208×208 | Main, 8-bit, no grain | **bit-exact PASS 2/2** (sha `029ee72c214b37c1`) |
+| `av1-1-b8-02-allintra.ivf` | AOM | 39 | 352×288 | Main, 8-bit, all-intra | **bit-exact PASS 10/10** (first 10 frames sampled) |
+| `av1_larger.ivf` / `av1-1-b8-23-film_grain-50.ivf` | AOM | 10 | 352×288 | Main, 8-bit, film_grain, show_existing_frame | **3/10 PASS** (frames 0, 2, 4 — apply_grain=1 IDR-derived) |
+| `av1-1-b10-23-film_grain-50.ivf` | AOM | 10 | 352×288 | Main, 10-bit, film_grain | both libva AND kdirect produce 0 bytes — vpu981 may not support 10-bit AV1 |
+
+## What works (validated on hardware)
+
+- **AV1 dispatch + control submission**: `SEQUENCE`, `FRAME`, `TILE_GROUP_ENTRY`, `FILM_GRAIN` all submitted correctly; strace shows kernel accepts every batch.
+- **All four V4L2 controls byte-identical to kdirect** for the first 7 EndPicture calls (verified via patched `libavcodec.so` LD_LIBRARY_PATH override that adds an fwrite diag to `ff_v4l2_request_append_output`).
+- **DPB / reference frame timestamp plumbing**: VAAPI `ref_frame_map[i]` (surface IDs) → `SURFACE()` lookup → `v4l2_timeval_to_ns(&ref_surface->timestamp)`.
+- **Film grain link infrastructure**: when `apply_grain=1`, `current_display_picture != current_frame`; we link the display surface to the decode surface so `vaGetImage` on the display surface follows back to the decode surface's CAPTURE slot.
+- **`refresh_frame_flags = 0xff`**: VAAPI doesn't expose; default 0xff matches AV1 spec for KEY/SWITCH frames and kdirect's submission.
+- **`ENABLE_SUPERRES` gated on `picture->pic_info_fields.bits.use_superres`**: matches kdirect; was previously set unconditionally.
+- **Per-surface AV1 `order_hint` tracking**: surfaces carry `av1_order_hint` set at decode time; referenced surfaces' values populate the V4L2 ctrl's `order_hints[]`.
+- **F1/F2/F3 risk mitigations from the Janet plan v2 review**: `mi_col/row_starts` sentinel, `superres_denom` correct, `loop_restoration_size[]` gated on USES_LR — all applied.
+
+## What's open (Phase 4 territory)
+
+Remaining 7/10 divergence on `av1_larger.ivf` localized to:
+
+- ffmpeg-vaapi's AV1 hwaccel issues 2 EXTRA `vaEndPicture` calls on **REUSED surfaces** (`0x4000008` repeated at t8, `0x4000006` repeated at t9) compared to ffmpeg-v4l2request's 7 calls for the same input.
+- The reused-surface pattern correlates with the IVF having `show_existing_frame` OBUs at frame positions 2, 4, 6 (each just 5 bytes — "redisplay frame X").
+- Our `iter2 Fix 3` (release-on-rebind) invariant: 1 surface ↔ 1 cap_pool slot at a time. When ffmpeg-vaapi rebinds, prior CAPTURE data is gone.
+- **Falsified**: `LIBVA_SKIP_REBIND=1` experiment (do not unbind in BeginPicture, leak the old slot) produced the same 3/10 PASS count as the default behavior. So `iter2 Fix 3` is NOT the cause; the issue is deeper in how the surface→slot accounting interacts with ffmpeg-vaapi's surface reuse.
+
+### Hypothesized fix paths (Phase 4)
+
+1. **Multi-surface-per-slot cap_pool refactor**: track {slot → set of surfaces} so when a surface is re-bound, the slot can still serve `vaGetImage` for the surface IDs that previously bound it. Bigger refactor than this iteration.
+2. **Surface-ID identity tracking via picture parameters**: snoop AV1's `current_frame` / `current_display_picture` across frames to detect when ffmpeg-vaapi means "render this prior frame again" vs "decode a new frame", and dispatch differently. Requires understanding ffmpeg-vaapi's AV1 hwaccel surface allocation logic.
+3. **ffmpeg-vaapi source fix**: modify ffmpeg-vaapi to use distinct surfaces for show_existing_frame display rather than reusing decode surfaces. Cross-package; rejected as default first move.
+
+## Commits delivered this iteration
+
+On `av1-iter1` branch (both ampere `/home/mfritsche/src/libva-v4l2-request-fourier/` + noether `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/`):
+
+```
+c839b94 ampere-av1 Phase 3 finding: iter2 Fix 3 release is NOT the divergence cause
+d7ef0f6 ampere-av1 Phase 3: SEQUENCE byte-equal kdirect; 3/10 frames PASS bit-exact
+5803cbc ampere-av1 Phase 3 progress: film_grain link + UPDATE_GRAIN; frame 0 bit-exact
+ab79ed5 ampere-av1 Phase 3 in-progress notes: UPDATE_GRAIN segfault; 352x288 still 0-byte
+5fb7e36 ampere-av1 Phase 3 fix: wire reference_frame_ts[] from VAAPI ref_frame_map[]
+85bcddb v4l2: surface error_idx + errno on VIDIOC_S_EXT_CTRLS failure
+9c30ecc ampere-av1 Phase 2.1: implement av1_set_controls body (~500 LoC)
+78a9978 ampere-av1 Phase 2 step 4: AV1 dispatch scaffolding compiles and wires
+61db76e ampere-av1 Phase 2 step 2: advertise VAProfileAV1Profile0 via libva
+bed75c0 ampere-av1 Phase 2 step 1: third-device fd scaffolding for vpu981
+```
+
+Branch is NOT yet pushed to gitea (per pre-amnesia me's "ssh sideband disconnect on vp9 branch" note — manual push attempt deferred).
+
+## Verifier scripts retained
+
+- `/tmp/diff_av1_ctrls.py` on ampere: per-CID byte diff between two strace logs. Decodes octal-escaped strings, matches the same ctrl across calls, prints byte-level diffs.
+- `/tmp/ivf_split.py` on ampere: splits an IVF file into per-frame `.bin` files. Reveals AV1 show_existing_frame OBUs as 5-byte stub frames.
+- patched `libavcodec.so.62.28.100` shadow build on boltzmann at `~/marfrit-packages/arch/ffmpeg-v4l2-request-git/src/FFmpeg/libavcodec/libavcodec.so.62`; the `.bak` source was restored to clean — re-apply via:
+  ```sh
+  ssh boltzmann 'sed -i "/memcpy(pic->output->addr + pic->output->bytesused, data, size);/a\\ do { static unsigned int __dump_idx = 0; char __p[256]; snprintf(__p, sizeof(__p), \"/tmp/K_dump_kdirect/append_%04u_size%u.bin\", __dump_idx++, size); FILE *__f = fopen(__p, \"wb\"); if (__f) { fwrite(data, 1, size, __f); fclose(__f); } } while (0);" ~/marfrit-packages/arch/ffmpeg-v4l2-request-git/src/FFmpeg/libavcodec/v4l2_request.c'
+  ```
+  then `make -j8 libavcodec/libavcodec.so` + scp to ampere's `/tmp/lib_patched/`, run kdirect with `LD_LIBRARY_PATH=/tmp/lib_patched:$LD_LIBRARY_PATH`.
+
+## Stance
+
+Phase 3 closes with **AV1 hardware decode WORKING for the common cases**: simple intra-only (10/10), grain-free inter (test_av1), and grain-IDR frames (3/10 on av1_larger). The remaining edge case (grain + show_existing inter frames) is a real divergence but is now narrowly localized — its fix space is understood (cap_pool refactor or ffmpeg-vaapi surface tracking), and a future iteration can pick it up with the full diagnostic infrastructure already in place.
+
+Task #21 carries the resumption details.
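
For readers without access to ampere, the IVF splitting step listed under the verifier scripts can be approximated as follows. This is a minimal sketch, not the contents of `/tmp/ivf_split.py` (its source is not reproduced in this note). It assumes the standard IVF layout: a 32-byte file header starting with `DKIF`, then per frame a 12-byte header holding a little-endian 32-bit payload size and a 64-bit timestamp. The function names and the `frame_NNNN.bin` output naming are hypothetical.

```python
import struct
from pathlib import Path

# IVF file header: signature, version, header length, fourcc, width, height,
# timebase rate, timebase scale, frame count, 4 unused bytes (32 bytes total).
IVF_FILE_HEADER = struct.Struct("<4sHH4sHHIII4x")
# IVF frame header: payload size (u32 LE), presentation timestamp (u64 LE).
IVF_FRAME_HEADER = struct.Struct("<IQ")


def iter_ivf_frames(data):
    """Yield (index, pts, payload) for every frame in an IVF byte string."""
    fields = IVF_FILE_HEADER.unpack_from(data, 0)
    signature, header_len = fields[0], fields[2]
    if signature != b"DKIF":
        raise ValueError("not an IVF file (missing DKIF signature)")
    offset = header_len  # normally 32; trust the value stored in the header
    index = 0
    while offset + IVF_FRAME_HEADER.size <= len(data):
        size, pts = IVF_FRAME_HEADER.unpack_from(data, offset)
        offset += IVF_FRAME_HEADER.size
        payload = data[offset:offset + size]
        if len(payload) != size:
            raise ValueError(f"frame {index} is truncated")
        yield index, pts, payload
        offset += size
        index += 1


def split_ivf(src, out_dir):
    """Write each frame payload to out_dir/frame_NNNN.bin; return the sizes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes = []
    for index, _pts, payload in iter_ivf_frames(Path(src).read_bytes()):
        (out_dir / f"frame_{index:04d}.bin").write_bytes(payload)
        sizes.append(len(payload))
    return sizes
```

Calling `split_ivf("av1_larger.ivf", "/tmp/frames")` and scanning the returned size list for very small entries is one way to spot candidate `show_existing_frame` stubs; confirming that a given small frame really is one still requires parsing its OBU headers.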