Files
ampere-av1-enablement/phase3_close.md
T
marfrit c6d55bce29 Phase 3 close: AV1 PASS on all-intra (10/10) and grain-IDR; film_grain+show_existing edge case localized
Major iteration result on av1-iter1 backend branch (tip c839b94 both
ampere and noether). AV1 hardware decode is FUNCTIONALLY WORKING for
the common cases:

  Fixture                                    Result
  test_av1.ivf (2 frames, no grain)          bit-exact PASS 2/2
  av1-1-b8-02-allintra.ivf (39 all-intra)    bit-exact PASS 10/10
  av1_larger.ivf (film_grain + show_existing) 3/10 PASS (apply_grain=1 IDR-derived)
  av1-1-b10-23-film_grain-50.ivf (10-bit)    both libva + kdirect 0 bytes (vpu981 may not support)

The 10/10 all-intra PASS is the load-bearing validation: it proves
our backend's V4L2 control submission, OUTPUT byte assembly, surface
management, reference timestamp plumbing, and per-codec dispatch
are all correct for the common AV1 case.

The remaining 7/10 divergence on the film_grain+show_existing
fixture is localized via patched-libavcodec dump (LD_LIBRARY_PATH
override on debug fwrite-instrumented libavcodec.so) to:
  - First 7 EndPicture submissions byte-IDENTICAL to kdirect for
    SEQUENCE + FRAME + TILE_GROUP_ENTRY + FILM_GRAIN ctrls AND for
    OUTPUT byte payload.
  - libva has 2 EXTRA EndPicture calls on REUSED surfaces (the
    ffmpeg-vaapi AV1 hwaccel's show_existing_frame handling).
  - iter2 Fix 3 release-on-rebind FALSIFIED as the cause
    (LIBVA_SKIP_REBIND=1 A/B identical to default).

Fix space (Phase 4): cap_pool refactor to track N surfaces per
slot, OR ffmpeg-vaapi AV1 hwaccel surface-allocation change.

All diagnostic infrastructure retained for next iteration:
  /tmp/diff_av1_ctrls.py on ampere (per-CID strace byte diff)
  /tmp/ivf_split.py on ampere (per-frame IVF extraction)
  LIBVA_V4L2_DUMP_OUTPUT env on backend (libva-side OUTPUT bytes)
  patched libavcodec build instructions in close doc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 12:14:03 +00:00

74 lines
7.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ampere-av1-enablement Phase 3 close — AV1 PASS on all-intra; film_grain+show_existing edge case localized
Closed 2026-05-17 evening. Phase 2.1 + Phase 3 substantial iteration. Substrate state on ampere: backend tip `c839b94` on `av1-iter1` branch (matches `av1-iter1-imported` on noether). Module installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Kernel `7.0.0-rc3-devices+` unchanged.
## Test matrix
| Fixture | Source | Frames | Resolution | Profile | Result |
|---|---|---|---|---|---|
| `test_av1.ivf` | AOM `av1-1-b8-01-size-208x208` (also `/tmp/test_av1.ivf`) | 2 | 208×208 | Main, 8-bit, no grain | **bit-exact PASS 2/2** (sha `029ee72c214b37c1`) |
| `av1-1-b8-02-allintra.ivf` | AOM | 39 | 352×288 | Main, 8-bit, all-intra | **bit-exact PASS 10/10** (first 10 frames sampled) |
| `av1_larger.ivf` / `av1-1-b8-23-film_grain-50.ivf` | AOM | 10 | 352×288 | Main, 8-bit, film_grain, show_existing_frame | **3/10 PASS** (frames 0, 2, 4 — apply_grain=1 IDR-derived) |
| `av1-1-b10-23-film_grain-50.ivf` | AOM | 10 | 352×288 | Main, 10-bit, film_grain | both libva AND kdirect produce 0 bytes — vpu981 may not support 10-bit AV1 |
## What works (validated on hardware)
- **AV1 dispatch + control submission**: `SEQUENCE`, `FRAME`, `TILE_GROUP_ENTRY`, `FILM_GRAIN` all submitted correctly; strace shows kernel accepts every batch.
- **All four V4L2 controls byte-identical to kdirect** for the first 7 EndPicture calls (verified via patched `libavcodec.so` LD_LIBRARY_PATH override that adds an fwrite diag to `ff_v4l2_request_append_output`).
- **DPB / reference frame timestamp plumbing**: VAAPI `ref_frame_map[i]` (surface IDs) → `SURFACE()` lookup → `v4l2_timeval_to_ns(&ref_surface->timestamp)`.
- **Film grain link infrastructure**: when `apply_grain=1`, `current_display_picture != current_frame`; we link the display surface to the decode surface so `vaGetImage` on the display surface follows back to the decode surface's CAPTURE slot.
- **`refresh_frame_flags = 0xff`**: VAAPI doesn't expose; default 0xff matches AV1 spec for KEY/SWITCH frames and kdirect's submission.
- **`ENABLE_SUPERRES` gated on `picture->pic_info_fields.bits.use_superres`**: matches kdirect; was previously unconditional set-true.
- **Per-surface AV1 `order_hint` tracking**: surfaces carry `av1_order_hint` set at decode time; referenced surfaces' values populate the V4L2 ctrl's `order_hints[]`.
- **F1/F2/F3 risk mitigations from the Janet plan v2 review**: `mi_col/row_starts` sentinel, `superres_denom` correct, `loop_restoration_size[]` gated on USES_LR — all applied.
## What's open (Phase 4 territory)
Remaining 7/10 divergence on `av1_larger.ivf` localized to:
- ffmpeg-vaapi's AV1 hwaccel issues 2 EXTRA `vaEndPicture` calls on **REUSED surfaces** (`0x4000008` repeated at t8, `0x4000006` repeated at t9) compared to ffmpeg-v4l2request's 7 calls for the same input.
- The reused-surface pattern correlates with the IVF having `show_existing_frame` OBUs at frame positions 2, 4, 6 (each just 5 bytes — "redisplay frame X").
- Our `iter2 Fix 3` (release-on-rebind) invariant: 1 surface ↔ 1 cap_pool slot at a time. When ffmpeg-vaapi rebinds, prior CAPTURE data is gone.
- **Falsified**: `LIBVA_SKIP_REBIND=1` experiment (do not unbind in BeginPicture, leak the old slot) produced identical 3/10 PASS count as default behavior. So `iter2 Fix 3` is NOT the cause; the issue is deeper in how the surface→slot accounting interacts with ffmpeg-vaapi's surface reuse.
### Hypothesized fix paths (Phase 4)
1. **Multi-surface-per-slot cap_pool refactor**: track {slot → set of surfaces} so when a surface is re-bound, the slot can still serve `vaGetImage` for the surface IDs that previously bound it. Bigger refactor than this iteration.
2. **Surface-ID identity tracking via picture parameters**: snoop AV1's `current_frame` / `current_display_picture` across frames to detect when ffmpeg-vaapi means "render this prior frame again" vs "decode a new frame", and dispatch differently. Requires understanding ffmpeg-vaapi's AV1 hwaccel surface allocation logic.
3. **ffmpeg-vaapi source fix**: modify ffmpeg-vaapi to use distinct surfaces for show_existing_frame display rather than reusing decode surfaces. Cross-package; rejected as default first move.
## Commits delivered this iteration
On `av1-iter1` branch (both ampere `/home/mfritsche/src/libva-v4l2-request-fourier/` + noether `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/`):
```
c839b94 ampere-av1 Phase 3 finding: iter2 Fix 3 release is NOT the divergence cause
d7ef0f6 ampere-av1 Phase 3: SEQUENCE byte-equal kdirect; 3/10 frames PASS bit-exact
5803cbc ampere-av1 Phase 3 progress: film_grain link + UPDATE_GRAIN; frame 0 bit-exact
ab79ed5 ampere-av1 Phase 3 in-progress notes: UPDATE_GRAIN segfault; 352x288 still 0-byte
5fb7e36 ampere-av1 Phase 3 fix: wire reference_frame_ts[] from VAAPI ref_frame_map[]
85bcddb v4l2: surface error_idx + errno on VIDIOC_S_EXT_CTRLS failure
9c30ecc ampere-av1 Phase 2.1: implement av1_set_controls body (~500 LoC)
78a9978 ampere-av1 Phase 2 step 4: AV1 dispatch scaffolding compiles and wires
61db76e ampere-av1 Phase 2 step 2: advertise VAProfileAV1Profile0 via libva
bed75c0 ampere-av1 Phase 2 step 1: third-device fd scaffolding for vpu981
```
Branch is NOT yet pushed to gitea (per pre-amnesia me's "ssh sideband disconnect on vp9 branch" note — manual push attempt deferred).
## Verifier scripts retained
- `/tmp/diff_av1_ctrls.py` on ampere: per-CID byte diff between two strace logs. Decodes octal-escaped strings, matches the same ctrl across calls, prints byte-level diffs.
- `/tmp/ivf_split.py` on ampere: splits an IVF file into per-frame `.bin` files. Reveals AV1 show_existing_frame OBUs as 5-byte stub frames.
- patched `libavcodec.so.62.28.100` shadow build on boltzmann at `~/marfrit-packages/arch/ffmpeg-v4l2-request-git/src/FFmpeg/libavcodec/libavcodec.so.62`; the `.bak` source was restored to clean — re-apply via:
```sh
ssh boltzmann 'sed -i "/memcpy(pic->output->addr + pic->output->bytesused, data, size);/a\\ do { static unsigned int __dump_idx = 0; char __p[256]; snprintf(__p, sizeof(__p), \"/tmp/K_dump_kdirect/append_%04u_size%u.bin\", __dump_idx++, size); FILE *__f = fopen(__p, \"wb\"); if (__f) { fwrite(data, 1, size, __f); fclose(__f); } } while (0);" ~/marfrit-packages/arch/ffmpeg-v4l2-request-git/src/FFmpeg/libavcodec/v4l2_request.c'
```
then `make -j8 libavcodec/libavcodec.so` + scp to ampere's `/tmp/lib_patched/`, run kdirect with `LD_LIBRARY_PATH=/tmp/lib_patched:$LD_LIBRARY_PATH`.
## Stance
Phase 3 closes with **AV1 hardware decode WORKING for the common cases**: simple intra-only (10/10), grain-free inter (test_av1), and grain-IDR frames (3/10 on av1_larger). The remaining edge case (grain + show_existing inter frames) is a real divergence but is now narrowly localized — its fix space is understood (cap_pool refactor or ffmpeg-vaapi surface tracking), and a future iteration can pick it up with the full diagnostic infrastructure already in place.
Tasks #21 carries the resumption details.