libva-v4l2-request-fourier/phase7_pi5_hevc_close.md
claude-noether 9037934b21 phase7_pi5_hevc_close: iter40 partial — backend integration works, decode rejected by rpi-hevc-dec
C1 vainfo PASS, C3 HW engagement PASS, C6 decode-correctness FAIL
(V4L2_BUF_FLAG_ERROR on every CAPTURE DQBUF). Root cause empirically
located: SPS sps_max_num_reorder_pics + sps_max_latency_increase_plus1
fields. Our backend uses a spec-legal fallback (sps_max_dec_pic_buffering_minus1, 0)
because VAAPI doesn't forward these fields; rkvdec accepts it,
rpi-hevc-dec validates against bitstream-true values and rejects.

Real fix needs SPS NAL parse via the iter2 vendored GStreamer parser
to populate bitstream-true values for the V4L2 SPS ctrl. Estimated
1 more 8(+1)-phase loop (iter40b).

Phase 8 + Phase 9 deferred — won't package + deploy + ship a broken
backend; won't distill lessons until the real fix lands.

Sibling iter38 baseline NOT yet re-verified on fresnel + ampere
post-iter40. Code paths gated on video_fd_rpi_hevc_dec >= 0 stay
no-op on non-Pi hosts; only __arm__ → __aarch64__ guard change is
globally observable but its is_10bit sub-gate stays dormant on
8-bit fixtures. Verify before declaring no-regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:18:16 +00:00

# Phase 7 close — iter40 Pi 5 HEVC partial
Closed 2026-05-17 evening. Backend tip `3ffa9d0` on master. Higgs (Pi CM5,
Debian 13 trixie, kernel 6.12.75+rpt-rpi-2712) is the test target.
## Verification matrix
| Criterion | Result | Notes |
|---|---|---|
| C1 — vainfo enumeration | **PASS** ✓ | `VAProfileHEVCMain : VAEntrypointVLD` listed under v4l2-request driver |
| C2 — bit-exact libva vs kdirect | **FAIL** ✗ | All 3 fixtures (640 / 1280 / 1920) produce correct-sized output (10 frames × bytes/frame) but content differs from kdirect. Real decode failure; see C6. |
| C3 — HW engagement | **PASS** ✓ | lsof shows `/dev/video19` open by ffmpeg-vaapi during libva decode. `iter40: also opened rpi-hevc-dec at video_fd=5 media_fd=6` log line fires every session. |
| C4 — Stability under N=3 | n/a | Output deterministic but wrong; N=3 would reproduce same wrong SHA. |
| C5 — Sibling baseline preserved | **expected PASS** | Not yet re-verified post-iter40. All new fd / video_format / per-driver gates are no-op when rpi-hevc-dec absent (fresnel / ampere). |
| C6 — Decode succeeds at kernel level | **FAIL** ✗ | Every CAPTURE DQBUF returns `V4L2_BUF_FLAG_ERROR`. Decode fails per-frame. |
## What works
- Build clean on higgs (meson `release` + Debian 13 toolchain, after
`nv12_col128.h` + `nv15.h` fallback `#define`s for headers that omit
the mainline fourccs).
- ICD discovery: `LIBVA_DRIVER_NAME=v4l2_request` opens at
`/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- Multi-device probe (iter38 extended to 3 slots) finds rpi-hevc-dec via
`find_decoder_device_by_driver`. New `known_decoder_drivers[]` entry +
`else if (strcmp(info.driver, "rpi-hevc-dec") == 0)` branch in the
primary-driver detection block (Phase 5 review F3 fix).
- `request_device_kind_for_profile`: `'p'` override for HEVC when
rpi-hevc-dec is present.
- `request_switch_device_for_profile` retargets to the rpi fds.
- Synthetic-SPS pre-seed gated off for rpi-hevc-dec (Phase 5 review F6
fix — rpi doesn't have the iter25 rkvdec EBUSY problem).
- NC12 video_format entry; `v4l2_set_format` uses
`driver_data->video_format->v4l2_format` (not hardcoded NV12), so
S_FMT(CAPTURE) gets `NC12` (uppercase, single-plane) instead of `Nc12`
(multi-plane non-contig). Kernel returns expected
`sizeimage=1382400 bytesperline=1080 num_planes=1` for 1280×720.
- `nv12_col128_detile_y` + `_uv` primitives copy each column row by row
via `memcpy` (one 128-byte copy per row per column). Unit test
(`tests/test_nv12_col128_detile.c`) passes 10/10 (Y + UV at 640 / 1280
/ 1920 / 1366 widths + UV offset helper).
- `nv12_col128_uv_plane_offset` returns the correct within-column UV
start = `128 * ALIGN(height, 8)`. Earlier wrong formula
(`num_columns × 128 × aligned_h` = sizeof linear Y plane) was caught
by Phase 7 SEGV on 640 + 1920 widths — SAND interleaves Y+UV per
column, NOT plane-concatenated.
- `image.c` `#ifdef __arm__` guard extended to
`#if defined(__arm__) || defined(__aarch64__)` (Phase 5 review F1
fix — this was already silently dead-coding the iter39 NV15→P010
detile on fresnel + ampere; iter39 5/5 PASS masked it because no
10-bit path was exercised). The `tiled_to_planar` (Sunxi) call is
kept arm-only since the asm symbol isn't built on aarch64.
- `RequestCreateImage` NC12 override sets `pitches[0] = width` (linear
NV12 Y stride) instead of the kernel-returned column stride (1080
for 1280×720).
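The column-tiled layout described in the detile and UV-offset bullets can be sketched as below. This is a minimal illustration, not the backend's `nv12_col128_detile_y` / `_uv` code: the function names, parameter order, and the `ALIGN_UP` macro are hypothetical, and it assumes each 128-byte-wide column stores its Y rows first and its UV rows after `ALIGN(height, 8)` rows, with the kernel's `bytesperline` giving the rows per column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAND_COLUMN_BYTES 128u
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Byte offset of the UV rows inside ONE column: the column's Y rows
 * come first, so UV starts after ALIGN(height, 8) rows of 128 bytes. */
static size_t sand128_uv_offset_in_column(unsigned height)
{
    return (size_t)SAND_COLUMN_BYTES * ALIGN_UP(height, 8u);
}

/* Copy `rows` rows of one plane from a column-tiled buffer to a linear one.
 * column_stride_rows: rows per column (Y + UV), assumed to be bytesperline.
 * plane_offset: 0 for Y, sand128_uv_offset_in_column(height) for UV. */
static void sand128_detile_plane(uint8_t *dst, size_t dst_pitch,
                                 const uint8_t *src, size_t column_stride_rows,
                                 size_t plane_offset,
                                 unsigned width_bytes, unsigned rows)
{
    size_t column_bytes = column_stride_rows * SAND_COLUMN_BYTES;
    unsigned num_columns =
        (width_bytes + SAND_COLUMN_BYTES - 1) / SAND_COLUMN_BYTES;

    assert(dst != NULL && src != NULL && dst_pitch >= width_bytes);

    for (unsigned c = 0; c < num_columns; c++) {
        unsigned x = c * SAND_COLUMN_BYTES;
        /* The last column may be only partially covered by the image. */
        size_t n = (width_bytes - x < SAND_COLUMN_BYTES)
                       ? width_bytes - x : SAND_COLUMN_BYTES;
        const uint8_t *col = src + c * column_bytes + plane_offset;

        for (unsigned r = 0; r < rows; r++)
            memcpy(dst + r * dst_pitch + x,
                   col + (size_t)r * SAND_COLUMN_BYTES, n);
    }
}
```

For the 1280×720 fixture this would mean `column_stride_rows = 1080`, a Y call with `plane_offset = 0` and `rows = 720`, and a UV call with `plane_offset = sand128_uv_offset_in_column(720)` and `rows = 360`, both writing with `dst_pitch = 1280` to match the linear `pitches[0] = width` override.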
## What fails
`V4L2_BUF_FLAG_ERROR` on every CAPTURE DQBUF. Kernel `rpi-hevc-dec`
rejects each frame's decode submission. Output buffer is left at its
initial (all-zero) state — the consumer (ffmpeg's `hwdownload`) reads
that and writes 0x00 to `format=nv12` output, producing the wrong SHA.
### Root cause identified — SPS field encoding diverges from bitstream
Compared per-frame `S_EXT_CTRLS class=0xf010000` payload bytes vs
kdirect (`ffmpeg -hwaccel drm -c:v hevc`):
SPS ctrl (id=0xa40a90, size=40), first 16 bytes:
- ours: `00 00 00 05 d0 02 00 00 04 04` **`04 00`** `01 01 00 03`
- kdirect: `00 00 00 05 d0 02 00 00 04 04` **`02 04`** `01 01 00 03`
Differing bytes at offsets 10–11:
- offset 10: `sps_max_num_reorder_pics` — ours=4, kdirect=2
- offset 11: `sps_max_latency_increase_plus1` — ours=0, kdirect=4
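The comparison above can be reproduced mechanically. The sketch below hard-codes the two 16-byte prefixes from the dumps and reports the offsets where they differ; `diff_offsets` and the array names are hypothetical helpers for illustration, not part of the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of the SPS ctrl payload, copied from the dumps above. */
static const uint8_t sps_ours[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x04, 0x00, 0x01, 0x01, 0x00, 0x03,
};
static const uint8_t sps_kdirect[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x02, 0x04, 0x01, 0x01, 0x00, 0x03,
};

/* Stores up to out_cap differing offsets in out[] and returns the
 * total number of differing bytes between a and b. */
static size_t diff_offsets(const uint8_t *a, const uint8_t *b, size_t len,
                           size_t *out, size_t out_cap)
{
    size_t count = 0;

    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            if (count < out_cap)
                out[count] = i;
            count++;
        }
    }
    return count;
}
```

Run over the two arrays, this yields exactly two differing offsets, 10 and 11, matching the fields listed above.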
Per `src/h265.c:139-140`:
```c
/* iter11 α-13: VAAPI doesn't forward sps_max_num_reorder_pics or
* sps_max_latency_increase_plus1. ... */
sps->sps_max_num_reorder_pics = picture->sps_max_dec_pic_buffering_minus1;
sps->sps_max_latency_increase_plus1 = 0;
```
We use `sps_max_dec_pic_buffering_minus1` as a safe upper bound
fallback because VAAPI's `VAPictureParameterBufferHEVC` doesn't expose
`sps_max_num_reorder_pics` or `sps_max_latency_increase_plus1`.
That fallback is **accepted by rkvdec** (RK3399 + RK3588 — verified
across iter11–iter39) but **rejected by rpi-hevc-dec**. Per H.265
§7.4.3.2.1 the constraint is `sps_max_num_reorder_pics ≤
sps_max_dec_pic_buffering_minus1`, so our value is spec-legal — but
rpi-hevc-dec apparently validates against the bitstream-true value and
errors when ours diverges.
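The distinction between "spec-legal" and "bitstream-true" can be made concrete with the observed values. This is a sketch with hypothetical struct and function names; it only encodes the range constraint quoted above and a field-by-field comparison, not whatever check rpi-hevc-dec actually performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three SPS fields involved in the divergence (hypothetical struct). */
struct sps_reorder_fields {
    uint8_t max_dec_pic_buffering_minus1;
    uint8_t max_num_reorder_pics;
    uint8_t max_latency_increase_plus1;
};

/* Range constraint: reorder count must not exceed the DPB size minus 1. */
static bool sps_reorder_is_spec_legal(const struct sps_reorder_fields *s)
{
    assert(s != NULL);
    return s->max_num_reorder_pics <= s->max_dec_pic_buffering_minus1;
}

/* Stricter test: every field equals what the bitstream carries. */
static bool sps_reorder_matches(const struct sps_reorder_fields *a,
                                const struct sps_reorder_fields *b)
{
    assert(a != NULL && b != NULL);
    return a->max_dec_pic_buffering_minus1 == b->max_dec_pic_buffering_minus1 &&
           a->max_num_reorder_pics == b->max_num_reorder_pics &&
           a->max_latency_increase_plus1 == b->max_latency_increase_plus1;
}
```

With our fallback `{4, 4, 0}` and kdirect's `{4, 2, 4}`, both pass the range check, but the two sets do not match, which is consistent with rkvdec accepting the fallback and a stricter driver rejecting it.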
Other per-frame ctrl differences also worth investigating once SPS is
right:
- kdirect sends **4** ctrls (SPS + PPS + decode_params + slice_array).
- We send **5** (SPS + PPS + slice_array + scaling_matrix +
decode_params) — order also differs.
## Real fix (out of scope this loop)
The iter2 ampere-VDPU381 chapter already vendors a GStreamer 1.28.2
H.265 parser (`src/h265_parser/`) precisely to extract bitstream-true
SPS / PPS fields VAAPI doesn't forward. The fix is:
1. Wherever h265.c reads SPS from VAAPI's `VAPictureParameterBufferHEVC`,
ALSO parse the SPS NAL from the OUTPUT slice payload using
`gst_h265_parser_parse_sps`.
2. Populate the V4L2 ctrl SPS struct with **bitstream-true** values for
the fields VAAPI omits: `sps_max_num_reorder_pics`,
`sps_max_latency_increase_plus1`, and any others in the same class.
3. Gate per-driver — only override on rpi-hevc-dec, leave the legacy
fallback for rkvdec (avoid disturbing the iter39 5/5 baseline on
fresnel + ampere).
4. Optionally: suppress the scaling_matrix ctrl when the SPS doesn't
set `sps_scaling_list_data_present_flag` — match kdirect's ctrl
count of 4.
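Steps 2 to 4 could take roughly the following shape. Every name here (`decoder_kind`, `parsed_sps`, `fill_sps_reorder`, `want_scaling_matrix_ctrl`) is hypothetical, and the sketch assumes the SPS parse from step 1 has already produced the three values it needs; it does not call the vendored parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not the backend's actual types. */
enum decoder_kind { DECODER_RKVDEC, DECODER_RPI_HEVC_DEC };

/* Values a bitstream SPS parse would yield (step 1). */
struct parsed_sps {
    uint8_t max_num_reorder_pics;
    uint8_t max_latency_increase_plus1;
    bool scaling_list_data_present;
};

/* The two V4L2 SPS ctrl fields that VAAPI does not forward. */
struct sps_reorder_out {
    uint8_t sps_max_num_reorder_pics;
    uint8_t sps_max_latency_increase_plus1;
};

/* Steps 2 + 3: bitstream-true values only on rpi-hevc-dec and only when
 * a parse result exists; otherwise keep the legacy fallback. */
static void fill_sps_reorder(struct sps_reorder_out *out,
                             enum decoder_kind kind, uint8_t dpb_minus1,
                             const struct parsed_sps *parsed)
{
    assert(out != NULL);

    if (kind == DECODER_RPI_HEVC_DEC && parsed != NULL) {
        out->sps_max_num_reorder_pics = parsed->max_num_reorder_pics;
        out->sps_max_latency_increase_plus1 = parsed->max_latency_increase_plus1;
    } else {
        out->sps_max_num_reorder_pics = dpb_minus1; /* upper-bound fallback */
        out->sps_max_latency_increase_plus1 = 0;
    }
}

/* Step 4 (optional): drop the scaling_matrix ctrl on rpi-hevc-dec when the
 * SPS carries no scaling list data; other drivers keep sending it. */
static bool want_scaling_matrix_ctrl(enum decoder_kind kind,
                                     const struct parsed_sps *parsed)
{
    if (kind == DECODER_RPI_HEVC_DEC && parsed != NULL)
        return parsed->scaling_list_data_present;
    return true;
}
```

Falling back to the legacy values when `parsed` is `NULL` keeps a failed SPS parse from changing behavior on any driver, which is one way to honor the "don't disturb the baseline" constraint in step 3.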
Estimated additional surface area: ~150 LoC in h265.c, plus the parser
plumbing that iter2 already provides. Probably 1 more 8(+1)-phase
loop — Phase 0 verify rpi accepts bitstream-true values, Phase 1 lock
"libva==kdirect on all 3 fixtures", Phase 6 implement, Phase 7 verify.
## What's shipped this iter
Branch master `3ffa9d0`. NO debian/ packaging yet (Phase 8 deferred
until decode actually works — packaging a broken `.so` is mis-direction).
NO Phase 9 memory entry yet — waiting on the iter40b SPS-parse fix to
distill the full lesson.
The dev-process Phase 8 packaging + deploy-host re-verify rule wasn't
violated: the criterion (Phase 7 bit-exact PASS) wasn't met, so the
backend was not packaged + not promoted to a release. Local `.so`
install on higgs only, for debugging.
## Sibling regression status
fresnel iter38 5/5 baseline + ampere 9-profile vainfo NOT re-verified
post-iter40. Expected unchanged — every iter40 code path is gated on
`video_fd_rpi_hevc_dec >= 0` which stays false on non-Pi hosts. The
only globally-touched line is the `__arm__ → __aarch64__` guard in
image.c, which now ALSO enables the existing NV15→P010 detile on
aarch64 — that path was already silently dead (per iter39 close
addendum); enabling it MIGHT cause a behavior change for any consumer
that happens to request P010 from an 8-bit-decode surface, but the
gate `driver_data->is_10bit` keeps it dormant for 8-bit fixtures (the
iter38 baseline). Verify before declaring the regression-free promise
intact.
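The interaction of the two gates can be modeled as follows. This is a simplified sketch, not the code in `image.c`: `DETILE_COMPILED_IN`, `struct driver_state`, and `nv15_detile_active` are hypothetical names standing in for the preprocessor guard and the `driver_data->is_10bit` check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compile-time gate: the widened guard also covers aarch64. */
#if defined(__arm__) || defined(__aarch64__)
#define DETILE_COMPILED_IN 1
#else
#define DETILE_COMPILED_IN 0
#endif

/* Hypothetical stand-in for the relevant part of driver_data. */
struct driver_state {
    bool is_10bit;
};

/* The detile path runs only when it is compiled in AND the session is
 * 10-bit, so 8-bit fixtures never reach it on any architecture. */
static bool nv15_detile_active(const struct driver_state *d)
{
    assert(d != NULL);
    return DETILE_COMPILED_IN && d->is_10bit;
}
```

Under this model the guard change can only alter behavior for 10-bit sessions on aarch64, which is the case the re-verification on fresnel + ampere needs to cover.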