C1 vainfo PASS, C3 HW engagement PASS, C6 decode-correctness FAIL (V4L2_BUF_FLAG_ERROR on every CAPTURE DQBUF). Root cause empirically located: SPS sps_max_num_reorder_pics + sps_max_latency_increase_plus1 fields. Our backend uses a spec-legal fallback (sps_max_dec_pic_buffering_minus1, 0) because VAAPI doesn't forward these fields; rkvdec accepts it, rpi-hevc-dec validates against bitstream-true values and rejects. Real fix needs SPS NAL parse via the iter2 vendored GStreamer parser to populate bitstream-true values for the V4L2 SPS ctrl. Estimated 1 more 8(+1)-phase loop (iter40b). Phase 8 + Phase 9 deferred — won't package + deploy + ship a broken backend; won't distill lessons until the real fix lands. Sibling iter38 baseline NOT yet re-verified on fresnel + ampere post-iter40. Code paths gated on video_fd_rpi_hevc_dec >= 0 stay no-op on non-Pi hosts; only __arm__ → __aarch64__ guard change is globally observable but its is_10bit sub-gate stays dormant on 8-bit fixtures. Verify before declaring no-regression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7 close — iter40 Pi 5 HEVC partial
Closed 2026-05-17 evening. Backend tip 3ffa9d0 on master. Higgs (Pi CM5,
Debian 13 trixie, kernel 6.12.75+rpt-rpi-2712) is the test target.
Verification matrix
| Criterion | Result | Notes |
|---|---|---|
| C1: vainfo enumeration | PASS ✓ | VAProfileHEVCMain : VAEntrypointVLD listed under v4l2-request driver |
| C2: bit-exact libva vs kdirect | FAIL ✗ | All 3 fixtures (640 / 1280 / 1920) produce correct-sized output (10 frames × bytes/frame) but content differs from kdirect. This is a real decode failure; see C6. |
| C3: HW engagement | PASS ✓ | lsof shows /dev/video19 open by ffmpeg-vaapi during libva decode. The "iter40: also opened rpi-hevc-dec at video_fd=5 media_fd=6" log line fires every session. |
| C4: Stability under N=3 | n/a | Output deterministic but wrong; N=3 would reproduce the same wrong SHA. |
| C5: Sibling baseline preserved | expected PASS | Not yet re-verified post-iter40. All new fd / video_format / per-driver gates are no-op when rpi-hevc-dec is absent (fresnel / ampere). |
| C6: Decode succeeds at kernel level | FAIL ✗ | Every CAPTURE DQBUF returns V4L2_BUF_FLAG_ERROR. Decode fails per-frame. |
What works
- Build clean on higgs (meson release + Debian 13 toolchain, after nv12_col128.h + nv15.h fallback #defines for headers that omit the mainline fourccs).
- ICD discovery: LIBVA_DRIVER_NAME=v4l2_request opens at /usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so.
- Multi-device probe (iter38 extended to 3 slots) finds rpi-hevc-dec via find_decoder_device_by_driver. New known_decoder_drivers[] entry + else if (strcmp(info.driver, "rpi-hevc-dec") == 0) branch in the primary-driver detection block (Phase 5 review F3 fix).
- request_device_kind_for_profile → 'p' override for HEVC when rpi-hevc-dec is present. request_switch_device_for_profile retargets to the rpi fds.
- Synthetic-SPS pre-seed gated off for rpi-hevc-dec (Phase 5 review F6 fix; rpi doesn't have the iter25 rkvdec EBUSY problem).
- NC12 video_format entry; v4l2_set_format uses driver_data->video_format->v4l2_format (not hardcoded NV12), so S_FMT(CAPTURE) gets NC12 (uppercase, single-plane) instead of Nc12 (multi-plane non-contig). Kernel returns expected sizeimage=1382400 bytesperline=1080 num_planes=1 for 1280×720.
- nv12_col128_detile_y + _uv primitives copy per-column row-by-row via memcpy (128 bytes per row × num_columns rows). Unit test (tests/test_nv12_col128_detile.c) passes 10/10 (Y + UV at 640 / 1280 / 1920 / 1366 widths + UV offset helper).
- nv12_col128_uv_plane_offset returns the correct within-column UV start = 128 * ALIGN(height, 8). Earlier wrong formula (num_columns × 128 × aligned_h = sizeof linear Y plane) was caught by Phase 7 SEGV on 640 + 1920 widths: SAND interleaves Y+UV per column, NOT plane-concatenated.
- image.c #ifdef __arm__ guard extended to #if defined(__arm__) || defined(__aarch64__) (Phase 5 review F1 fix: the old guard was already silently dead-coding the iter39 NV15→P010 detile on fresnel + ampere; iter39 5/5 PASS masked it because no 10-bit path was exercised). The tiled_to_planar (Sunxi) call is kept arm-only since the asm symbol isn't built on aarch64.
- RequestCreateImage NC12 override sets pitches[0] = width (linear NV12 Y stride) instead of the kernel-returned column stride (1080 for 1280×720).
What fails
V4L2_BUF_FLAG_ERROR on every CAPTURE DQBUF. Kernel rpi-hevc-dec
rejects each frame's decode submission. Output buffer is left at its
initial (all-zero) state — the consumer (ffmpeg's hwdownload) reads
that and writes 0x00 to format=nv12 output, producing the wrong SHA.
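For reference, this failure shows up in userspace as a flag on the dequeued buffer rather than as a failed ioctl. A minimal sketch of the check follows; the helper name is hypothetical, and the fallback #define repeats the value from the kernel's videodev2.h only so the snippet builds without kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Normally provided by <linux/videodev2.h>; repeated here as a fallback. */
#ifndef V4L2_BUF_FLAG_ERROR
#define V4L2_BUF_FLAG_ERROR 0x00000040
#endif

/* Hypothetical helper: true when a dequeued buffer's flags word (the
 * flags member of struct v4l2_buffer after VIDIOC_DQBUF) marks the
 * buffer contents as failed. */
static bool capture_buffer_failed(uint32_t v4l2_buffer_flags)
{
    return (v4l2_buffer_flags & V4L2_BUF_FLAG_ERROR) != 0;
}
```

A caller would pass buf.flags right after a successful VIDIOC_DQBUF on the CAPTURE queue and treat a true result as a per-frame decode failure.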
Root cause identified — SPS field encoding diverges from bitstream
Compared per-frame S_EXT_CTRLS class=0xf010000 payload bytes vs
kdirect (ffmpeg -hwaccel drm -c:v hevc):
SPS ctrl (id=0xa40a90, size=40), first 16 bytes:
- ours: 00 00 00 05 d0 02 00 00 04 04 04 00 01 01 00 03
- kdirect: 00 00 00 05 d0 02 00 00 04 04 02 04 01 01 00 03

Differing bytes at offsets 10 and 11:
- offset 10: sps_max_num_reorder_pics, ours=4, kdirect=2
- offset 11: sps_max_latency_increase_plus1, ours=0, kdirect=4
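The byte comparison can be reproduced with a small helper. This is a sketch, not the tooling used for the capture: the function names are hypothetical, the two arrays are the 16 bytes quoted above, and only the two offsets named in this note are labelled.

```c
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of the SPS control payload, as quoted in this note. */
static const uint8_t sps_ours[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x04, 0x00, 0x01, 0x01, 0x00, 0x03,
};
static const uint8_t sps_kdirect[16] = {
    0x00, 0x00, 0x00, 0x05, 0xd0, 0x02, 0x00, 0x00,
    0x04, 0x04, 0x02, 0x04, 0x01, 0x01, 0x00, 0x03,
};

/* Label only the offsets this note identifies; everything else is left
 * unnamed rather than guessed. */
static const char *sps_field_at(size_t off)
{
    switch (off) {
    case 10: return "sps_max_num_reorder_pics";
    case 11: return "sps_max_latency_increase_plus1";
    default: return "(unnamed)";
    }
}

/* Store the offsets where a and b differ into out (at most out_cap
 * entries) and return how many were stored. */
static size_t diff_offsets(const uint8_t *a, const uint8_t *b, size_t len,
                           size_t *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i] && n < out_cap)
            out[n++] = i;
    return n;
}
```

Calling diff_offsets(sps_ours, sps_kdirect, 16, out, 16) stores two offsets, which sps_field_at maps to the two fields listed above.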
Per src/h265.c:139-140:
/* iter11 α-13: VAAPI doesn't forward sps_max_num_reorder_pics or
* sps_max_latency_increase_plus1. ... */
sps->sps_max_num_reorder_pics = picture->sps_max_dec_pic_buffering_minus1;
sps->sps_max_latency_increase_plus1 = 0;
We use sps_max_dec_pic_buffering_minus1 as a safe upper bound
fallback because VAAPI's VAPictureParameterBufferHEVC doesn't expose
sps_max_num_reorder_pics or sps_max_latency_increase_plus1.
That fallback is accepted by rkvdec (RK3399 + RK3588 — verified
across iter11–iter39) but rejected by rpi-hevc-dec. Per H.265
§A.4.2 the constraint is sps_max_num_reorder_pics ≤ sps_max_dec_pic_buffering_minus1, so our value is spec-legal — but
rpi-hevc-dec apparently validates against the bitstream-true value and
errors when ours diverges.
Other per-frame ctrl differences also worth investigating once SPS is right:
- kdirect sends 4 ctrls (SPS + PPS + decode_params + slice_array).
- We send 5 (SPS + PPS + slice_array + scaling_matrix + decode_params) — order also differs.
Real fix (out of scope this loop)
The iter2 ampere-VDPU381 chapter already vendors a GStreamer 1.28.2
H.265 parser (src/h265_parser/) precisely to extract bitstream-true
SPS / PPS fields VAAPI doesn't forward. The fix is:
- Wherever h265.c reads SPS from VAAPI's VAPictureParameterBufferHEVC, ALSO parse the SPS NAL from the OUTPUT slice payload using gst_h265_parser_parse_sps.
- Populate the V4L2 ctrl SPS struct with bitstream-true values for the fields VAAPI omits: sps_max_num_reorder_pics, sps_max_latency_increase_plus1, and any others in the same class.
- Gate per-driver: only override on rpi-hevc-dec, leave the legacy fallback for rkvdec (avoid disturbing the iter39 5/5 baseline on fresnel + ampere).
- Optionally: suppress the scaling_matrix ctrl when the SPS doesn't set sps_scaling_list_data_present_flag, matching kdirect's ctrl count of 4.
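The per-driver gate in the steps above might look roughly like this. It is a sketch under stated assumptions, not the planned h265.c change: the struct and function names are hypothetical, and the parsed values are passed in as plain integers rather than read from the vendored parser's SPS structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the two fields VAAPI does not forward. */
struct reorder_fields {
    uint8_t sps_max_num_reorder_pics;
    uint8_t sps_max_latency_increase_plus1;
};

/*
 * Pick the values to place in the V4L2 SPS control.
 * driver : V4L2 driver name of the active decoder.
 * parsed : bitstream-true values from the SPS NAL, or NULL when no
 *          parsed SPS is available.
 * The legacy fallback (max_dec_pic_buffering_minus1, 0) stays in place
 * for every driver other than rpi-hevc-dec, and for rpi-hevc-dec when
 * parsing did not produce a result.
 */
static struct reorder_fields
choose_reorder_fields(const char *driver,
                      uint8_t sps_max_dec_pic_buffering_minus1,
                      const struct reorder_fields *parsed)
{
    struct reorder_fields out = {
        .sps_max_num_reorder_pics = sps_max_dec_pic_buffering_minus1,
        .sps_max_latency_increase_plus1 = 0,
    };

    if (parsed != NULL && strcmp(driver, "rpi-hevc-dec") == 0)
        out = *parsed;
    return out;
}
```

Keeping the fallback as the default means the rkvdec path sends exactly the values it sends today, which is the point of the gate.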
Estimated additional surface area: ~150 LoC in h265.c, plus the parser plumbing that iter2 already provides. Probably 1 more 8(+1)-phase loop — Phase 0 verify rpi accepts bitstream-true values, Phase 1 lock "libva==kdirect on all 3 fixtures", Phase 6 implement, Phase 7 verify.
What's shipped this iter
Branch master 3ffa9d0. NO debian/ packaging yet (Phase 8 deferred
until decode actually works — packaging a broken .so is mis-direction).
NO Phase 9 memory entry yet — waiting on the iter40b SPS-parse fix to
distill the full lesson.
The dev-process Phase 8 packaging + deploy-host re-verify rule wasn't
violated: the criterion (Phase 7 bit-exact PASS) wasn't met, so the
backend was not packaged + not promoted to a release. Local .so
install on higgs only, for debugging.
Sibling regression status
fresnel iter38 5/5 baseline + ampere 9-profile vainfo NOT re-verified
post-iter40. Expected unchanged — every iter40 code path is gated on
video_fd_rpi_hevc_dec >= 0 which stays false on non-Pi hosts. The
only globally-touched line is the __arm__ → __aarch64__ guard in
image.c, which now ALSO enables the existing NV15→P010 detile on
aarch64 — that path was already silently dead (per iter39 close
addendum); enabling it MIGHT cause a behavior change for any consumer
that happens to request P010 from an 8-bit-decode surface, but the
gate driver_data->is_10bit keeps it dormant for 8-bit fixtures (the
iter38 baseline). Verify before declaring the regression-free promise
intact.
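The two-level gating described here (compile-time architecture guard, then the runtime is_10bit check) can be pictured with the following sketch. The macro and function names are hypothetical and do not appear in image.c; only the preprocessor condition is taken from this note.

```c
#include <stdbool.h>

/* Compile-time gate: the widened guard from this iteration. */
#if defined(__arm__) || defined(__aarch64__)
#define NV15_DETILE_BUILT 1
#else
#define NV15_DETILE_BUILT 0
#endif

/* Runtime gate: even where the code is built, it only runs for 10-bit
 * streams, so 8-bit fixtures never reach it. */
static bool nv15_detile_runs(bool is_10bit)
{
    return NV15_DETILE_BUILT && is_10bit;
}
```

With is_10bit false the function returns false on every architecture, which is the basis for expecting the 8-bit baseline to be unchanged; it does not replace re-running that baseline on fresnel and ampere.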