Phase 1 measurable goal: HEVC Main 8-bit bit-exact libva-vs-kdirect on higgs for 640x360 / 1280x720 / 1920x1080 fixtures with HW path engagement verified via lsof + ffmpeg-vaapi log signal. Phase 2 surface-area audit: ~250 LoC backend + 100 LoC standalone detile primitive. Reuses iter38 multi-device-probe pattern (now 3 slots: rkvdec + hantro + rpi-hevc-dec) + iter2 per-driver gating shape. h265_set_controls + iter31 a-29 plumbing transfers unchanged. iter25 SPS pre-seed gated off for rpi-hevc-dec. Phase 3 baseline locked: N=3 bit-exact SW==kdirect for all three fixtures on higgs. kdirect engagement signal: Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8 Phase 4 plan: 7 sequenced steps (request.h -> request.c -> video.c -> nv12_col128.c new -> image.c branch -> meson/Makefile -> build on higgs). NC12 tile geometry locked from kernel hevc_d_video.c math + ffmpeg/Kynesim av_rpi_sand_to_planar_y8 byte-offset formula. Risks + mitigations enumerated. Phase 5 sonnet review explicitly requested per CLAUDE.md no-skip-reviews rule. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)
Per feedback_dev_process, this document covers Phase 1 (goal), Phase 2 (situation analysis),
Phase 3 (baselines), and Phase 4 (plan) for adding rpi-hevc-dec as a third
multi-device-probe slot in libva-v4l2-request-fourier. The Phase 0 substrate
and open-question answers live at `phase0_pi5_hevc.md`.
Phase 1 — Goal
libva-v4l2-request-fourier on higgs decodes HEVC Main 8-bit input, producing NV12 output bit-exact vs kdirect for three reference fixtures (640×360, 1280×720, 1920×1080; Main profile, libx265 ultrafast). HW path engagement is verified via the kernel-driver lsof signal (`/dev/video19` open) AND the ffmpeg-vaapi engagement signal (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).
Measurable:
| Criterion | Metric |
|---|---|
| C1 — vainfo enumeration | LIBVA_DRIVER_NAME=v4l2_request vainfo lists VAProfileHEVCMain : VAEntrypointVLD |
| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
| C3 — HW engagement | lsof shows /dev/video19 open by ffmpeg-vaapi during libva run |
| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |
Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API engagement testing, performance benchmarks. All later chapters.
Phase 2 — Situation Analysis
Backend architecture already in place
- Multi-device probe (iter38): at `VA_DRIVER_INIT` opens both `rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`. Stores per-driver fds (`video_fd_{rkvdec,hantro}`, `media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the "active" `driver_data->{video,media}_fd` per profile via `request_switch_device_for_profile()` (request.c:426-478).
- Per-driver feature gating: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}` pair, with `h265_set_controls` consulting the per-fd flag. Established by iter2 / Phase 5 review (request.h:99-100). This is the canonical per-driver gating shape for iter40.
- HEVC ctrl population: `h265_set_controls` populates the standard `V4L2_CID_STATELESS_HEVC_*` set (h265.c). Probe-gates EXT_SPS_*_RPS via the iter2 path; naturally dormant for rpi-hevc-dec since the controls don't exist.
- Synthetic SPS pre-seed (iter25/26): needed for rkvdec to resolve `image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec does NOT need this: it accepts NC12 + explicit dims on `S_FMT CAPTURE` directly. The pre-seed code path stays in place for rkvdec; rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
- CAPTURE detile primitive: `nv15_unpack_plane_to_p010()` (nv15.c) is the template; the backend already CPU-detiles when a Pi-or-Rockchip-specific CAPTURE format meets a linear consumer (VAImage NV12 / P010).
- Single-plane (S) vs multi-plane (M) handling: hantro uses MPLANE, rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38 already supports MPLANE handling for hantro; rpi reuses that.
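The per-driver gating shape described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: `struct request_data_sketch` and `active_has_hevc_ext_sps_rps()` are hypothetical names, and only the fields mentioned in this document are modelled.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the backend's request_data;
 * only the fields discussed in this document are modelled. */
struct request_data_sketch {
    int video_fd;                 /* currently active decoder fd */
    int video_fd_rkvdec;          /* -1 when the driver is absent */
    int video_fd_hantro;
    int video_fd_rpi_hevc_dec;    /* new third slot */
    bool has_hevc_ext_sps_rps_rkvdec;
    bool has_hevc_ext_sps_rps_hantro;
    bool has_hevc_ext_sps_rps_rpi_hevc_dec;
};

/* Return the EXT_SPS_*_RPS capability flag that belongs to whichever
 * device is currently active. Closed or unknown fds report false. */
static bool active_has_hevc_ext_sps_rps(const struct request_data_sketch *d)
{
    if (d->video_fd < 0)
        return false;
    if (d->video_fd == d->video_fd_rkvdec)
        return d->has_hevc_ext_sps_rps_rkvdec;
    if (d->video_fd == d->video_fd_hantro)
        return d->has_hevc_ext_sps_rps_hantro;
    if (d->video_fd == d->video_fd_rpi_hevc_dec)
        return d->has_hevc_ext_sps_rps_rpi_hevc_dec;
    return false;
}
```

Checking `video_fd < 0` first keeps an absent slot (fd stays -1) from accidentally matching an inactive session.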
Surface area to touch (audit)
| File | What changes | Size |
|---|---|---|
| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→`'p'` when rpi-hevc-dec is present, else `'r'`. (d) Extend `request_switch_device_for_profile` `'p'` branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) video_format entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |
Total estimated diff: ~250 LoC backend + ~100 LoC standalone primitive. Roughly half the surface area of iter38; smaller than iter2.
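The `image.c` gate from the audit table can be sketched as a standalone predicate. This is an illustration under stated assumptions: the `SKETCH_*` macros are local stand-ins so the snippet builds without libva or kernel headers, the NV12 value matches `VA_FOURCC_NV12`, and the NC12 value assumes the Raspberry Pi kernel tree defines the format as the fourcc `'N','C','1','2'`. `needs_nc12_detile()` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; real code would use <va/va.h> and <linux/videodev2.h>. */
#define SKETCH_FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))
#define SKETCH_VA_FOURCC_NV12           SKETCH_FOURCC('N', 'V', '1', '2')
#define SKETCH_V4L2_PIX_FMT_NV12        SKETCH_FOURCC('N', 'V', '1', '2')
/* Assumption: NC12 fourcc as defined in the raspberrypi/linux tree. */
#define SKETCH_V4L2_PIX_FMT_NV12_COL128 SKETCH_FOURCC('N', 'C', '1', '2')

/* True only when a linear NV12 VAImage is requested from an NC12 surface.
 * A linear-NV12 surface never matches, so the existing path is untouched. */
static bool needs_nc12_detile(uint32_t image_fourcc, uint32_t v4l2_format)
{
    return image_fourcc == SKETCH_VA_FOURCC_NV12 &&
           v4l2_format == SKETCH_V4L2_PIX_FMT_NV12_COL128;
}
```

Because both conditions must hold, surfaces decoded by rkvdec or hantro into linear NV12 keep taking the existing copy path.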
What does NOT change
- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by driver_kind check that's already implicit in the rkvdec fd routing).
- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec unchanged. Same plumbing.
- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is HEVC-only; VP8 still goes through hantro on RK).
- iter38 single-libva-session multi-codec semantics: extends from 5 codecs to 5+1 (HEVC reroutes on Pi).
NC12 / SAND128 tile geometry — locked contract
From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
(via GitHub raspberrypi/linux, branch rpi-6.12.y):

    case V4L2_PIX_FMT_NV12_COL128:
        width = ALIGN(width, 128); /* Width rounds up to columns */
        height = ALIGN(height, 8);
        bytesperline = constrain2x(bytesperline, height * 3 / 2);
        sizeimage = bytesperline * width;
        break;

For 1280×720:
- width = 1280 (already 128-aligned)
- height = 720 (already 8-aligned)
- bytesperline = 720 × 3/2 = 1080 (matches Phase 0 strace observation)
- sizeimage = 1080 × 1280 = 1,382,400 (matches strace; equals the linear NV12 byte count, 1280 × 720 × 3/2, because neither dimension needs padding)
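The arithmetic above can be reproduced with a small helper. This is a sketch, not the kernel code: `nc12_layout_for()` and `struct nc12_layout` are hypothetical names, and `constrain2x_sketch()` encodes an assumed reading of the kernel's `constrain2x()` (keep the requested value only if it lies between the minimum and twice the minimum, otherwise use the minimum).

```c
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Assumed behaviour of the kernel helper constrain2x(): accept a
 * caller-supplied value only inside [min, 2 * min], else fall back. */
static uint32_t constrain2x_sketch(uint32_t requested, uint32_t min)
{
    if (requested < min || requested > 2 * min)
        return min;
    return requested;
}

struct nc12_layout {
    uint32_t width;        /* padded to a multiple of 128 */
    uint32_t height;       /* padded to a multiple of 8 */
    uint32_t bytesperline; /* column stride, counted in 128-byte rows */
    uint32_t sizeimage;    /* total buffer size in bytes */
};

/* Mirror the NC12 case of the driver's format-fixup switch. */
static struct nc12_layout nc12_layout_for(uint32_t width, uint32_t height,
                                          uint32_t requested_bpl)
{
    struct nc12_layout l;

    l.width = ALIGN_UP(width, 128u);
    l.height = ALIGN_UP(height, 8u);
    l.bytesperline = constrain2x_sketch(requested_bpl, l.height * 3 / 2);
    l.sizeimage = l.bytesperline * l.width;
    return l;
}
```

Calling `nc12_layout_for(1280, 720, 0)` yields `bytesperline == 1080` and `sizeimage == 1382400`, the values listed for the 1280×720 fixture.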
Geometry interpretation (cross-verified against ffmpeg/Kynesim
`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
- Image is divided into `(width + 127) / 128` columns; each column is 128 px wide × height px tall.
- Within a column: `128 × height` bytes of Y data, immediately followed by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline` bytes per column, where `bytesperline` is the column stride counted in 128-byte rows).
- Across columns: column N starts at offset `N × stride1 × stride2` where `stride1 = 128` (column width) and `stride2 = bytesperline`.
- Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline` for Y; same formula with `y/2` for UV, measured from the UV base, which sits `128 × height` bytes after the Y base (each column's CbCr follows its own Y block, per the second bullet; it is not a separate plane placed after all Y columns).
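The offset formula can be written out as two small functions. This is a sketch under one explicit assumption: each column's CbCr block sits directly after that column's `128 × height` luma block, so the chroma base is `128 × height` bytes into the buffer and shares the luma column stride. `nc12_y_offset()` and `nc12_cb_offset()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of luma sample (x, y) from the start of the buffer.
 * stride2 is the column stride in 128-byte rows (V4L2 bytesperline). */
static size_t nc12_y_offset(uint32_t x, uint32_t y, uint32_t stride2)
{
    return (size_t)(x & 127u) + (size_t)y * 128u +
           (size_t)(x & ~127u) * stride2;
}

/* Byte offset of the Cb byte covering pixel (x, y); Cr follows at +1.
 * Assumes per-column chroma placement as described in the lead-in;
 * height is the 8-aligned luma height. */
static size_t nc12_cb_offset(uint32_t x, uint32_t y, uint32_t height,
                             uint32_t stride2)
{
    uint32_t xb = x & ~1u; /* byte position of the CbCr pair in the row */

    return (size_t)128u * height + nc12_y_offset(xb, y / 2, stride2);
}
```

For 1280×720 with `stride2 = 1080`, the Cr byte of the bottom-right pixel lands at offset 1,382,399, the last byte of the 1,382,400-byte buffer, which is a quick sanity check on the layout.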
Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
the single-stripe fast-path math; we don't import the NEON ASM (CPU
detile is the safe path for Phase 1; SIMD is a Phase 2 perf bump if needed).
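A per-column row-memcpy detile along these lines might look like the sketch below. It is not the Kynesim code and not the final primitive: the `_sketch` suffix marks it as illustrative, the destination planes are assumed tightly packed (stride equal to `width`), `width` is assumed even, and `src_uv` is assumed to point at the first column's chroma block (`src_y + 128 × aligned height`).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NC12_COL_W 128u

/* Copy one plane (luma, or interleaved CbCr treated as plain bytes) out
 * of 128-byte-wide columns into a tightly packed linear plane. */
static void detile_plane(uint8_t *dst, const uint8_t *src,
                         uint32_t width_bytes, uint32_t rows,
                         uint32_t stride2)
{
    for (uint32_t x = 0; x < width_bytes; x += NC12_COL_W) {
        /* Last column may be only partly covered by visible pixels. */
        uint32_t n = width_bytes - x < NC12_COL_W ? width_bytes - x
                                                  : NC12_COL_W;
        /* x is a multiple of 128, so x * stride2 is the column start. */
        const uint8_t *col = src + (size_t)x * stride2;

        for (uint32_t y = 0; y < rows; y++)
            memcpy(dst + (size_t)y * width_bytes + x,
                   col + (size_t)y * NC12_COL_W, n);
    }
}

/* Illustrative counterpart of nv12_col128_detile_to_nv12(). */
static void nv12_col128_detile_to_nv12_sketch(uint8_t *dst_y, uint8_t *dst_uv,
                                              const uint8_t *src_y,
                                              const uint8_t *src_uv,
                                              uint32_t width, uint32_t height,
                                              uint32_t src_stride2)
{
    detile_plane(dst_y, src_y, width, height, src_stride2);
    /* 4:2:0 interleaved chroma: width bytes per row, height / 2 rows. */
    detile_plane(dst_uv, src_uv, width, height / 2, src_stride2);
}
```

Treating the interleaved CbCr rows as plain bytes lets one loop serve both planes; only the row count and base pointer differ.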
Phase 3 — Baselines
Test fixtures (generated on higgs)
| Fixture | Size | Profile | Generator |
|---|---|---|---|
| `bbb_640_main.mp4` | 640×360 | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
| `bbb_1280_main.mp4` | 1280×720 | Main 8-bit | same |
| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |
Captured 2026-05-17 evening on higgs
For each fixture, N=3 reps. Both SW (no hwaccel) and kdirect
(`ffmpeg -hwaccel drm -c:v hevc`) are run with output options
`-frames:v 10 -f rawvideo -pix_fmt nv12`; the values below are the first
16 hex chars of each sha256:

    bbb_640_main  SW={9a81038065e9b7cd} HW={9a81038065e9b7cd} → BIT-EXACT × N=3
    bbb_1280_main SW={d3bb055655d6f195} HW={d3bb055655d6f195} → BIT-EXACT × N=3
    bbb_1920_main SW={0bc2bd6f693db039} HW={0bc2bd6f693db039} → BIT-EXACT × N=3

HW engagement signal (per-run): Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8
This is the kdirect baseline. Phase 7 verification will compare libva output against these SHAs.
Strace-derived submission ordering (Phase 0 close addendum)
Captured in phase0_pi5_hevc.md. Briefly: standard V4L2-request
stateless flow, both queues DMABUF, no SPS pre-seed dance needed
(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).
Phase 4 — Plan
Implementation steps (sequenced)
1. `request.h`: extend `request_data` with the new fd pair + ext_sps flag, mirroring iter38/iter2 layout. (no behavior change yet)
2. `request.c`:
   - `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new driver string.
   - Init -1 block extends to new fds.
   - Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe `rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or `hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes for the other two return absent (their fds stay -1).
   - `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`, prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else fall through to `'r'` (rkvdec). All other profiles stay routed as today.
   - `request_switch_device_for_profile`: add `'p'` branch.
   - ext_sps probe runs on the new fd; result stored in `has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
3. `video.c`: add NC12 video_format entry. Mark it MPLANE-only (per Phase 0 strace). bytesperline/sizeimage formula encoded per kernel driver math.
4. `src/nv12_col128.c` + `.h` (NEW): single-file primitive, `nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width, height, src_stride2)`. CPU per-column row-memcpy loop; not NEON for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
5. `image.c`: branch in `copy_surface_to_image`. Gate: `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`. Calls the primitive. Existing NV12-linear path stays.
6. `meson.build` + `Makefile.am`: source list updates.
7. Build clean on higgs: the first build target IS higgs (since iter40 only matters on Pi). Cross-build for ampere/fresnel is unaffected because they don't have rpi-hevc-dec; the new fd stays -1 and the per-driver routing falls through to existing rkvdec/hantro paths.
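The HEVC routing rule and the pre-seed gate from the steps above reduce to two tiny decisions, sketched here in isolation. Both function names are hypothetical, and the sketch covers only the HEVC case; routing for every other profile is deliberately left out because it stays as it is today.

```c
#include <stdbool.h>

/* Device kind for VAProfileHEVCMain: prefer rpi-hevc-dec ('p') whenever
 * its fd was opened by the probe, otherwise fall through to rkvdec ('r'). */
static char hevc_device_kind(int video_fd_rpi_hevc_dec)
{
    return video_fd_rpi_hevc_dec >= 0 ? 'p' : 'r';
}

/* Extra gate for the synthetic SPS pre-seed call site: never allowed
 * on rpi-hevc-dec. Existing conditions for other drivers still apply. */
static bool sps_preseed_allowed(char driver_kind)
{
    return driver_kind != 'p';
}
```

On a board without rpi-hevc-dec the fd stays -1, so `hevc_device_kind(-1)` returns `'r'` and the pre-seed remains allowed, matching the pre-iter40 behavior.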
Verification gates (Phase 7 acceptance)
- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3, libdrm-dev 2.4.131).
- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3 recorded value).
- `lsof` during libva decode shows `/dev/video19` open.
- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent still 5/5 PASS (no regression to existing routing).
Risks + mitigations
| Risk | Mitigation |
|---|---|
| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in tests/test_nv12_col128_detile.c with hand-crafted NC12 bytes + known linear output, before integration. |
| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
| Debian build env differs from Arch — see feedback_package_build_flags_unmask_bugs | Build with explicit -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong flags to match Debian dpkg-buildflags. |
| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on driver_kind != 'p' in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; has_hevc_ext_sps_rps_rpi_hevc_dec = false naturally. |
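For the detile unit test named in the first risk row, hand-crafted NC12 input could be produced by scattering a known linear frame into the column layout. The helper below is a hypothetical test-side utility, not part of the planned backend diff; it assumes a tightly packed linear source, an even `width`, an 8-aligned `height`, and per-column chroma placement (CbCr directly after each column's luma block).

```c
#include <stddef.h>
#include <stdint.h>

/* Scatter a tightly packed linear NV12 frame into NC12 column layout.
 * dst must hold at least 128 * stride2 * ((width + 127) / 128) bytes;
 * stride2 is the column stride in 128-byte rows. */
static void nv12_to_nc12_for_test(uint8_t *dst, const uint8_t *src_y,
                                  const uint8_t *src_uv, uint32_t width,
                                  uint32_t height, uint32_t stride2)
{
    /* Chroma base: right after the first column's luma block. */
    uint8_t *dst_uv = dst + (size_t)128u * height;

    for (uint32_t y = 0; y < height; y++)
        for (uint32_t x = 0; x < width; x++)
            dst[(x & 127u) + (size_t)y * 128u + (size_t)(x & ~127u) * stride2] =
                src_y[(size_t)y * width + x];

    for (uint32_t y = 0; y < height / 2; y++)
        for (uint32_t x = 0; x < width; x++)
            dst_uv[(x & 127u) + (size_t)y * 128u + (size_t)(x & ~127u) * stride2] =
                src_uv[(size_t)y * width + x];
}
```

Feeding the result through the detile primitive and comparing against the original linear planes gives a round-trip check that does not depend on hardware output.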
Phase 5 review explicitly requested
Per CLAUDE.md global "Reviews are never skippable" + feedback_review_empirical_over_theoretical: this plan goes to a sonnet Plan-agent review. Specific review focus:
- Routing correctness when 0/1/2/3 of the three drivers are present.
- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly? Did we miss UV stride considerations?
- `image.c` gate predicate: does it exclude any legitimate NV12-linear case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
- Cross-device regression scope (fresnel + ampere paths untouched?).
Empty-result review IS a green light; "we should have skipped it" is the prohibited move.