libva-v4l2-request-fourier/phase1_pi5_hevc.md
claude-noether bf52725ab3 phase1_pi5_hevc: lock goal + situation + N=3 baseline + plan (iter40)
Phase 1 measurable goal: HEVC Main 8-bit bit-exact libva-vs-kdirect
on higgs for 640x360 / 1280x720 / 1920x1080 fixtures with HW path
engagement verified via lsof + ffmpeg-vaapi log signal.

Phase 2 surface-area audit: ~250 LoC backend + 100 LoC standalone
detile primitive. Reuses iter38 multi-device-probe pattern (now
3 slots: rkvdec + hantro + rpi-hevc-dec) + iter2 per-driver
gating shape. h265_set_controls + iter31 a-29 plumbing transfers
unchanged. iter25 SPS pre-seed gated off for rpi-hevc-dec.

Phase 3 baseline locked: N=3 bit-exact SW==kdirect for all three
fixtures on higgs. kdirect engagement signal:
  Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19;
  buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8

Phase 4 plan: 7 sequenced steps (request.h -> request.c -> video.c
-> nv12_col128.c new -> image.c branch -> meson/Makefile -> build
on higgs). NC12 tile geometry locked from kernel hevc_d_video.c
math + ffmpeg/Kynesim av_rpi_sand_to_planar_y8 byte-offset formula.
Risks + mitigations enumerated.

Phase 5 sonnet review explicitly requested per CLAUDE.md
no-skip-reviews rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:00:35 +00:00


Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)

Per feedback_dev_process, this chapter covers Phase 1 (goal), Phase 2 (situation analysis), Phase 3 (baselines), and Phase 4 (plan) for adding rpi-hevc-dec as a third multi-device-probe slot in libva-v4l2-request-fourier. Phase 0 substrate + open-question answers live at phase0_pi5_hevc.md.

Phase 1 — Goal

libva-v4l2-request-fourier on higgs decodes HEVC Main 8-bit input producing NV12 output bit-exact vs kdirect for three reference fixtures (640×360, 1280×720, 1920×1080 — Main profile, libx265 ultrafast). HW path engagement verified via the kernel-driver lsof signal (/dev/video19 open) AND ffmpeg-vaapi engagement signal (Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19).

Measurable:

| Criterion | Metric |
|---|---|
| C1: vainfo enumeration | LIBVA_DRIVER_NAME=v4l2_request vainfo lists VAProfileHEVCMain : VAEntrypointVLD |
| C2: bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
| C3: HW engagement | lsof shows /dev/video19 open by ffmpeg-vaapi during libva run |
| C4: Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
| C5: Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |

Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API engagement testing, performance benchmarks. All later chapters.

Phase 2 — Situation Analysis

Backend architecture already in place

  • Multi-device probe (iter38): at VA_DRIVER_INIT opens both rkvdec + hantro-vpu via find_decoder_device_by_driver(name). Stores per-driver fds (video_fd_{rkvdec,hantro}, media_fd_{rkvdec,hantro}). RequestCreateConfig retargets the "active" driver_data->{video,media}_fd per profile via request_switch_device_for_profile() (request.c:426-478).
  • Per-driver feature gating: request_data->has_hevc_ext_sps_rps_{rkvdec,hantro} pair, with h265_set_controls consulting the per-fd flag. Established by iter2 / Phase 5 review (request.h:99-100). This is the canonical per-driver gating shape for iter40.
  • HEVC ctrl population: h265_set_controls populates the standard V4L2_CID_STATELESS_HEVC_* set (h265.c). Probe-gates EXT_SPS_*_RPS via the iter2 path — naturally dormant for rpi-hevc-dec since the controls don't exist.
  • Synthetic SPS pre-seed (iter25/26): needed for rkvdec to resolve image_fmt before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec does NOT need this — it accepts NC12 + explicit dims on S_FMT CAPTURE directly. The pre-seed code path stays in place for rkvdec; rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
  • CAPTURE detile primitive: nv15_unpack_plane_to_p010() (nv15.c) is the template. The backend already CPU-detiles when a Pi- or Rockchip-specific CAPTURE format meets a linear consumer (VAImage NV12 / P010).
  • Single-plane (S) vs multi-plane (M) handling: hantro uses MPLANE, rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38 already supports MPLANE handling for hantro; rpi reuses that.

Surface area to touch (audit)

| File | What changes | Size |
|---|---|---|
| src/request.h | Add video_fd_rpi_hevc_dec, media_fd_rpi_hevc_dec, has_hevc_ext_sps_rps_rpi_hevc_dec (mirror iter38 + iter2 layout) | ~10 lines |
| src/request.c | (a) Extend init -1 block to cover new fds. (b) Recognize rpi-hevc-dec as a 3rd primary/alt driver string in the probe loop. (c) Extend request_device_kind_for_profile so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend request_switch_device_for_profile 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
| src/video.c | Add V4L2_PIX_FMT_NV12_COL128 (NC12) video_format entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
| src/nv12_col128.c (NEW) | nv12_col128_detile_to_nv12(): Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim av_rpi_sand_to_planar_y8 core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux hevc_dec/hevc_d_video.c size formula. | ~80 lines + 30-line header |
| src/image.c | Add NC12 → NV12 branch in copy_surface_to_image, gated on image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128 (sibling to existing NV15→P010 branch). | ~25 lines |
| src/meson.build + src/Makefile.am | List nv12_col128.c/.h in sources | 2 lines |

Total estimated diff: ~250 LoC backend + ~100 LoC standalone primitive. Roughly half the surface area of iter38; smaller than iter2.

What does NOT change

  • iter25/26 SPS pre-seed: stays on rkvdec path only (gated by driver_kind check that's already implicit in the rkvdec fd routing).
  • iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
  • iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec unchanged. Same plumbing.
  • iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is HEVC-only; VP8 still goes through hantro on RK).
  • iter38 single-libva-session multi-codec semantics: extends from 5 codecs to 5+1 (HEVC reroutes on Pi).

NC12 / SAND128 tile geometry — locked contract

From kernel driver drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c (via github raspberrypi/linux rpi-6.12.y):

case V4L2_PIX_FMT_NV12_COL128:
    width = ALIGN(width, 128);           /* Width rounds up to columns */
    height = ALIGN(height, 8);
    bytesperline = constrain2x(bytesperline, height * 3 / 2);
    sizeimage = bytesperline * width;
    break;

For 1280×720:

  • width = 1280 (already 128-aligned)
  • height = 720 (already 8-aligned)
  • bytesperline = 720 × 3/2 = 1080 (matches Phase 0 strace observation)
  • sizeimage = 1080 × 1280 = 1,382,400 (matches strace; equals the linear NV12 byte count, 1280 × 720 × 3/2, because neither dimension needed padding)
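The same arithmetic can be sketched for all three fixtures. This is a minimal sketch, not the kernel's code: nc12_min_geometry and struct nc12_geometry are hypothetical names, and it assumes the caller asks for the minimum column stride (height × 3/2), so the constrain2x clamp is not modelled.

```c
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
static uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Hypothetical container for the NC12 CAPTURE layout values. */
struct nc12_geometry {
    uint32_t width;        /* padded to whole 128-px columns */
    uint32_t height;       /* padded to a multiple of 8 rows */
    uint32_t bytesperline; /* column stride, counted in 128-byte rows */
    uint32_t sizeimage;    /* total buffer size in bytes */
};

/* Assumption: the minimum stride (height * 3 / 2) is what gets used;
 * the kernel's constrain2x() may also accept a larger requested value. */
static struct nc12_geometry nc12_min_geometry(uint32_t width, uint32_t height)
{
    struct nc12_geometry g;

    g.width = align_up(width, 128);
    g.height = align_up(height, 8);
    g.bytesperline = g.height * 3 / 2;
    g.sizeimage = g.bytesperline * g.width;
    return g;
}
```

Under these assumptions 640×360 gives bytesperline 540 and sizeimage 345,600, and 1920×1080 gives 1620 and 3,110,400; only the 1280×720 values have been checked against the strace.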

Geometry interpretation (cross-verified against ffmpeg/Kynesim rpi_sand_fn_pw.h av_rpi_sand_to_planar_y8):

  • Image is divided into (width + 127) / 128 columns; each column is 128 px wide × height px tall.
  • Within a column: 128 × height bytes of Y data, immediately followed by 128 × height/2 bytes of interleaved CbCr (so 128 × bytesperline bytes per column, where bytesperline is the column stride).
  • Across columns: column N starts at offset N × stride1 × stride2 where stride1 = 128 (column width) and stride2 = bytesperline.
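  • Pixel (x, y) byte offset = (x & 127) + y × 128 + (x & ~127) × bytesperline for Y. The UV bytes use the same formula with y/2 in place of y, plus a base offset of 128 × height: each column's CbCr rows directly follow that column's Y rows, so the UV data starts 128 × height bytes into the first column and shares the Y column stride.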
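The per-pixel addressing can be written out as two small helpers. This is an illustrative sketch with hypothetical names (nc12_y_offset, nc12_cb_offset). It takes the in-column layout literally: each column's CbCr rows directly follow its Y rows, so chroma starts 128 × height bytes into the column. Here height means the 8-aligned height and stride2 is bytesperline.

```c
#include <stddef.h>
#include <stdint.h>

#define NC12_COL_W 128u /* column width in bytes (and luma pixels) */

/* Byte offset of luma sample (x, y) from the start of the buffer. */
static size_t nc12_y_offset(uint32_t x, uint32_t y, uint32_t stride2)
{
    size_t col = x / NC12_COL_W;

    return col * NC12_COL_W * stride2      /* start of the column   */
         + (size_t)y * NC12_COL_W          /* row inside the column */
         + (x % NC12_COL_W);               /* byte inside the row   */
}

/* Byte offset of the Cb byte of chroma sample (cx, cy); Cr is the next
 * byte. Assumption: chroma rows sit right after the column's luma rows. */
static size_t nc12_cb_offset(uint32_t cx, uint32_t cy,
                             uint32_t height, uint32_t stride2)
{
    uint32_t bx = 2 * cx; /* byte position in the interleaved CbCr row */
    size_t col = bx / NC12_COL_W;

    return col * NC12_COL_W * stride2
         + (size_t)(height + cy) * NC12_COL_W
         + (bx % NC12_COL_W);
}
```

For the 1280×720 case (stride2 = 1080), luma pixel (128, 0) is the first byte of the second column at offset 128 × 1080 = 138,240, and the first Cb byte sits at 128 × 720 = 92,160.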

Reference for the detile loop: av_rpi_sand_to_planar_y8 (Kynesim ffmpeg, libavutil/rpi_sand_fn_pw.h with PW=1). Our primitive copies the single-stripe fast-path math; we don't import the NEON asm. CPU detile is the safe path for Phase 1; SIMD is a follow-up perf bump if needed.
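A per-column row-memcpy detile could look roughly like the following. It is a sketch written from the geometry notes above, not a copy of the Kynesim code. The function name and argument order follow the plan's proposed signature, while the body rests on assumptions: the destination NV12 planes are tightly packed (pitch equal to width), width is even, and src_uv points 128 × aligned-height bytes past src_y.

```c
#include <stdint.h>
#include <string.h>

/* Copy NC12 (128-byte-wide columns) into linear NV12.
 * src_stride2 is the column stride in 128-byte rows (bytesperline). */
static void nv12_col128_detile_to_nv12(uint8_t *dst_y, uint8_t *dst_uv,
                                       const uint8_t *src_y,
                                       const uint8_t *src_uv,
                                       uint32_t width, uint32_t height,
                                       uint32_t src_stride2)
{
    const uint32_t col_w = 128;
    const size_t col_bytes = (size_t)col_w * src_stride2;

    for (uint32_t x0 = 0; x0 < width; x0 += col_w) {
        /* The last column may be only partly visible. */
        uint32_t n = (width - x0 < col_w) ? width - x0 : col_w;
        const uint8_t *col_y = src_y + (size_t)(x0 / col_w) * col_bytes;
        const uint8_t *col_uv = src_uv + (size_t)(x0 / col_w) * col_bytes;

        /* Luma: one 128-byte source row per output row segment. */
        for (uint32_t y = 0; y < height; y++)
            memcpy(dst_y + (size_t)y * width + x0,
                   col_y + (size_t)y * col_w, n);

        /* Chroma: interleaved CbCr, half as many rows, same byte width. */
        for (uint32_t y = 0; y < height / 2; y++)
            memcpy(dst_uv + (size_t)y * width + x0,
                   col_uv + (size_t)y * col_w, n);
    }
}
```

Iterating columns in the outer loop keeps the source reads sequential within a column. A real VAImage destination would likely need an explicit pitch parameter instead of the tight-packing assumption.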

Phase 3 — Baselines

Test fixtures (generated on higgs)

| Fixture | Size | Profile | Generator |
|---|---|---|---|
| bbb_640_main.mp4 | 640×360 | Main 8-bit | ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main |
| bbb_1280_main.mp4 | 1280×720 | Main 8-bit | same |
| bbb_1920_main.mp4 | 1920×1080 | Main 8-bit | same |

Captured 2026-05-17 evening on higgs

For each fixture, N=3 reps. Both the SW run (no hwaccel) and the kdirect run (ffmpeg -hwaccel drm -c:v hevc) write -frames:v 10 -f rawvideo -pix_fmt nv12; the values below are the first 16 hex chars of each output's sha256:

bbb_640_main  SW={9a81038065e9b7cd} HW={9a81038065e9b7cd}  → BIT-EXACT × N=3
bbb_1280_main SW={d3bb055655d6f195} HW={d3bb055655d6f195}  → BIT-EXACT × N=3
bbb_1920_main SW={0bc2bd6f693db039} HW={0bc2bd6f693db039}  → BIT-EXACT × N=3

HW engagement signal (per-run): Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8

This is the kdirect baseline. Phase 7 verification will compare libva output against these SHAs.

Strace-derived submission ordering (Phase 0 close addendum)

Captured in phase0_pi5_hevc.md. Briefly: standard V4L2-request stateless flow, both queues DMABUF, no SPS pre-seed dance needed (rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).

Phase 4 — Plan

Implementation steps (sequenced)

  1. request.h: extend request_data with the new fd pair + ext_sps flag, mirroring iter38/iter2 layout. (no behavior change yet)
  2. request.c:
    • find_decoder_device_by_driver("rpi-hevc-dec", ...) accepts new driver string.
    • Init -1 block extends to new fds.
    • Probe loop: if primary is rkvdec or hantro-vpu, also probe rpi-hevc-dec (third slot). On Pi 5 there's no rkvdec or hantro-vpu, so primary becomes rpi-hevc-dec and the alt-probes for the other two return absent (their fds stay -1).
    • request_device_kind_for_profile: when profile is VAProfileHEVCMain, prefer 'p' (rpi-hevc-dec) IF video_fd_rpi_hevc_dec >= 0, else fall through to 'r' (rkvdec). All other profiles stay routed as today.
    • request_switch_device_for_profile: add 'p' branch.
    • ext_sps probe runs on the new fd; result stored in has_hevc_ext_sps_rps_rpi_hevc_dec. Will be false (controls absent).
  3. video.c: add NC12 video_format entry. Mark it MPLANE-only (per Phase 0 strace). bytesperline/sizeimage formula encoded per kernel driver math.
  4. src/nv12_col128.c + .h (NEW): single-file primitive, nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width, height, src_stride2). CPU per-column row-memcpy loop; not NEON for Phase 1 (correctness first). Self-test in tests/test_nv12_col128_detile.c.
  5. image.c: branch in copy_surface_to_image. Gate: image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128. Calls the primitive. Existing NV12-linear path stays.
  6. meson.build + Makefile.am: source list updates.
  7. Build clean on higgs — first build target IS higgs (since iter40 only matters on Pi). Cross-build for ampere/fresnel is unaffected because they don't have rpi-hevc-dec — the new fd stays -1 and the per-driver routing falls through to existing rkvdec/hantro paths.
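The HEVC routing rule from step 2 can be sketched in isolation. struct probe_fds and hevc_device_kind are hypothetical stand-ins, not the backend's request_data or request_device_kind_for_profile, and returning 0 when neither HEVC-capable device is present is an assumption the plan does not spell out.

```c
/* Hypothetical stand-in for the per-driver video fds kept after the
 * multi-device probe; -1 means the driver was not found. */
struct probe_fds {
    int video_fd_rkvdec;
    int video_fd_hantro;
    int video_fd_rpi_hevc_dec;
};

/* Pick the device kind for VAProfileHEVCMain:
 *   'p' = rpi-hevc-dec (preferred when present)
 *   'r' = rkvdec
 *    0  = no HEVC-capable device (assumption, see lead-in) */
static char hevc_device_kind(const struct probe_fds *fds)
{
    if (fds->video_fd_rpi_hevc_dec >= 0)
        return 'p';
    if (fds->video_fd_rkvdec >= 0)
        return 'r';
    return 0;
}
```

With this shape a Pi-only probe result selects 'p', a Rockchip-only result selects 'r', and a system with both still prefers 'p'. The hantro fd plays no part in the HEVC decision.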

Verification gates (Phase 7 acceptance)

  • Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3, libdrm-dev 2.4.131).
  • Local-install the resulting .so to /usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so.
  • LIBVA_DRIVER_NAME=v4l2_request vainfo lists VAProfileHEVCMain.
  • For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3 recorded value).
  • lsof during libva decode shows /dev/video19 open.
  • Sibling regression check: fresnel phase7_iter39_test_rig equivalent still 5/5 PASS (no regression to existing routing).

Risks + mitigations

| Risk | Mitigation |
|---|---|
| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in tests/test_nv12_col128_detile.c with hand-crafted NC12 bytes + known linear output, before integration. |
| request_switch_device_for_profile falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
| Debian build env differs from Arch (see feedback_package_build_flags_unmask_bugs) | Build with explicit -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong flags to match Debian dpkg-buildflags. |
| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on driver_kind != 'p' in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; has_hevc_ext_sps_rps_rpi_hevc_dec = false naturally. |

Phase 5 review explicitly requested

Per CLAUDE.md global "Reviews are never skippable" + feedback_review_empirical_over_theoretical: this plan goes to a sonnet Plan-agent review. Specific review focus:

  • Routing correctness when 0/1/2/3 of the three drivers are present.
  • NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly? Did we miss UV stride considerations?
  • image.c gate predicate — does it exclude any legitimate NV12-linear case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
  • Cross-device regression scope (fresnel + ampere paths untouched?).

Empty-result review IS a green light; "we should have skipped it" is the prohibited move.