diff --git a/phase1_pi5_hevc.md b/phase1_pi5_hevc.md
new file mode 100644
index 0000000..056fba1
--- /dev/null
+++ b/phase1_pi5_hevc.md
@@ -0,0 +1,230 @@
+# Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)
+
+Per [[feedback_dev_process]], Phase 1 (goal), Phase 2 (situation analysis),
+Phase 3 (baselines), Phase 4 (plan) for adding rpi-hevc-dec as a third
+multi-device-probe slot in `libva-v4l2-request-fourier`. Phase 0 substrate and
+open-question answers live at `phase0_pi5_hevc.md`.
+
+## Phase 1 — Goal
+
+> **libva-v4l2-request-fourier on higgs** decodes HEVC Main 8-bit input
+> producing NV12 output **bit-exact vs kdirect** for three reference
+> fixtures (640×360, 1280×720, 1920×1080 — Main profile, libx265
+> ultrafast). HW path engagement verified via the kernel-driver lsof
+> signal (`/dev/video19` open) AND ffmpeg-vaapi engagement signal
+> (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).
+
+Measurable:
+
+| Criterion | Metric |
+|---|---|
+| C1 — vainfo enumeration | `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain : VAEntrypointVLD` |
+| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
+| C3 — HW engagement | `lsof` shows `/dev/video19` open by ffmpeg-vaapi during libva run |
+| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
+| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |
+
+Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API
+engagement testing, performance benchmarks. All later chapters.
+
+## Phase 2 — Situation Analysis
+
+### Backend architecture already in place
+
+- **Multi-device probe (iter38)**: at `VA_DRIVER_INIT` opens both
+  `rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`.
+  Stores per-driver fds (`video_fd_{rkvdec,hantro}`,
+  `media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the
+  "active" `driver_data->{video,media}_fd` per profile via
+  `request_switch_device_for_profile()` (request.c:426-478).
+- **Per-driver feature gating**: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}`
+  pair, with `h265_set_controls` consulting the per-fd flag. Established
+  by iter2 / Phase 5 review (request.h:99-100). This is the canonical
+  per-driver gating shape for iter40.
+- **HEVC ctrl population**: `h265_set_controls` populates the standard
+  `V4L2_CID_STATELESS_HEVC_*` set (h265.c). Probe-gates `EXT_SPS_*_RPS`
+  via the iter2 path — naturally dormant for rpi-hevc-dec since the
+  controls don't exist.
+- **Synthetic SPS pre-seed (iter25/26)**: needed for rkvdec to resolve
+  `image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec
+  does NOT need this — it accepts NC12 + explicit dims on
+  `S_FMT CAPTURE` directly. The pre-seed code path stays in place for rkvdec;
+  rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
+- **CAPTURE detile primitive**: `nv15_unpack_plane_to_p010()` (nv15.c)
+  is the template — backend already CPU-detiles when a
+  Pi-or-Rockchip-specific CAPTURE format meets a linear consumer
+  (VAImage NV12 / P010).
+- **Single-plane (S) vs multi-plane (M) handling**: hantro uses MPLANE,
+  rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for
+  BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38
+  already supports MPLANE handling for hantro; rpi reuses that.
+
+### Surface area to touch (audit)
+
+| File | What changes | Size |
+|------|--------------|------|
+| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
+| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend `request_switch_device_for_profile` 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
+| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
+| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
+| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
+| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |
+
+Total estimated diff: ~250 LoC per the table above (~140 LoC of backend
+changes plus ~110 LoC for the standalone primitive and its header doc).
+Roughly half the surface area of iter38; smaller than iter2.
+
+### What does NOT change
+
+- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by
+  driver_kind check that's already implicit in the rkvdec fd routing).
+- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored
+  GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
+- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec
+  unchanged. Same plumbing.
+- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is
+  HEVC-only; VP8 still goes through hantro on RK).
+- iter38 single-libva-session multi-codec semantics: extends from 5
+  codecs to 5+1 (HEVC reroutes on Pi).
+
+### NC12 / SAND128 tile geometry — locked contract
+
+From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
+(via [[github raspberrypi/linux rpi-6.12.y]]):
+
+```c
+case V4L2_PIX_FMT_NV12_COL128:
+    width = ALIGN(width, 128); /* Width rounds up to columns */
+    height = ALIGN(height, 8);
+    bytesperline = constrain2x(bytesperline, height * 3 / 2);
+    sizeimage = bytesperline * width;
+    break;
+```
+
+For 1280×720:
+- width = 1280 (already 128-aligned)
+- height = 720 (already 8-aligned)
+- bytesperline = 720 × 3/2 = **1080** (matches Phase 0 strace observation)
+- sizeimage = 1080 × 1280 = **1,382,400** (matches strace; equals the
+  linear NV12 byte count here because the width is already 128-aligned,
+  the height is already 8-aligned, and bytesperline carries no extra padding)
+
+**Geometry interpretation** (cross-verified against ffmpeg/Kynesim
+`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
+- Image is divided into `(width + 127) / 128` columns; each column is
+  **128 px wide × height px tall**.
+- Within a column: `128 × height` bytes of Y data, immediately followed
+  by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline`
+  bytes per column, where `bytesperline` is the column stride).
+- Across columns: column N starts at offset `N × stride1 × stride2`
+  where `stride1 = 128` (column width) and `stride2 = bytesperline`.
+- **Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline`**
+  for Y. The UV bytes use the same formula with `y/2` in place of `y`,
+  plus `128 × height` (the aligned height), because each column's CbCr
+  block starts right after that column's Y block.
+
+Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
+ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
+the single-stripe fast-path math; we don't import NEON ASM (CPU
+detile is the safe path for a first pass; SIMD is a later perf bump if needed).
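The geometry above can be sketched in plain C as follows. This is a minimal illustration, not the planned `nv12_col128_detile_to_nv12()` itself: the `sketch_*` names, the `dst_pitch` parameter, and the choice to pass a single source pointer are assumptions made for the example. It also assumes an even width and a `stride2` (column stride in rows) without padding beyond what the caller states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_COL_W 128u /* SAND128 column width in bytes (8-bit samples) */

/* Minimal sizeimage per the quoted formula, assuming no extra padding. */
static size_t sketch_col128_sizeimage(unsigned width, unsigned height)
{
    size_t w = (width + 127u) & ~127u;   /* ALIGN(width, 128) */
    size_t h = (height + 7u) & ~7u;      /* ALIGN(height, 8)  */
    return w * (h * 3 / 2);              /* bytesperline * width */
}

/* Copy one tiled plane (128-byte rows inside each column) to linear rows. */
static void sketch_detile_plane(uint8_t *dst, size_t dst_pitch,
                                const uint8_t *src, size_t col_stride,
                                unsigned width_bytes, unsigned rows)
{
    for (unsigned x = 0; x < width_bytes; x += SKETCH_COL_W) {
        const uint8_t *col = src + (size_t)(x / SKETCH_COL_W) * col_stride;
        unsigned left = width_bytes - x;
        unsigned n = left < SKETCH_COL_W ? left : SKETCH_COL_W;

        for (unsigned y = 0; y < rows; y++)
            memcpy(dst + (size_t)y * dst_pitch + x,
                   col + (size_t)y * SKETCH_COL_W, n);
    }
}

/* Hypothetical helper: detile Y and interleaved CbCr into linear NV12. */
static void sketch_col128_to_nv12(uint8_t *dst_y, uint8_t *dst_uv,
                                  size_t dst_pitch, const uint8_t *src,
                                  unsigned width, unsigned height,
                                  unsigned aligned_height, unsigned stride2)
{
    /* Each column holds Y rows followed by half as many CbCr rows. */
    assert(stride2 >= aligned_height + aligned_height / 2);

    size_t col_stride = (size_t)SKETCH_COL_W * stride2;
    const uint8_t *src_uv = src + (size_t)SKETCH_COL_W * aligned_height;

    sketch_detile_plane(dst_y, dst_pitch, src, col_stride, width, height);
    sketch_detile_plane(dst_uv, dst_pitch, src_uv, col_stride, width,
                        (height + 1) / 2);
}
```

Both planes share one column stride, so the only difference between the Y and CbCr passes is the base offset and the row count. That keeps the loop close to the per-row `memcpy` shape the plan describes.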
+
+## Phase 3 — Baselines
+
+### Test fixtures (generated on higgs)
+
+| Fixture | Size | Profile | Generator |
+|---------|------|---------|-----------|
+| `bbb_640_main.mp4` | 640×360 | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
+| `bbb_1280_main.mp4` | 1280×720 | Main 8-bit | same |
+| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |
+
+### Captured 2026-05-17 evening on higgs
+
+For each fixture, N=3 reps. Both SW (no hwaccel) and kdirect
+(`ffmpeg -hwaccel drm -c:v hevc`) → `-frames:v 10 -f rawvideo -pix_fmt nv12`;
+first 16 hex chars of each output's sha256:
+
+```
+bbb_640_main   SW={9a81038065e9b7cd}  HW={9a81038065e9b7cd}  → BIT-EXACT × N=3
+bbb_1280_main  SW={d3bb055655d6f195}  HW={d3bb055655d6f195}  → BIT-EXACT × N=3
+bbb_1920_main  SW={0bc2bd6f693db039}  HW={0bc2bd6f693db039}  → BIT-EXACT × N=3
+```
+
+HW engagement signal (per-run): `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`
+
+This is the kdirect baseline. Phase 7 verification will compare libva
+output against these SHAs.
+
+### Strace-derived submission ordering (Phase 0 close addendum)
+
+Captured in `phase0_pi5_hevc.md`. Briefly: standard V4L2-request
+stateless flow, both queues DMABUF, no SPS pre-seed dance needed
+(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).
+
+## Phase 4 — Plan
+
+### Implementation steps (sequenced)
+
+1. **`request.h`**: extend `request_data` with the new fd pair + ext_sps
+   flag, mirroring iter38/iter2 layout. (no behavior change yet)
+2. **`request.c`**:
+   - `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new
+     driver string.
+   - Init -1 block extends to new fds.
+   - Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe
+     `rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or
+     `hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes
+     for the other two return absent (their fds stay -1).
+   - `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`,
+     prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else
+     fall through to `'r'` (rkvdec). All other profiles stay routed as
+     today.
+   - `request_switch_device_for_profile`: add `'p'` branch.
+   - ext_sps probe runs on the new fd; result stored in
+     `has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
+3. **`video.c`**: add NC12 video_format entry. Mark it MPLANE-only (per
+   Phase 0 strace). bytesperline/sizeimage formula encoded per kernel
+   driver math.
+4. **`src/nv12_col128.c` + `.h`** (NEW): single-file primitive,
+   `nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width,
+   height, src_stride2)`. CPU per-column row-memcpy loop; not NEON
+   for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
+5. **`image.c`**: branch in `copy_surface_to_image`. Gate:
+   `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`.
+   Calls the primitive. Existing NV12-linear path stays.
+6. **`meson.build` + `Makefile.am`**: source list updates.
+7. **Build clean on higgs** — first build target IS higgs (since iter40
+   only matters on Pi). Cross-build for ampere/fresnel is unaffected
+   because they don't have rpi-hevc-dec — the new fd stays -1 and the
+   per-driver routing falls through to existing rkvdec/hantro paths.
+
+### Verification gates (Phase 7 acceptance)
+
+- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3,
+  libdrm-dev 2.4.131).
+- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
+- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
+- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3
+  recorded value).
+- `lsof` during libva decode shows `/dev/video19` open.
+- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent
+  still 5/5 PASS (no regression to existing routing).
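The HEVC routing rule from step 2 can be sketched as below. This is an illustration under assumptions, not the real `request_device_kind_for_profile`: the struct, the `sketch_*` names, and the `'\0'` return for "no HEVC-capable device" are hypothetical. Only the `'p'`-before-`'r'` preference and the `video_fd_* >= 0` presence test come from the plan.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-driver fd slots in request_data. */
struct sketch_fds {
    int video_fd_rkvdec;       /* -1 when rkvdec is absent        */
    int video_fd_hantro;       /* -1 when hantro-vpu is absent    */
    int video_fd_rpi_hevc_dec; /* -1 when rpi-hevc-dec is absent  */
};

/*
 * Device kind for VAProfileHEVCMain:
 *   'p'  rpi-hevc-dec, preferred whenever its fd is open
 *   'r'  rkvdec, the existing HEVC route
 *   '\0' assumption for this sketch: no HEVC-capable device was probed
 */
static char sketch_device_kind_for_hevc(const struct sketch_fds *fds)
{
    if (fds->video_fd_rpi_hevc_dec >= 0)
        return 'p';
    if (fds->video_fd_rkvdec >= 0)
        return 'r';
    return '\0';
}
```

Keeping the decision in one pure function of the fd slots makes it cheap to enumerate every presence combination in a unit test, independent of real hardware.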
+
+### Risks + mitigations
+
+| Risk | Mitigation |
+|------|-----------|
+| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in `tests/test_nv12_col128_detile.c` with hand-crafted NC12 bytes + known linear output, before integration. |
+| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
+| Debian build env differs from Arch — see [[feedback_package_build_flags_unmask_bugs]] | Build with explicit `-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong` flags to match Debian dpkg-buildflags. |
+| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on `driver_kind != 'p'` in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
+| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; `has_hevc_ext_sps_rps_rpi_hevc_dec` = false naturally. |
+
+### Phase 5 review explicitly requested
+
+Per CLAUDE.md global "Reviews are never skippable" + [[feedback_review_empirical_over_theoretical]]:
+this plan goes to a sonnet Plan-agent review. Specific review focus:
+- Routing correctness when 0/1/2/3 of the three drivers are present.
+- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly?
+  Did we miss UV stride considerations?
+- `image.c` gate predicate — does it exclude any legitimate NV12-linear
+  case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
+- Cross-device regression scope (fresnel + ampere paths untouched?).
+
+Empty-result review IS a green light; "we should have skipped it" is the
+prohibited move.
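For the `image.c` gate question in the review focus list, the predicate can be sketched in isolation. The `SKETCH_*` macros and the function name are hypothetical stand-ins so the example compiles without `va/va.h` or `linux/videodev2.h`. The fourcc values are assumed to follow the usual little-endian four-character packing, with `'N','C','1','2'` for NC12.

```c
#include <stdbool.h>
#include <stdint.h>

/* Little-endian four-character code, packed like v4l2_fourcc()/VA_FOURCC(). */
#define SKETCH_FOURCC(a, b, c, d)                            \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) |                  \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define SKETCH_VA_FOURCC_NV12   SKETCH_FOURCC('N', 'V', '1', '2')
#define SKETCH_VA_FOURCC_P010   SKETCH_FOURCC('P', '0', '1', '0')
#define SKETCH_V4L2_NV12        SKETCH_FOURCC('N', 'V', '1', '2') /* linear */
#define SKETCH_V4L2_NV12_COL128 SKETCH_FOURCC('N', 'C', '1', '2') /* NC12   */

/*
 * True only when a linear NV12 VAImage is fed from an NC12 CAPTURE buffer,
 * i.e. the one case where the detile primitive has to run.
 */
static bool sketch_needs_col128_detile(uint32_t image_fourcc,
                                       uint32_t capture_v4l2_format)
{
    return image_fourcc == SKETCH_VA_FOURCC_NV12 &&
           capture_v4l2_format == SKETCH_V4L2_NV12_COL128;
}
```

Because the second operand checks the CAPTURE format rather than the image format, a linear NV12 CAPTURE buffer on another platform never matches and keeps using the existing NV12-linear copy path.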