# Phase 1+2+3+4 — Pi 5 HEVC chapter (iter40)

Per [[feedback_dev_process]], this chapter covers Phase 1 (goal), Phase 2
(situation analysis), Phase 3 (baselines), and Phase 4 (plan) for adding
rpi-hevc-dec as a third multi-device-probe slot in
`libva-v4l2-request-fourier`. Phase 0 substrate and open-question answers
live at `phase0_pi5_hevc.md`.

## Phase 1 — Goal

> **libva-v4l2-request-fourier on higgs** decodes HEVC Main 8-bit input
> producing NV12 output **bit-exact vs kdirect** for three reference
> fixtures (640×360, 1280×720, 1920×1080; Main profile, libx265
> ultrafast). HW path engagement is verified via the kernel-driver lsof
> signal (`/dev/video19` open) AND the ffmpeg-vaapi engagement signal
> (`Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19`).

Measurable:

| Criterion | Metric |
|---|---|
| C1 — vainfo enumeration | `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain : VAEntrypointVLD` |
| C2 — bit-exact decode | sha256 of libva NV12 output == sha256 of kdirect NV12 output, per fixture, N=1 |
| C3 — HW engagement | `lsof` shows `/dev/video19` open by ffmpeg-vaapi during libva run |
| C4 — Stability under N=3 | C2 holds at N=3 repeated runs (deterministic) |
| C5 — Sibling baseline preserved | fresnel iter38 5/5 still PASS post-iter40 (no regression to rkvdec/hantro path) |

Out of scope this iter: Main10 (10-bit / NC30), VP9, AV1, Firefox VA-API
engagement testing, performance benchmarks. All are deferred to later
chapters.

## Phase 2 — Situation Analysis

### Backend architecture already in place

- **Multi-device probe (iter38)**: at `VA_DRIVER_INIT` the backend opens
  both `rkvdec` + `hantro-vpu` via `find_decoder_device_by_driver(name)`
  and stores per-driver fds (`video_fd_{rkvdec,hantro}`,
  `media_fd_{rkvdec,hantro}`). `RequestCreateConfig` retargets the
  "active" `driver_data->{video,media}_fd` per profile via
  `request_switch_device_for_profile()` (request.c:426-478).
- **Per-driver feature gating**: `request_data->has_hevc_ext_sps_rps_{rkvdec,hantro}`
  pair, with `h265_set_controls` consulting the per-fd flag. Established
  by iter2 / Phase 5 review (request.h:99-100). This is the canonical
  per-driver gating shape for iter40.
- **HEVC ctrl population**: `h265_set_controls` populates the standard
  `V4L2_CID_STATELESS_HEVC_*` set (h265.c). It probe-gates `EXT_SPS_*_RPS`
  via the iter2 path, which stays dormant on rpi-hevc-dec since that
  driver does not expose those controls.
- **Synthetic SPS pre-seed (iter25/26)**: needed for rkvdec to resolve
  `image_fmt` before CAPTURE alloc. Phase 0 strace shows rpi-hevc-dec
  does NOT need this: it accepts NC12 + explicit dims on `S_FMT
  CAPTURE` directly. The pre-seed code path stays in place for rkvdec;
  rpi-hevc-dec just doesn't trigger it (gate on driver_kind).
- **CAPTURE detile primitive**: `nv15_unpack_plane_to_p010()` (nv15.c)
  is the template. The backend already CPU-detiles when a
  Pi-or-Rockchip-specific CAPTURE format meets a linear consumer
  (VAImage NV12 / P010).
- **Single-plane (S) vs multi-plane (M) handling**: hantro uses MPLANE,
  rkvdec uses both depending on codec. rpi-hevc-dec exposes MPLANE for
  BOTH OUTPUT (HEVC_SLICE) and CAPTURE (NC12) per the strace. iter38
  already supports MPLANE handling for hantro; rpi reuses that.

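The per-driver gating shape described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the backend's code: `sketch_request_data`, `sketch_init`, and `sketch_has_ext_sps_for_active_fd` are hypothetical names, and the struct is a trimmed stand-in that only borrows the `video_fd_*` and `has_hevc_ext_sps_rps_*` field naming from this chapter, including the third slot planned for iter40.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the backend's request_data. */
struct sketch_request_data {
    int video_fd;                       /* currently active device */
    int video_fd_rkvdec;
    int video_fd_hantro;
    int video_fd_rpi_hevc_dec;          /* third slot planned for iter40 */
    bool has_hevc_ext_sps_rps_rkvdec;
    bool has_hevc_ext_sps_rps_hantro;
    bool has_hevc_ext_sps_rps_rpi_hevc_dec;
};

/* Mark every slot as absent before probing. */
static void sketch_init(struct sketch_request_data *d)
{
    assert(d);
    d->video_fd = -1;
    d->video_fd_rkvdec = -1;
    d->video_fd_hantro = -1;
    d->video_fd_rpi_hevc_dec = -1;
    d->has_hevc_ext_sps_rps_rkvdec = false;
    d->has_hevc_ext_sps_rps_hantro = false;
    d->has_hevc_ext_sps_rps_rpi_hevc_dec = false;
}

/* Return the EXT_SPS flag that belongs to whichever device is active. */
static bool sketch_has_ext_sps_for_active_fd(const struct sketch_request_data *d)
{
    assert(d);
    if (d->video_fd < 0)
        return false;                   /* nothing active: feature is off */
    if (d->video_fd == d->video_fd_rkvdec)
        return d->has_hevc_ext_sps_rps_rkvdec;
    if (d->video_fd == d->video_fd_hantro)
        return d->has_hevc_ext_sps_rps_hantro;
    if (d->video_fd == d->video_fd_rpi_hevc_dec)
        return d->has_hevc_ext_sps_rps_rpi_hevc_dec;
    return false;                       /* unknown fd: stay conservative */
}
```

The explicit `video_fd < 0` guard matters in this sketch because absent slots are also `-1`; without it an inactive session would match the first absent slot.
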

### Surface area to touch (audit)

| File | What changes | Size |
|------|--------------|------|
| `src/request.h` | Add `video_fd_rpi_hevc_dec`, `media_fd_rpi_hevc_dec`, `has_hevc_ext_sps_rps_rpi_hevc_dec` (mirror iter38 + iter2 layout) | ~10 lines |
| `src/request.c` | (a) Extend init -1 block to cover new fds. (b) Recognize `rpi-hevc-dec` as a 3rd primary/alt driver string in the probe loop. (c) Extend `request_device_kind_for_profile` so HEVC→'p' when rpi-hevc-dec is present, else 'r'. (d) Extend `request_switch_device_for_profile` 'p' branch. (e) Probe HEVC ext_sps on the new fd (will be false, mirrors hantro entry). | ~80 lines |
| `src/video.c` | Add `V4L2_PIX_FMT_NV12_COL128` (NC12) `video_format` entry: 4:2:0, planes=1, alignment via dedicated bytesperline/sizeimage formula. NOT marked linear. | ~20 lines |
| `src/nv12_col128.c` (NEW) | `nv12_col128_detile_to_nv12()`: Y plane + UV plane detile primitive. Adapted from ffmpeg/Kynesim `av_rpi_sand_to_planar_y8` core math. Header doc traces back to videodev2.h docstring + raspberrypi/linux `hevc_dec/hevc_d_video.c` size formula. | ~80 lines + 30-line header |
| `src/image.c` | Add NC12 → NV12 branch in `copy_surface_to_image`, gated on `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128` (sibling to existing NV15→P010 branch). | ~25 lines |
| `src/meson.build` + `src/Makefile.am` | List `nv12_col128.c`/`.h` in sources | 2 lines |

Total estimated diff: ~250 LoC backend + ~100 LoC standalone primitive.
Roughly half the surface area of iter38; smaller than iter2.

### What does NOT change

- iter25/26 SPS pre-seed: stays on rkvdec path only (gated by
  driver_kind check that's already implicit in the rkvdec fd routing).
- iter2 EXT_SPS plumbing: probe-gated off on rpi-hevc-dec; vendored
  GStreamer parser unused. Confirmed via the EINVAL on ctrl 0xa97.
- iter31 α-29 slice_params st_rps_bits: APPLIES to rpi-hevc-dec
  unchanged. Same plumbing.
- iter33 VP8 hantro start-code prepend: not relevant (rpi-hevc-dec is
  HEVC-only; VP8 still goes through hantro on RK).
- iter38 single-libva-session multi-codec semantics: extends from 5
  codecs to 5+1 (HEVC reroutes on Pi).

### NC12 / SAND128 tile geometry — locked contract

From kernel driver `drivers/media/platform/raspberrypi/hevc_dec/hevc_d_video.c`
(via [[github raspberrypi/linux rpi-6.12.y]]):

```c
case V4L2_PIX_FMT_NV12_COL128:
	width = ALIGN(width, 128);  /* Width rounds up to columns */
	height = ALIGN(height, 8);
	bytesperline = constrain2x(bytesperline, height * 3 / 2);
	sizeimage = bytesperline * width;
	break;
```

For 1280×720:
- width = 1280 (already 128-aligned)
- height = 720 (already 8-aligned)
- bytesperline = 720 × 3/2 = **1080** (matches Phase 0 strace observation)
- sizeimage = 1080 × 1280 = **1,382,400** (matches strace; equals the linear NV12 byte count here because neither dimension needs padding)

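The arithmetic above can be checked with a small helper. This is a sketch under one stated assumption: the driver does not enlarge `bytesperline` beyond `height * 3 / 2` (the effect of `constrain2x` on a caller-supplied value is not modelled). `nc12_layout`, `nc12_min_layout`, and `SKETCH_ALIGN` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next multiple of a (a must be a power of two). */
#define SKETCH_ALIGN(v, a) (((v) + (a) - 1) & ~((uint32_t)(a) - 1))

struct nc12_layout {
    uint32_t width;        /* padded to whole 128-px columns */
    uint32_t height;       /* padded to a multiple of 8 rows */
    uint32_t bytesperline; /* column stride in rows: Y rows + CbCr rows */
    uint32_t sizeimage;    /* total CAPTURE buffer size in bytes */
};

/* Hypothetical helper: minimum NC12 layout for a given picture size.
 * ASSUMPTION: bytesperline is exactly height * 3 / 2, i.e. the driver's
 * constrain2x() step leaves the minimum value untouched. */
static struct nc12_layout nc12_min_layout(uint32_t width, uint32_t height)
{
    struct nc12_layout l;

    assert(width > 0 && height > 0);
    l.width = SKETCH_ALIGN(width, 128u);
    l.height = SKETCH_ALIGN(height, 8u);
    l.bytesperline = l.height * 3 / 2;
    l.sizeimage = l.bytesperline * l.width;
    return l;
}
```

Under that assumption, `nc12_min_layout(1280, 720)` reproduces the 1080 / 1,382,400 pair listed above. Values for the other two fixtures follow from the same formula but were not cross-checked against a strace in this chapter.
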

**Geometry interpretation** (cross-verified against ffmpeg/Kynesim
`rpi_sand_fn_pw.h` `av_rpi_sand_to_planar_y8`):
- Image is divided into `(width + 127) / 128` columns; each column is
  **128 px wide × height px tall**.
- Within a column: `128 × height` bytes of Y data, immediately followed
  by `128 × height/2` bytes of interleaved CbCr (so 128 × `bytesperline`
  bytes per column, where `bytesperline` is the column stride).
- Across columns: column N starts at offset `N × stride1 × stride2`
  where `stride1 = 128` (column width) and `stride2 = bytesperline`.
- **Pixel (x, y) byte offset = `(x & 127) + y × 128 + (x & ~127) × bytesperline`**
  for Y. UV uses the same formula with `y/2` in place of `y`, measured
  from a chroma base `128 × height` bytes into the buffer (the start of
  column 0's CbCr rows, consistent with the per-column layout above).

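The offset formula can be written out as two small functions. This is a sketch, not the Kynesim code: `nc12_luma_offset`, `nc12_cb_offset`, and `COL_W` are hypothetical names, and the chroma helper assumes each column's CbCr rows start immediately after its `height` luma rows, as the per-column layout in the list above describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COL_W 128u  /* column width in bytes (stride1) */

/* Byte offset of luma sample (x, y). stride2 is bytesperline: the number
 * of 128-byte rows (Y rows plus CbCr rows) stored per column. */
static size_t nc12_luma_offset(uint32_t x, uint32_t y, uint32_t stride2)
{
    return (size_t)(x & (COL_W - 1))             /* byte within the row   */
         + (size_t)y * COL_W                     /* row within the column */
         + (size_t)(x & ~(COL_W - 1)) * stride2; /* column start          */
}

/* Byte offset of the Cb byte covering pixel (x, y); Cr follows at +1.
 * ASSUMPTION: the chroma rows of a column begin 128 * height bytes after
 * the start of that column. */
static size_t nc12_cb_offset(uint32_t x, uint32_t y,
                             uint32_t height, uint32_t stride2)
{
    uint32_t xc = x & ~1u;  /* CbCr pairs are 2 bytes per 2 pixels */

    assert(height <= stride2);
    return (size_t)COL_W * height + nc12_luma_offset(xc, y / 2, stride2);
}
```

For the 1280×720 case (`stride2 = 1080`), pixel (128, 0) is the first byte of column 1 and lands at offset 128 × 1080 = 138,240.
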

Reference for the detile loop: `av_rpi_sand_to_planar_y8` (Kynesim
ffmpeg, `libavutil/rpi_sand_fn_pw.h` with PW=1). Our primitive copies
the single-stripe fast-path math; we don't import NEON ASM (CPU
detile is the safe path for Phase 1; SIMD is a Phase 2 perf bump if needed).

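A per-column row-memcpy detile along these lines might look like the sketch below. It is not the Kynesim implementation and not the planned `nv12_col128.c`: the `sketch_` names are hypothetical, the destination is assumed to be tightly packed NV12 (stride equal to `width`), and the caller is assumed to pass `src_uv = src_y + 128 * height`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one plane out of 128-byte-wide columns into a linear plane.
 * width_bytes is the linear row length; rows is the plane height. */
static void sketch_col128_detile_plane(uint8_t *dst, uint32_t dst_stride,
                                       const uint8_t *src, uint32_t width_bytes,
                                       uint32_t rows, uint32_t src_stride2)
{
    for (uint32_t x = 0; x < width_bytes; x += 128) {
        /* The last column may be only partly visible. */
        uint32_t n = width_bytes - x < 128 ? width_bytes - x : 128;
        /* x is a multiple of 128, so x * stride2 == column * 128 * stride2. */
        const uint8_t *col = src + (size_t)x * src_stride2;

        for (uint32_t y = 0; y < rows; y++)
            memcpy(dst + (size_t)y * dst_stride + x,
                   col + (size_t)y * 128, n);
    }
}

/* Hypothetical stand-in for the planned primitive. ASSUMPTIONS: dst planes
 * are tightly packed (stride == width) and src_uv == src_y + 128 * height. */
static void sketch_nv12_col128_detile_to_nv12(uint8_t *dst_y, uint8_t *dst_uv,
                                              const uint8_t *src_y,
                                              const uint8_t *src_uv,
                                              uint32_t width, uint32_t height,
                                              uint32_t src_stride2)
{
    assert(dst_y && dst_uv && src_y && src_uv);
    assert(width % 2 == 0 && height % 2 == 0);

    /* Y: `height` rows of `width` bytes. */
    sketch_col128_detile_plane(dst_y, width, src_y, width, height, src_stride2);
    /* CbCr: height/2 rows, still `width` bytes per row (interleaved pairs). */
    sketch_col128_detile_plane(dst_uv, width, src_uv, width, height / 2,
                               src_stride2);
}
```

Both planes share one helper because an interleaved CbCr row occupies the same number of bytes as a luma row; only the row count and the source base pointer differ.
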

## Phase 3 — Baselines

### Test fixtures (generated on higgs)

| Fixture | Size | Profile | Generator |
|---------|------|---------|-----------|
| `bbb_640_main.mp4` | 640×360 | Main 8-bit | `ffmpeg -f lavfi -i testsrc=duration=2 -pix_fmt yuv420p -c:v libx265 -preset ultrafast -profile:v main` |
| `bbb_1280_main.mp4` | 1280×720 | Main 8-bit | same |
| `bbb_1920_main.mp4` | 1920×1080 | Main 8-bit | same |

### Captured 2026-05-17 evening on higgs

For each fixture, N=3 reps. Both the SW run (no hwaccel) and the kdirect
run (`ffmpeg -hwaccel drm -c:v hevc`) write `-frames:v 10 -f rawvideo -pix_fmt nv12`;
the values below are the first 16 hex chars of each output's sha256:

```
bbb_640_main   SW={9a81038065e9b7cd}  HW={9a81038065e9b7cd}  → BIT-EXACT × N=3
bbb_1280_main  SW={d3bb055655d6f195}  HW={d3bb055655d6f195}  → BIT-EXACT × N=3
bbb_1920_main  SW={0bc2bd6f693db039}  HW={0bc2bd6f693db039}  → BIT-EXACT × N=3
```

HW engagement signal (per-run): `Hwaccel V4L2 HEVC stateless V4; devices: /dev/media1,/dev/video19; buffers: src DMABuf, dst DMABuf; swfmt=rpi4_8`

This is the kdirect baseline. Phase 7 verification will compare libva
output against these SHAs.

### Strace-derived submission ordering (Phase 0 close addendum)

Captured in `phase0_pi5_hevc.md`. Briefly: standard V4L2-request
stateless flow, both queues DMABUF, no SPS pre-seed dance needed
(rpi-hevc-dec accepts NC12 + dims directly on CAPTURE S_FMT).

## Phase 4 — Plan

### Implementation steps (sequenced)

1. **`request.h`**: extend `request_data` with the new fd pair + ext_sps
   flag, mirroring iter38/iter2 layout. (no behavior change yet)
2. **`request.c`**:
   - `find_decoder_device_by_driver("rpi-hevc-dec", ...)` accepts new
     driver string.
   - Init -1 block extends to new fds.
   - Probe loop: if primary is `rkvdec` or `hantro-vpu`, also probe
     `rpi-hevc-dec` (third slot). On Pi 5 there's no `rkvdec` or
     `hantro-vpu`, so primary becomes `rpi-hevc-dec` and the alt-probes
     for the other two return absent (their fds stay -1).
   - `request_device_kind_for_profile`: when profile is `VAProfileHEVCMain`,
     prefer `'p'` (rpi-hevc-dec) IF `video_fd_rpi_hevc_dec >= 0`, else
     fall through to `'r'` (rkvdec). All other profiles stay routed as
     today.
   - `request_switch_device_for_profile`: add `'p'` branch.
   - ext_sps probe runs on the new fd; result stored in
     `has_hevc_ext_sps_rps_rpi_hevc_dec`. Will be false (controls absent).
3. **`video.c`**: add NC12 video_format entry. Mark it MPLANE-only (per
   Phase 0 strace). bytesperline/sizeimage formula encoded per kernel
   driver math.
4. **`src/nv12_col128.c` + `.h`** (NEW): single-file primitive,
   `nv12_col128_detile_to_nv12(dst_y, dst_uv, src_y, src_uv, width,
   height, src_stride2)`. CPU per-column row-memcpy loop; not NEON
   for Phase 1 (correctness first). Self-test in `tests/test_nv12_col128_detile.c`.
5. **`image.c`**: branch in `copy_surface_to_image`. Gate:
   `image->format.fourcc == VA_FOURCC_NV12 && video_format->v4l2_format == V4L2_PIX_FMT_NV12_COL128`.
   Calls the primitive. Existing NV12-linear path stays.
6. **`meson.build` + `Makefile.am`**: source list updates.
7. **Build clean on higgs**: the first build target IS higgs (since iter40
   only matters on Pi). Cross-build for ampere/fresnel is unaffected
   because they don't have rpi-hevc-dec: the new fd stays -1 and the
   per-driver routing falls through to existing rkvdec/hantro paths.

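The HEVC routing rule from step 2 can be sketched as a pure function over the probed fds, which makes the 0/1/2/3-drivers-present cases easy to enumerate. This is a hypothetical illustration: `sketch_fds` and `sketch_hevc_device_kind` are not names from the backend, and returning `0` when neither HEVC-capable device is present is an assumption of this sketch rather than something the plan specifies.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-driver fd slots in request_data. */
struct sketch_fds {
    int video_fd_rkvdec;
    int video_fd_hantro;
    int video_fd_rpi_hevc_dec;
};

/* Pick the device kind for VAProfileHEVCMain.
 * Returns 'p' (rpi-hevc-dec) when that slot was probed, else 'r' (rkvdec).
 * ASSUMPTION: returns 0 when neither slot is present; hantro is never
 * chosen for HEVC in this sketch. */
static char sketch_hevc_device_kind(const struct sketch_fds *fds)
{
    assert(fds);
    if (fds->video_fd_rpi_hevc_dec >= 0)
        return 'p';   /* preferred whenever present */
    if (fds->video_fd_rkvdec >= 0)
        return 'r';   /* existing Rockchip path */
    return 0;         /* no HEVC-capable device probed */
}
```

With fds of `{rkvdec = 3, hantro = 4, rpi = -1}` (a fresnel-like layout) the sketch yields `'r'`; with `{-1, -1, 5}` (a higgs-like layout) it yields `'p'`; with both HEVC slots present it still prefers `'p'`.
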

### Verification gates (Phase 7 acceptance)

- Build cleanly on higgs (Debian 13 trixie, libva-dev 2.22.0-3,
  libdrm-dev 2.4.131).
- Local-install the resulting `.so` to `/usr/lib/aarch64-linux-gnu/dri/v4l2_request_drv_video.so`.
- `LIBVA_DRIVER_NAME=v4l2_request vainfo` lists `VAProfileHEVCMain`.
- For each Phase 3 fixture: libva output SHA == kdirect SHA (the Phase 3
  recorded value).
- `lsof` during libva decode shows `/dev/video19` open.
- Sibling regression check: fresnel `phase7_iter39_test_rig` equivalent
  still 5/5 PASS (no regression to existing routing).

### Risks + mitigations

| Risk | Mitigation |
|------|-----------|
| NC12 detile math wrong → libva ≠ kdirect | Tight unit test in `tests/test_nv12_col128_detile.c` with hand-crafted NC12 bytes + known linear output, before integration. |
| `request_switch_device_for_profile` falls through wrong path on systems with BOTH rkvdec AND rpi-hevc-dec | Prefer rpi-hevc-dec for HEVC when present. Explicit comment in switch. Test on fresnel = no rpi → falls to 'r'; test on higgs = no rkvdec → falls to 'p'. |
| Debian build env differs from Arch (see [[feedback_package_build_flags_unmask_bugs]]) | Build with explicit `-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong` flags to match Debian dpkg-buildflags. |
| Synthetic SPS pre-seed accidentally fires on rpi-hevc-dec | Gate on `driver_kind != 'p'` in the pre-seed call site. Verify via strace: pre-seed ioctl pattern absent. |
| iter2 EXT_SPS path accidentally engages on rpi | Already probe-gated; `has_hevc_ext_sps_rps_rpi_hevc_dec` = false naturally. |

### Phase 5 review explicitly requested

Per CLAUDE.md global "Reviews are never skippable" + [[feedback_review_empirical_over_theoretical]]:
this plan goes to a sonnet Plan-agent review. Specific review focus:
- Routing correctness when 0/1/2/3 of the three drivers are present.
- NC12 geometry: did we copy ffmpeg's per-row memcpy math correctly?
  Did we miss UV stride considerations?
- `image.c` gate predicate: does it exclude any legitimate NV12-linear
  case on Pi? (No: rpi only exposes NC12/NC30 CAPTURE, no plain NV12.)
- Cross-device regression scope (fresnel + ampere paths untouched?).

Empty-result review IS a green light; "we should have skipped it" is the
prohibited move.