iter40: Pi 5 HEVC chapter — backend integration lands, bit-exact pending

Phase 6 implementation. Backend builds clean on higgs (Debian 13
trixie, aarch64), vainfo lists VAProfileHEVCMain via rpi-hevc-dec,
multi-device probe finds /dev/video19 + /dev/media1, CreateContext
+ S_FMT + REQBUFS + STREAMON all succeed.

Phase 7 partial: infrastructure works, 10 frames flow through the
pipeline (correct byte counts produced — 13824000 for 1280x720 x 10
NV12 frames). But every DQBUF CAPTURE returns V4L2_BUF_FLAG_ERROR
so output content is wrong (libva sha != kdirect sha). The decode
itself is failing on the rpi-hevc-dec side despite all ctrl
submissions returning success.
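
The byte count quoted above follows from the linear NV12 frame size. A minimal sketch of that arithmetic, assuming 8-bit 4:2:0 and even dimensions; the helper names are illustrative and not part of the backend:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in one linear 8-bit NV12 frame, i.e. a
 * full-resolution Y plane plus a half-height interleaved CbCr plane. */
static size_t nv12_frame_bytes(unsigned int width, unsigned int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return (size_t)width * height * 3 / 2;
}

/* Hypothetical helper: total bytes for a run of identically sized frames. */
static size_t nv12_stream_bytes(unsigned int width, unsigned int height,
                                unsigned int frames)
{
    return nv12_frame_bytes(width, height) * frames;
}
```

nv12_stream_bytes(1280, 720, 10) gives 13824000, matching the figure above. A matching byte count only confirms buffer sizing; it says nothing about whether the decoded content is right.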

Code changes:
- request.h: video_fd_rpi_hevc_dec / media_fd_rpi_hevc_dec slots +
  has_hevc_ext_sps_rps_rpi_hevc_dec flag (mirrors iter38 + iter2
  pair-of-flags pattern, naturally false on Pi).
- request.c: known_decoder_drivers gains rpi-hevc-dec; primary-driver
  probe gets an else-if branch setting the new fds (Phase 5 F3);
  request_switch_device_for_profile prefers 'p' for HEVC when
  rpi-hevc-dec present.
- context.c: per-fd want_pixfmt (NC12 on Pi), capture_pixelformat
  taken from video_format slot (not hardcoded NV12/NV15);
  synthetic-SPS pre-seed gated off for Pi (Phase 5 F6);
  destination_sizes uses nv12_col128_uv_plane_offset for NC12 SAND
  layout (Phase 5 F2);
  per-driver HEVC_START_CODE (NONE on Pi, ANNEX_B on RK);
  per-driver context_object->h264_start_code (skip prepend on Pi).
- video.c: NV12_COL128 video_format entry (8-bit SAND, single
  buffer, 2 planes, NV12 drm_format with MOD_NONE so detile branch
  fires rather than tiled_to_planar).
- nv12_col128.c/.h: detile primitive (Y + UV per-plane, kernel
  hevc_d_video.c bytesperline formula + ffmpeg/Kynesim per-pixel
  offset). UV plane offset = 128 * ALIGN(h, 8) — within-column
  (SAND interleaves Y+UV per column, NOT plane-concatenated;
  earlier wrong formula caught by Phase 7 SEGV).
- image.c: #ifdef __arm__ extended to __arm__ || __aarch64__
  (Phase 5 F1 — guard was killing detile path on all aarch64
  hosts including fresnel iter39 NV15 path, masked because 10-bit
  never exercised); RequestCreateImage NC12 → NV12 stride override
  (linear width, not column-stride); copy_surface_to_image NC12
  detile branch (gates on fourcc + v4l2_format).
- nv15.h: fallback V4L2_PIX_FMT_NV15 define (Debian 13 headers
  omit it though they have NC12).
- nv12_col128.h: fallback V4L2_PIX_FMT_NV12_COL128 +
  V4L2_PIX_FMT_NV12_10_COL128 (Arch / mainline pre-Pi headers).
- tests/test_nv12_col128_detile.c: hand-crafted-bytes unit test;
  passes (8 cases: Y + UV for 4 widths incl. 1366 misaligned;
  UV-offset helper).
- meson.build: nv12_col128 sources listed.
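
The within-column SAND layout from the nv12_col128.c/.h item can be sketched as plain offset arithmetic. This is an illustration only: the sand_* names are hypothetical, and the column stride of ALIGN(h, 8) * 3/2 rows is taken from the comments in nv12_col128.c rather than re-checked against the kernel here.

```c
#include <assert.h>
#include <stddef.h>

#define SAND_COL_W 128u /* tile column width in bytes */

/* Rows per column (Y rows followed by UV rows); assumes the driver
 * aligns the height up to a multiple of 8. */
static unsigned int sand_col_stride(unsigned int height)
{
    unsigned int aligned_h = (height + 7u) & ~7u;
    return aligned_h * 3u / 2u;
}

/* Byte offset of Y sample (x, y) from the start of the buffer. */
static size_t sand_y_offset(unsigned int x, unsigned int y,
                            unsigned int col_stride)
{
    assert(y < col_stride);
    return (size_t)(x / SAND_COL_W) * SAND_COL_W * col_stride
         + (size_t)y * SAND_COL_W
         + x % SAND_COL_W;
}

/* Byte offset of interleaved CbCr byte (x, uv_y): the UV base sits
 * 128 * aligned_h bytes into the first column, then the same column
 * arithmetic as for Y applies. */
static size_t sand_uv_offset(unsigned int x, unsigned int uv_y,
                             unsigned int height, unsigned int col_stride)
{
    unsigned int aligned_h = (height + 7u) & ~7u;
    return (size_t)SAND_COL_W * aligned_h
         + sand_y_offset(x, uv_y, col_stride);
}

/* Total buffer size: every column holds col_stride rows of 128 bytes. */
static size_t sand_sizeimage(unsigned int width, unsigned int height)
{
    unsigned int num_cols = (width + SAND_COL_W - 1u) / SAND_COL_W;
    return (size_t)num_cols * SAND_COL_W * sand_col_stride(height);
}
```

For 1280x720 the last UV byte (x = 1279, uv_y = 359) lands at sand_sizeimage(1280, 720) - 1, so the within-column offset stays inside the buffer. Starting the UV traversal at num_columns * 128 * aligned_h instead would run well past the end, which is consistent with the SEGV noted above.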

Phase 7 status: not yet bit-exact. Remaining diagnosis: per-frame
S_EXT_CTRLS payload diff vs kdirect (kdirect sends 4 ctrls: SPS, PPS,
decode_params, slice_array; ours sends 5, adding scaling_matrix;
field ordering also differs). Likely the slice_array contents need
per-driver handling for rpi-hevc-dec's expected layout. Left for a
follow-up; beyond in-session reach.

iter38 5/5 baseline on fresnel + ampere should be unaffected (new
fd stays -1 on non-Pi hosts; all gates either short-circuit on
fd-not-present or no-op).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:17:14 +00:00
parent f1be489c75
commit 3ffa9d0d17
10 changed files with 706 additions and 31 deletions
+114
@@ -0,0 +1,114 @@
/*
 * V4L2_PIX_FMT_NV12_COL128 → linear NV12 detile primitive. Pi 5 / CM5
 * rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
 *
 * Math derived from kernel hevc_d_video.c (size formula) +
 * ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). Each
 * output row is copied with one memcpy per 128-byte tile column it
 * crosses; the last chunk is shorter when the width is not a multiple
 * of 128.
 *
 * No NEON / SIMD here: correctness first. Each output row generates
 * ceil(width / 128) memcpys of up to 128 bytes; for 1920x1080 that's
 * 15 per row, roughly 16000 for Y plus 8000 for UV per frame, fine
 * for Phase 1 PoC.
 */

#include "nv12_col128.h"

#include <string.h>

/*
 * Tile column width in bytes. The 'COL128' name embeds this; if it ever
 * varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
 */
#define NC12_TILE_W 128

/*
 * Common Y / UV plane detile: the layout is identical (single-byte per
 * pixel, column-major 128-wide tiles). The only thing that varies is
 * what plane the caller passes in. width here is plane width in bytes
 * (= image width for both Y and CbCr-interleaved NV12 UV); height is
 * plane height in pixels (image height for Y, image height / 2 for UV).
 */
static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
				     const uint8_t *src,
				     unsigned int src_col_stride,
				     unsigned int width, unsigned int height)
{
	unsigned int y, x;

	for (y = 0; y < height; y++) {
		uint8_t *drow = dst + y * dst_stride;

		x = 0;
		while (x < width) {
			unsigned int col = x / NC12_TILE_W;
			unsigned int in_col = x % NC12_TILE_W;
			unsigned int n = NC12_TILE_W - in_col;

			if (n > width - x)
				n = width - x;
			/*
			 * Source byte = base + col*128*col_stride + y*128 + in_col
			 * Copy n contiguous bytes (all within this tile column,
			 * since n is capped at the remaining width-in-column).
			 */
			const uint8_t *p = src
				+ (size_t)col * NC12_TILE_W * src_col_stride
				+ (size_t)y * NC12_TILE_W
				+ in_col;
			memcpy(drow + x, p, n);
			x += n;
		}
	}
}

void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
			  const uint8_t *src_y, unsigned int src_col_stride,
			  unsigned int width, unsigned int height)
{
	nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
				 width, height);
}

void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
			   const uint8_t *src_uv, unsigned int src_col_stride,
			   unsigned int width, unsigned int uv_height)
{
	/* UV plane (CbCr interleaved): byte-width equals Y-plane width
	 * (one Cb + one Cr per 2x2 Y block → 2 bytes per 2 horizontal Y
	 * samples → 1 byte per Y pixel horizontally). Height is half. */
	nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
				 width, uv_height);
}

unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
					 unsigned int image_height)
{
	unsigned int aligned_h = (image_height + 7) & ~7u;

	/* The offset is within-column, so it does not depend on the width. */
	(void)image_width;

	/*
	 * In the COL128 SAND layout, Y and UV are NOT separate planes
	 * concatenated end-to-end. Within EACH 128-pixel-wide column
	 * (h = image height aligned up to 8):
	 *   first 128 * h bytes     = Y data for this column strip
	 *   next  128 * h / 2 bytes = UV data for this column strip
	 *   total 128 * bytesperline (= 128 * h * 3/2) bytes per column
	 *
	 * The "UV plane base" pointer (data[1] in AVFrame convention) is
	 * just data[0] + (128 * h): the offset of the UV bytes
	 * WITHIN the first column. All subsequent UV bytes are reached by
	 * the same column-stride arithmetic the Y plane uses (col *
	 * 128 * bytesperline + y * 128 + in_col), so passing this offset
	 * pointer + iterating y over [0, height/2) traverses all UV rows
	 * across all columns correctly.
	 *
	 * Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
	 * sizeof(linear Y plane)); that pushed past the end of the SAND
	 * buffer because the layout isn't planes-end-to-end.
	 *
	 * Cross-check: kernel sizeimage = bytesperline * aligned width =
	 * (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
	 * aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
	 * per column: 128 * aligned_h. UV portion per column: half of Y.
	 * Sum across columns: matches sizeimage.
	 */
	return NC12_TILE_W * aligned_h;
}
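
One way to exercise a detile routine of this shape is a tile-then-detile round trip. The sketch below is self-contained and does not use the functions above: sand_tile_plane, sand_detile_plane and sand_roundtrip_ok are hypothetical names, and the detile loop is a re-statement of the row-chunked approach, not the file's own code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define COL_W 128u /* tile column width in bytes */

/* Hypothetical test helper: scatter a linear plane into column-major
 * 128-byte-wide tiles, one byte at a time. */
static void sand_tile_plane(uint8_t *sand, unsigned int col_stride,
                            const uint8_t *lin, unsigned int width,
                            unsigned int height)
{
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            sand[(size_t)(x / COL_W) * COL_W * col_stride
                 + (size_t)y * COL_W + x % COL_W] = lin[(size_t)y * width + x];
}

/* Row-chunked detile: one memcpy per tile column crossed by the row. */
static void sand_detile_plane(uint8_t *lin, const uint8_t *sand,
                              unsigned int col_stride, unsigned int width,
                              unsigned int height)
{
    for (unsigned int y = 0; y < height; y++) {
        for (unsigned int x = 0; x < width; ) {
            unsigned int in_col = x % COL_W;
            unsigned int n = COL_W - in_col;

            if (n > width - x)
                n = width - x; /* short final chunk for misaligned widths */
            memcpy(lin + (size_t)y * width + x,
                   sand + (size_t)(x / COL_W) * COL_W * col_stride
                        + (size_t)y * COL_W + in_col,
                   n);
            x += n;
        }
    }
}

/* Returns 1 if tiling then detiling reproduces the input plane. */
static int sand_roundtrip_ok(unsigned int width, unsigned int height)
{
    unsigned int num_cols = (width + COL_W - 1u) / COL_W;
    size_t lin_size = (size_t)width * height;
    size_t sand_size = (size_t)num_cols * COL_W * height;
    uint8_t *in = malloc(lin_size);
    uint8_t *sand = calloc(sand_size, 1);
    uint8_t *out = calloc(lin_size, 1);
    int ok = 0;

    assert(width > 0 && height > 0);
    if (in && sand && out) {
        for (size_t i = 0; i < lin_size; i++)
            in[i] = (uint8_t)(i * 31u + 7u); /* arbitrary test pattern */
        sand_tile_plane(sand, height, in, width, height);
        sand_detile_plane(out, sand, height, width, height);
        ok = memcmp(in, out, lin_size) == 0;
    }
    free(in);
    free(sand);
    free(out);
    return ok;
}
```

A round trip only shows that the two directions are inverses of each other. Because both share the same offset formula, it would not catch a layout misunderstanding, so hand-crafted source bytes remain the stronger check.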