iter40: Pi 5 HEVC chapter — backend integration lands, bit-exact pending

Phase 6 implementation. Backend builds clean on higgs (Debian 13 trixie, aarch64), vainfo lists VAProfileHEVCMain via rpi-hevc-dec, multi-device probe finds /dev/video19 + /dev/media1, CreateContext + S_FMT + REQBUFS + STREAMON all succeed.

Phase 7 partial: infrastructure works, 10 frames flow through the pipeline (correct byte counts produced: 13824000 for 1280x720 x 10 NV12 frames). But every DQBUF CAPTURE returns V4L2_BUF_FLAG_ERROR so output content is wrong (libva sha != kdirect sha). The decode itself is failing on the rpi-hevc-dec side despite all ctrl submissions returning success.

Code changes:

- request.h: video_fd_rpi_hevc_dec / media_fd_rpi_hevc_dec slots + has_hevc_ext_sps_rps_rpi_hevc_dec flag (mirrors iter38 + iter2 pair-of-flags pattern, naturally false on Pi).
- request.c: known_decoder_drivers gains rpi-hevc-dec; primary-driver probe gets an else-if branch setting the new fds (Phase 5 F3); request_switch_device_for_profile prefers 'p' for HEVC when rpi-hevc-dec present.
- context.c: per-fd want_pixfmt (NC12 on Pi), capture_pixelformat taken from video_format slot (not hardcoded NV12/NV15); synthetic-SPS pre-seed gated off for Pi (Phase 5 F6); destination_sizes uses nv12_col128_uv_plane_offset for NC12 SAND layout (Phase 5 F2); per-driver HEVC_START_CODE (NONE on Pi, ANNEX_B on RK); per-driver context_object->h264_start_code (skip prepend on Pi).
- video.c: NV12_COL128 video_format entry (8-bit SAND, single buffer, 2 planes, NV12 drm_format with MOD_NONE so detile branch fires rather than tiled_to_planar).
- nv12_col128.c/.h: detile primitive (Y + UV per-plane, kernel hevc_d_video.c bytesperline formula + ffmpeg/Kynesim per-pixel offset). UV plane offset = 128 * ALIGN(h, 8), i.e. within-column (SAND interleaves Y+UV per column, NOT plane-concatenated; earlier wrong formula caught by Phase 7 SEGV).
- image.c: #ifdef __arm__ extended to __arm__ || __aarch64__ (Phase 5 F1: guard was killing detile path on all aarch64 hosts including fresnel iter39 NV15 path, masked because 10-bit never exercised); RequestCreateImage NC12 → NV12 stride override (linear width, not column-stride); copy_surface_to_image NC12 detile branch (gates on fourcc + v4l2_format).
- nv15.h: fallback V4L2_PIX_FMT_NV15 define (Debian 13 headers omit it though they have NC12).
- nv12_col128.h: fallback V4L2_PIX_FMT_NV12_COL128 + V4L2_PIX_FMT_NV12_10_COL128 (Arch / mainline pre-Pi headers).
- tests/test_nv12_col128_detile.c: hand-crafted-bytes unit test; passes (8 cases: Y + UV for 4 widths incl. 1366 misaligned; UV-offset helper).
- meson.build: nv12_col128 sources listed.

Phase 7 status: not yet bit-exact. Remaining diagnosis: per-frame S_EXT_CTRLS payload diff vs kdirect (kdirect sends 4 ctrls SPS+PPS+decode_params+slice_array; ours sends 5 incl. scaling_matrix; field ordering differs). Likely the slice_array contents need per-driver handling for rpi-hevc-dec's expected layout. Beyond in-session reach.

iter38 5/5 baseline on fresnel + ampere should be unaffected (new fd stays -1 on non-Pi hosts; all gates either short-circuit on fd-not-present or no-op).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:17:14 +00:00
parent f1be489c75
commit 3ffa9d0d17
10 changed files with 706 additions and 31 deletions
@@ -0,0 +1,114 @@
+/*
+ * V4L2_PIX_FMT_NV12_COL128 → linear NV12 detile primitive. Pi 5 / CM5
+ * rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
+ *
+ * Math derived from kernel hevc_d_video.c (size formula) +
+ * ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). Each
+ * output row is walked one tile column at a time: every memcpy copies
+ * the bytes of that row that lie inside a single 128-byte-wide column
+ * (a full 128 bytes, or fewer in the last, partial column).
+ *
+ * No NEON / SIMD here; correctness first. Each output row takes
+ * ceil(width / 128) memcpys of up to 128 bytes; for 1920x1080 that's
+ * 15 per row, ~24300 small memcpys per frame (Y + UV), fine for Phase 1 PoC.
+ */
+
+#include "nv12_col128.h"
+
+#include <string.h>
+
+/*
+ * Tile column width in bytes. The 'COL128' name embeds this; if it ever
+ * varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
+ */
+#define NC12_TILE_W   128
+
+/*
+ * Common Y / UV plane detile — the layout is identical (single-byte per
+ * pixel, column-major 128-wide tiles). The only thing that varies is
+ * what plane the caller passes in. width here is plane width in bytes
+ * (= image width for both Y and CbCr-interleaved NV12 UV); height is
+ * plane height in pixels (image height for Y, image height / 2 for UV).
+ */
+static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
+                                     const uint8_t *src,
+                                     unsigned int src_col_stride,
+                                     unsigned int width, unsigned int height)
+{
+	unsigned int y, x;
+
+	for (y = 0; y < height; y++) {
+		uint8_t *drow = dst + y * dst_stride;
+		x = 0;
+		while (x < width) {
+			unsigned int col = x / NC12_TILE_W;
+			unsigned int in_col = x % NC12_TILE_W;
+			unsigned int n = NC12_TILE_W - in_col;
+			if (n > width - x)
+				n = width - x;
+			/*
+			 * Source byte = base + col*128*col_stride + y*128 + in_col
+			 * Copy n contiguous bytes (all within this tile column,
+			 * since n is capped at the remaining width-in-column).
+			 */
+			const uint8_t *p = src
+				+ (size_t)col * NC12_TILE_W * src_col_stride
+				+ (size_t)y * NC12_TILE_W
+				+ in_col;
+			memcpy(drow + x, p, n);
+			x += n;
+		}
+	}
+}
+
+void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
+                          const uint8_t *src_y, unsigned int src_col_stride,
+                          unsigned int width, unsigned int height)
+{
+	nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
+				 width, height);
+}
+
+void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
+                           const uint8_t *src_uv, unsigned int src_col_stride,
+                           unsigned int width, unsigned int uv_height)
+{
+	/* UV plane (CbCr interleaved): byte-width equals Y-plane width
+	 * (one Cb + one Cr per 2x2 Y block → 2 bytes per 2 horizontal Y
+	 * samples → 1 byte per Y pixel horizontally). Height is half. */
+	nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
+				 width, uv_height);
+}
+
+unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
+                                         unsigned int image_height)
+{
+	unsigned int aligned_h = (image_height + 7) & ~7u;
+
+	/*
+	 * In the COL128 SAND layout, Y and UV are NOT separate planes
+	 * concatenated end-to-end. Within EACH 128-pixel-wide column:
+	 *   first 128 * aligned_h bytes     = Y data for this column strip
+	 *   next  128 * aligned_h / 2 bytes = UV data for this column strip
+	 *   total 128 * bytesperline (= 128 * aligned_h * 3/2) bytes per column
+	 *
+	 * The "UV plane base" pointer (data[1] in AVFrame convention) is
+	 * just data[0] + (128 * aligned_h): the offset of the UV bytes
+	 * WITHIN the first column. All subsequent UV bytes are reached by
+	 * the same column-stride arithmetic the Y plane uses (col *
+	 * 128 * bytesperline + y * 128 + in_col), so passing this offset
+	 * pointer + iterating y over [0, height/2) traverses all UV rows
+	 * across all columns correctly.
+	 *
+	 * Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
+	 * sizeof(linear Y plane)) — that pushed past the end of the SAND
+	 * buffer because the layout isn't planes-end-to-end.
+	 *
+	 * Cross-check: kernel sizeimage = bytesperline * width =
+	 * (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
+	 * aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
+	 * per column: 128 * aligned_h. UV portion per column: half of Y.
+	 * Sum across columns: matches sizeimage.
+	 */
+	return NC12_TILE_W * aligned_h;
+}
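
The within-column UV offset argued in the comment above can be sanity-checked outside the driver. The following is a minimal standalone sketch, not part of this commit: every sketch_* name is hypothetical, and it assumes (as the comment states) a column stride of aligned_h * 3/2 rows and a buffer size of num_columns * 128 * column stride. It maps every Y and UV byte through the same offset arithmetic the detile loop uses and checks that all offsets are in bounds and distinct; a second helper reproduces the earlier planes-end-to-end base to show where it runs past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_TILE_W 128u /* tile column width in bytes */

/* Byte offset of (x, y) relative to a plane base; col_stride is the
 * number of 128-byte rows per column (Y rows + UV rows). */
static size_t sketch_sand_offset(unsigned int x, unsigned int y,
                                 unsigned int col_stride)
{
	return (size_t)(x / SKETCH_TILE_W) * SKETCH_TILE_W * col_stride
	     + (size_t)y * SKETCH_TILE_W + (x % SKETCH_TILE_W);
}

/* Assumed buffer size: num_columns * 128 * col_stride. */
static size_t sketch_sand_size(unsigned int width, unsigned int col_stride)
{
	size_t cols = (width + SKETCH_TILE_W - 1) / SKETCH_TILE_W;
	return cols * SKETCH_TILE_W * col_stride;
}

/* Returns 1 if every Y and UV byte maps to a distinct in-bounds offset
 * when the UV base is 128 * aligned_h, else 0. */
static int sketch_sand_layout_ok(unsigned int width, unsigned int height)
{
	unsigned int aligned_h = (height + 7) & ~7u;
	unsigned int col_stride = aligned_h * 3 / 2;
	size_t uv_base = (size_t)SKETCH_TILE_W * aligned_h;
	size_t size = sketch_sand_size(width, col_stride);
	uint8_t *hits = calloc(size, 1); /* one hit counter per buffer byte */
	unsigned int x, y;
	int ok = 1;

	assert(hits != NULL);
	for (y = 0; y < height && ok; y++)
		for (x = 0; x < width && ok; x++) {
			size_t off = sketch_sand_offset(x, y, col_stride);
			ok = off < size && hits[off]++ == 0;
		}
	for (y = 0; y < height / 2 && ok; y++)
		for (x = 0; x < width && ok; x++) {
			size_t off = uv_base + sketch_sand_offset(x, y, col_stride);
			ok = off < size && hits[off]++ == 0;
		}
	free(hits);
	return ok;
}

/* Returns 1 if the earlier planes-end-to-end UV base
 * (num_columns * 128 * aligned_h) would index past the buffer end. */
static int sketch_wrong_uv_base_overflows(unsigned int width,
                                          unsigned int height)
{
	unsigned int aligned_h = (height + 7) & ~7u;
	unsigned int col_stride = aligned_h * 3 / 2;
	size_t cols = (width + SKETCH_TILE_W - 1) / SKETCH_TILE_W;
	size_t wrong_base = cols * SKETCH_TILE_W * aligned_h;
	size_t last = wrong_base
		+ sketch_sand_offset(width - 1, height / 2 - 1, col_stride);

	return last >= sketch_sand_size(width, col_stride);
}
```

Under these assumptions, a 1280x720 frame passes the layout check with the within-column base, while the planes-end-to-end base indexes past the end of the buffer. For an image at most 128 pixels wide the two bases coincide, which may be why a single-column test case would not catch the earlier formula.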