forked from marfrit/libva-v4l2-request-fourier
iter40: Pi 5 HEVC chapter — backend integration lands, bit-exact pending
Phase 6 implementation. Backend builds clean on higgs (Debian 13 trixie, aarch64), vainfo lists VAProfileHEVCMain via rpi-hevc-dec, multi-device probe finds /dev/video19 + /dev/media1, CreateContext + S_FMT + REQBUFS + STREAMON all succeed.

Phase 7 partial: infrastructure works, 10 frames flow through the pipeline (correct byte counts produced: 13824000 for 1280x720 x 10 NV12 frames). But every DQBUF CAPTURE returns V4L2_BUF_FLAG_ERROR so output content is wrong (libva sha != kdirect sha). The decode itself is failing on the rpi-hevc-dec side despite all ctrl submissions returning success.

Code changes:
- request.h: video_fd_rpi_hevc_dec / media_fd_rpi_hevc_dec slots + has_hevc_ext_sps_rps_rpi_hevc_dec flag (mirrors iter38 + iter2 pair-of-flags pattern, naturally false on Pi).
- request.c: known_decoder_drivers gains rpi-hevc-dec; primary-driver probe gets an else-if branch setting the new fds (Phase 5 F3); request_switch_device_for_profile prefers 'p' for HEVC when rpi-hevc-dec present.
- context.c: per-fd want_pixfmt (NC12 on Pi), capture_pixelformat taken from video_format slot (not hardcoded NV12/NV15); synthetic-SPS pre-seed gated off for Pi (Phase 5 F6); destination_sizes uses nv12_col128_uv_plane_offset for NC12 SAND layout (Phase 5 F2); per-driver HEVC_START_CODE (NONE on Pi, ANNEX_B on RK); per-driver context_object->h264_start_code (skip prepend on Pi).
- video.c: NV12_COL128 video_format entry (8-bit SAND, single buffer, 2 planes, NV12 drm_format with MOD_NONE so detile branch fires rather than tiled_to_planar).
- nv12_col128.c/.h: detile primitive (Y + UV per-plane, kernel hevc_d_video.c bytesperline formula + ffmpeg/Kynesim per-pixel offset). UV plane offset = 128 * ALIGN(h, 8), within-column (SAND interleaves Y+UV per column, NOT plane-concatenated; earlier wrong formula caught by Phase 7 SEGV).
- image.c: #ifdef __arm__ extended to __arm__ || __aarch64__ (Phase 5 F1: guard was killing detile path on all aarch64 hosts including fresnel iter39 NV15 path, masked because 10-bit never exercised); RequestCreateImage NC12 → NV12 stride override (linear width, not column-stride); copy_surface_to_image NC12 detile branch (gates on fourcc + v4l2_format).
- nv15.h: fallback V4L2_PIX_FMT_NV15 define (Debian 13 headers omit it though they have NC12).
- nv12_col128.h: fallback V4L2_PIX_FMT_NV12_COL128 + V4L2_PIX_FMT_NV12_10_COL128 (Arch / mainline pre-Pi headers).
- tests/test_nv12_col128_detile.c: hand-crafted-bytes unit test; passes (8 cases: Y + UV for 4 widths incl. 1366 misaligned; UV-offset helper).
- meson.build / nv12_col128 sources listed.

Phase 7 status: not yet bit-exact. Remaining diagnosis: per-frame S_EXT_CTRLS payload diff vs kdirect (kdirect sends 4 ctrls SPS+PPS+decode_params+slice_array; ours sends 5 incl. scaling_matrix; field ordering differs). Likely the slice_array contents need per-driver handling for rpi-hevc-dec's expected layout. Beyond in-session reach.

iter38 5/5 baseline on fresnel + ampere should be unaffected (new fd stays -1 on non-Pi hosts; all gates either short-circuit on fd-not-present or no-op).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
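The byte counts and the UV offset quoted in the message can be checked by hand. The following is a minimal sketch of that arithmetic, not code from this commit: the `sketch_*` names are hypothetical, and the COL128 buffer size assumes the per-column layout described in nv12_col128.c (ALIGN(h, 8) Y rows followed by half as many UV rows in each 128-byte column).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_COL_W 128u /* tile column width in bytes, per the COL128 name */

/* Linear NV12: full-size Y plane plus a half-height interleaved CbCr plane. */
static size_t sketch_nv12_linear_size(unsigned int w, unsigned int h)
{
	return (size_t)w * h * 3u / 2u;
}

/* UV offset inside the first column, as stated above: 128 * ALIGN(h, 8). */
static size_t sketch_col128_uv_offset(unsigned int h)
{
	unsigned int aligned_h = (h + 7u) & ~7u;

	return (size_t)SKETCH_COL_W * aligned_h;
}

/*
 * Whole-buffer size under the assumed layout: every 128-byte column
 * holds aligned_h Y rows followed by aligned_h / 2 UV rows.
 */
static size_t sketch_col128_size(unsigned int w, unsigned int h)
{
	unsigned int aligned_h = (h + 7u) & ~7u;
	unsigned int cols = (w + SKETCH_COL_W - 1u) / SKETCH_COL_W;

	return (size_t)cols * SKETCH_COL_W * aligned_h * 3u / 2u;
}

/* Re-derives the 10-frame 1280x720 byte count quoted in the message. */
static void sketch_check_720p(void)
{
	assert(sketch_nv12_linear_size(1280, 720) * 10 == 13824000);
	assert(sketch_col128_uv_offset(720) == 92160);
}
```

For 1280x720 the width is already a multiple of 128 and the height a multiple of 8, so under these assumptions the COL128 buffer and the linear NV12 frame have the same size; only the byte order differs.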
@@ -0,0 +1,114 @@
/*
 * V4L2_PIX_FMT_NV12_COL128 → linear NV12 detile primitive. Pi 5 / CM5
 * rpi-hevc-dec CAPTURE. iter40 (2026-05-17).
 *
 * Math derived from kernel hevc_d_video.c (size formula) +
 * ffmpeg/Kynesim libavutil/rpi_sand_fn_pw.h (per-pixel offset). Each
 * output row is copied with one memcpy per 128-byte tile column it
 * spans; the copy for the last column is shorter when the width is
 * not a multiple of 128.
 *
 * No NEON / SIMD here: correctness first. Each output row generates
 * ceil(width / 128) memcpys of up to 128 bytes; for 1920x1080 that's
 * 15 per row, ~24300 small memcpys per frame (1080 Y rows + 540 UV
 * rows), fine for Phase 1 PoC.
 */

#include "nv12_col128.h"

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Tile column width in bytes. The 'COL128' name embeds this; if it ever
 * varies, take it from V4L2_PIX_FMT_NV12_COL128's kernel definition.
 */
#define NC12_TILE_W 128

/*
 * Common Y / UV plane detile: the layout is identical (single-byte per
 * pixel, column-major 128-wide tiles). The only thing that varies is
 * what plane the caller passes in. width here is plane width in bytes
 * (= image width for both Y and CbCr-interleaved NV12 UV); height is
 * plane height in pixels (image height for Y, image height / 2 for UV).
 */
static void nv12_col128_detile_plane(uint8_t *dst, unsigned int dst_stride,
				     const uint8_t *src,
				     unsigned int src_col_stride,
				     unsigned int width, unsigned int height)
{
	unsigned int y, x;

	for (y = 0; y < height; y++) {
		uint8_t *drow = dst + y * dst_stride;
		x = 0;
		while (x < width) {
			unsigned int col = x / NC12_TILE_W;
			unsigned int in_col = x % NC12_TILE_W;
			unsigned int n = NC12_TILE_W - in_col;
			if (n > width - x)
				n = width - x;
			/*
			 * Source byte = base + col*128*col_stride + y*128 + in_col
			 * Copy n contiguous bytes (all within this tile column,
			 * since n is capped at the remaining width-in-column).
			 */
			const uint8_t *p = src
				+ (size_t)col * NC12_TILE_W * src_col_stride
				+ (size_t)y * NC12_TILE_W
				+ in_col;
			memcpy(drow + x, p, n);
			x += n;
		}
	}
}

void nv12_col128_detile_y(uint8_t *dst, unsigned int dst_stride,
			  const uint8_t *src_y, unsigned int src_col_stride,
			  unsigned int width, unsigned int height)
{
	nv12_col128_detile_plane(dst, dst_stride, src_y, src_col_stride,
				 width, height);
}

void nv12_col128_detile_uv(uint8_t *dst, unsigned int dst_stride,
			   const uint8_t *src_uv, unsigned int src_col_stride,
			   unsigned int width, unsigned int uv_height)
{
	/* UV plane (CbCr interleaved): byte-width equals Y-plane width
	 * (one Cb + one Cr per 2x2 Y block → 2 bytes per 2 horizontal Y
	 * samples → 1 byte per Y pixel horizontally). Height is half. */
	nv12_col128_detile_plane(dst, dst_stride, src_uv, src_col_stride,
				 width, uv_height);
}

unsigned int nv12_col128_uv_plane_offset(unsigned int image_width,
					 unsigned int image_height)
{
	unsigned int aligned_h = (image_height + 7) & ~7u;

	/* The offset is per column, so it does not depend on the width. */
	(void)image_width;

	/*
	 * In the COL128 SAND layout, Y and UV are NOT separate planes
	 * concatenated end-to-end. Within EACH 128-pixel-wide column
	 * (height here means the height aligned up to 8, aligned_h):
	 *   first 128 * height bytes     = Y data for this column strip
	 *   next  128 * height / 2 bytes = UV data for this column strip
	 *   total 128 * bytesperline (= 128 * height * 3/2) bytes per column
	 *
	 * The "UV plane base" pointer (data[1] in AVFrame convention) is
	 * just data[0] + (128 * aligned_h): the offset of the UV bytes
	 * WITHIN the first column. All subsequent UV bytes are reached by
	 * the same column-stride arithmetic the Y plane uses (col *
	 * 128 * bytesperline + y * 128 + in_col), so passing this offset
	 * pointer + iterating y over [0, height/2) traverses all UV rows
	 * across all columns correctly.
	 *
	 * Earlier wrong formula was num_columns * 128 * aligned_h (i.e.
	 * sizeof(linear Y plane)); that pushed past the end of the SAND
	 * buffer because the layout isn't planes-end-to-end.
	 *
	 * Cross-check: kernel sizeimage = bytesperline * width =
	 * (aligned_h * 3/2) * num_columns * 128 = num_columns * 128 *
	 * aligned_h * 3/2. Per column: 128 * aligned_h * 3/2. Y portion
	 * per column: 128 * aligned_h. UV portion per column: half of Y.
	 * Sum across columns: matches sizeimage.
	 */
	return NC12_TILE_W * aligned_h;
}
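The source-offset formula used by the detile loop (col * 128 * col_stride + y * 128 + in_col) can be exercised without hardware by tiling a known pattern and walking it back out. The sketch below is a standalone illustration, not the repository's tests/test_nv12_col128_detile.c: all `sketch_*` names are hypothetical, it covers only the Y rows, and it assumes col_stride = ALIGN(height, 8) * 3 / 2 as in the cross-check comment above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TILE_W 128u

/* Byte offset of plane-local (x, y) inside a COL128 buffer. */
static size_t sketch_sand_offset(unsigned int x, unsigned int y,
				 unsigned int col_stride)
{
	return (size_t)(x / SKETCH_TILE_W) * SKETCH_TILE_W * col_stride
		+ (size_t)y * SKETCH_TILE_W
		+ (x % SKETCH_TILE_W);
}

/*
 * Tile a position-dependent pattern into a COL128-shaped buffer, detile
 * it with the same per-column memcpy walk as the loop above, and return
 * the number of mismatching bytes (~0u on allocation failure).
 */
static unsigned int sketch_roundtrip_mismatches(unsigned int width,
						unsigned int height)
{
	unsigned int aligned_h = (height + 7u) & ~7u;
	unsigned int col_stride = aligned_h * 3u / 2u; /* Y + UV rows per column */
	unsigned int cols = (width + SKETCH_TILE_W - 1u) / SKETCH_TILE_W;
	size_t sand_size = (size_t)cols * SKETCH_TILE_W * col_stride;
	uint8_t *sand = calloc(sand_size, 1);
	uint8_t *lin = calloc((size_t)width * height, 1);
	unsigned int x, y, bad = 0;

	if (!sand || !lin) {
		free(sand);
		free(lin);
		return ~0u;
	}

	/* Tile: place each Y sample at its COL128 offset. */
	for (y = 0; y < height; y++)
		for (x = 0; x < width; x++)
			sand[sketch_sand_offset(x, y, col_stride)] =
				(uint8_t)(x * 7u + y * 13u);

	/* Detile: one memcpy per tile column per row, last one may be short. */
	for (y = 0; y < height; y++) {
		for (x = 0; x < width; ) {
			unsigned int n = SKETCH_TILE_W - (x % SKETCH_TILE_W);

			if (n > width - x)
				n = width - x;
			memcpy(lin + (size_t)y * width + x,
			       sand + sketch_sand_offset(x, y, col_stride), n);
			x += n;
		}
	}

	/* Compare the linear result against the pattern. */
	for (y = 0; y < height; y++)
		for (x = 0; x < width; x++)
			if (lin[(size_t)y * width + x] != (uint8_t)(x * 7u + y * 13u))
				bad++;

	free(sand);
	free(lin);
	return bad;
}
```

Calling `sketch_roundtrip_mismatches(1366, 16)` exercises a width that is not a multiple of 128, which is the case where the final copy in each row is shorter than a full tile column.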