Phase 8.8: throughput baseline + multi-codec streams + HDR

Per the correctness-before-speed principle: measure before
optimising. The roadmap going in called for "QPU dispatch
substitution to hit 30fps@1080p". Measurement on hertz shows the
FFmpeg software path already hits 65-88 fps@1080p across all
three codecs (VP9, AV1, H.264), so QPU substitution would be
premature optimisation. 8.8 therefore ships what is actually
useful:

1. Per-frame timing in test_m2m_stream.
2. Multi-frame AV1 + H.264 streams verified byte-exact at
   1080p (closes the "VP9-only stream tests" gap from 8.7).
3. HDR / 10-bit via V4L2_PIX_FMT_P010 + daemon
   pack_p010_to_plane.

Test harness (tools/test_m2m_stream.c):
- Per-frame µs timing via CLOCK_MONOTONIC; reports
  mean/p50/p99/min/max + wall ms + fps.
- Annex-B H.264 parser: split on 3-/4-byte start codes,
  accumulate NALs into access units (push on VCL NAL types
  1 or 5). Without AU grouping FFmpeg rejects SPS/PPS-only
  buffers as "no frame!".
- Format auto-detect (DKIF magic → IVF; else Annex-B).
- Optional 6th arg `[capture]`: nv12m | p010.
- CAPTURE mmap path generalised for num_planes==1 (P010).
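
The per-frame timing summary listed above could be computed along the following lines. This is a minimal sketch and not the code in tools/test_m2m_stream.c: `now_us`, `rank_index`, `frame_stats` and `frame_stats_compute` are hypothetical names, and nearest-rank percentiles are an assumption (the harness may pick percentiles differently).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic timestamp in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000u + (uint64_t) ts.tv_nsec / 1000u;
}

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *) a, y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile: index of the ceil(pct/100 * n)-th sorted sample. */
static size_t rank_index(size_t n, unsigned int pct)
{
	size_t rank = (n * pct + 99) / 100;

	return rank ? rank - 1 : 0;
}

struct frame_stats {
	uint64_t min, max, p50, p99;	/* microseconds */
	double mean;			/* microseconds */
};

/* Sorts the samples in place. Returns 0 on success, -1 on bad input. */
static int frame_stats_compute(uint64_t *us, size_t n, struct frame_stats *out)
{
	uint64_t sum = 0;
	size_t i;

	if (!us || !out || n == 0)
		return -1;
	qsort(us, n, sizeof(*us), cmp_u64);
	for (i = 0; i < n; i++)
		sum += us[i];
	out->min = us[0];
	out->max = us[n - 1];
	out->mean = (double) sum / (double) n;
	out->p50 = us[rank_index(n, 50)];
	out->p99 = us[rank_index(n, 99)];
	return 0;
}
```

A caller would take `now_us()` before queueing a frame and again after dequeuing it, store the difference, and pass the array to `frame_stats_compute` once the stream ends. With nearest-rank percentiles and only 30 samples, p99 is simply the slowest frame.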

Kernel (kernel/daedalus_v4l2_main.c):
- CAPTURE formats array {NV12M, P010}; enum_fmt walks it.
- daedalus_fill_capture_fmt takes a fourcc:
    NV12M: 2 planes, W*H + W*H/2 bytes, bpl=W
    P010:  1 plane,  W*H*2 + W*H bytes, bpl=W*2
- try_fmt preserves caller fourcc when supported.
- daedalus_complete_resp_frame's dmabuf path now sets each
  plane's payload to vb2_plane_size(vb,p). This generalises
  cleanly across 1-plane (P010) and 2-plane (NV12M) layouts;
  the daemon fully populates the plane so payload =
  sizeimage.
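
The plane-size arithmetic for the two CAPTURE formats can be written out as follows. This is a sketch under stated assumptions, not the kernel's daedalus_fill_capture_fmt: `cap_fmt`, `cap_layout` and `cap_layout_fill` are hypothetical names, the enum stands in for the real V4L2 fourccs, and width and height are assumed to be even.

```c
#include <stdint.h>

/* Hypothetical stand-ins for V4L2_PIX_FMT_NV12M and V4L2_PIX_FMT_P010. */
enum cap_fmt { CAP_NV12M, CAP_P010 };

struct cap_layout {
	unsigned int num_planes;
	uint32_t bytesperline[2];
	uint32_t sizeimage[2];
};

/* Fill the layout for a w x h frame. Returns -1 for an unknown format. */
static int cap_layout_fill(enum cap_fmt fmt, uint32_t w, uint32_t h,
			   struct cap_layout *out)
{
	switch (fmt) {
	case CAP_NV12M:
		/* 8-bit: luma plane, then half-height interleaved CbCr plane. */
		out->num_planes = 2;
		out->bytesperline[0] = w;
		out->bytesperline[1] = w;
		out->sizeimage[0] = w * h;
		out->sizeimage[1] = w * h / 2;
		return 0;
	case CAP_P010:
		/* 16-bit words: Y (w*h*2 bytes) followed by CbCr (w*h bytes). */
		out->num_planes = 1;
		out->bytesperline[0] = w * 2;
		out->bytesperline[1] = 0;
		out->sizeimage[0] = w * h * 2 + w * h;
		out->sizeimage[1] = 0;
		return 0;
	}
	return -1;
}
```

For 1920×1080 this gives 2,073,600 + 1,036,800 bytes for NV12M and 6,220,800 bytes for P010; ten P010 frames come to 62,208,000 bytes, which agrees with the 62 MB output reported for the HDR run.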

Daemon (daemon/src/decoder.c):
- pack_p010_to_plane: YUV420P10LE → P010 single-plane.
  10-bit samples shifted left by 6 to MSB-align in 16-bit
  words per V4L2 ABI. Y at base+0, interleaved CbCr right
  after Y plane (per format spec for single-plane P010).
  Strips source stride padding; respects destination stride.
- daedalus_decoder_run_request dispatches on
  req->capture_pix_fmt (NV12M → pack_nv12_to_planes; P010
  → pack_p010_to_plane; else warn + skip).
- Includes <linux/videodev2.h> for fourcc constants.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

1080p throughput baseline (30 frames testsrc, dmabuf path):
  VP9   1080p: mean 12.0 ms, p99 15.9 ms, fps **83.1**, byte-exact ✓
  AV1   1080p: mean 15.4 ms, p99 41.0 ms, fps **65.0**, byte-exact ✓
  H.264 1080p: mean 11.3 ms, p99 21.5 ms, fps **88.3**, byte-exact ✓
All three codecs run at 2-3× the 30 fps floor.

HDR / 10-bit 1080p P010:
  10 frames, 62 MB output, fps **48.8**, byte-exact vs
  `ffmpeg -pix_fmt p010le -f rawvideo`.
  Small-frame P010 (320×240): fps 966; fixed daemon overhead
  dominates at low resolutions.

v4l2-compliance unchanged from 8.7: 49/49 passing.
Format enumeration confirms NM12 (the NV12M fourcc) + P010 on
CAPTURE.
Clean SIGTERM + rmmod; no kernel oops/WARN.

Roadmap update (docs/roadmap.md):
- 8.8 marked closed with closure-doc reference, including
  the explicit "QPU substitution not needed" rationale.
- 8.9 reshaped: libva-v4l2-request consumer integration
  (per project_consumer_target memory), the actual
  user-facing endpoint.

Per correctness-before-speed:
- Measured first; QPU work explicitly ruled out by the data.
- Byte-exact pixel comparison for every codec/format combo
  (NV12: VP9, AV1, H.264; P010: VP9 10-bit at 320×240 and
  1080p).
- AU grouping in the Annex-B parser is the correct
  semantic boundary, not just a workaround.
- vb2_plane_size for payload generalises to any plane
  count, not hardcoded to 2.
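
The Annex-B access-unit grouping mentioned above can be sketched as below. This is an illustration, not the parser in tools/test_m2m_stream.c: `next_nal` and `split_access_units` are hypothetical names, and the sketch assumes one slice per picture, so every VCL NAL (type 1 or 5) closes an access unit and any SPS/PPS/SEI in front of it rides along in the same unit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find the next 00 00 01 start code at or after pos. Returns the offset
 * of the first NAL byte after it, or len if there is none. A 4-byte
 * start code (00 00 00 01) is matched through its last three bytes.
 */
static size_t next_nal(const uint8_t *buf, size_t len, size_t pos)
{
	size_t i;

	for (i = pos; i + 3 <= len; i++)
		if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1)
			return i + 3;
	return len;
}

/*
 * Record in ends[] the byte offset one past each access unit. AU k
 * spans [k ? ends[k - 1] : 0, ends[k]). Returns the number of AUs found.
 */
static size_t split_access_units(const uint8_t *buf, size_t len,
				 size_t *ends, size_t max_aus)
{
	size_t n = 0;
	size_t nal = next_nal(buf, len, 0);

	while (nal < len && n < max_aus) {
		size_t next = next_nal(buf, len, nal);
		size_t end = (next < len) ? next - 3 : len;
		uint8_t type = buf[nal] & 0x1f;	/* nal_unit_type */

		/* Leading zero of a 4-byte start code belongs to the next NAL. */
		if (next < len && end > nal && buf[end - 1] == 0)
			end--;
		if (type == 1 || type == 5)	/* non-IDR or IDR slice */
			ends[n++] = end;
		nal = next;
	}
	return n;
}
```

A buffer holding only SPS and PPS yields zero access units here, which matches the reason for grouping: such a buffer submitted on its own gives the decoder nothing to output.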

Phase 8.9 next: libva-v4l2-request integration, closing
the loop from YouTube/Firefox to /dev/video0 + daemon
playback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+98
-8
@@ -10,6 +10,8 @@
 #include <stdlib.h>
 #include <string.h>
 
+#include <linux/videodev2.h>
+
 #include <libavcodec/avcodec.h>
 #include <libavutil/pixfmt.h>
 
@@ -157,6 +159,80 @@ static int decoder_open_codec(struct daedalus_decoder *dec, uint32_t codec_id,
  * Returns 0 on success, -EINVAL if the source is not planar 4:2:0
  * (Phase 8.6 still expects yuv420p-class outputs; 8.7 widens).
  */
+/*
+ * Pack 10-bit planar YUV420P10LE into V4L2_PIX_FMT_P010 single
+ * plane: Y plane (width × 2 bytes per pixel, height rows) +
+ * interleaved CbCr plane at half-res (cw*2 bytes per row, ch
+ * rows). P010 stores 10-bit samples in 16-bit words,
+ * MSB-aligned (low 6 bits zero). libav's YUV420P10LE delivers
+ * 10-bit samples in the LOW 10 bits, so we shift left by 6.
+ *
+ * The single-plane layout means Y and CbCr are concatenated in
+ * planes->base[0]; planes->stride[0] is the Y stride (which we
+ * also use for the CbCr rows since both have the same
+ * per-line byte count for 4:2:0 with interleaved chroma).
+ */
+static int pack_p010_to_plane(struct AVFrame *fr,
+			      const AVPixFmtDescriptor *desc,
+			      const struct daedalus_capture_planes *planes)
+{
+	int h = fr->height;
+	int w = fr->width;
+	int cw, ch, y, x;
+	uint8_t *base;
+	uint32_t stride;
+	uint8_t *dst_y, *dst_uv;
+	size_t y_size;
+
+	if (!desc || !planes || planes->nr < 1)
+		return -EINVAL;
+	if (desc->nb_components < 3)
+		return -EINVAL;
+	if (desc->log2_chroma_w != 1 || desc->log2_chroma_h != 1)
+		return -EINVAL;
+	/* Only 10-bit-per-sample sources packed into 16 bits per
+	 * libav convention. Anything else needs its own path. */
+	if (desc->comp[0].depth != 10)
+		return -EINVAL;
+
+	cw = AV_CEIL_RSHIFT(w, desc->log2_chroma_w);
+	ch = AV_CEIL_RSHIFT(h, desc->log2_chroma_h);
+
+	base = planes->base[0];
+	stride = planes->stride[0] ? planes->stride[0] : (uint32_t) (w * 2);
+	if (!base)
+		return -EINVAL;
+
+	dst_y = base;
+	y_size = (size_t) stride * (size_t) h;
+	dst_uv = base + y_size;
+
+	/* Y plane: shift 10-bit → MSB-aligned 16-bit. */
+	for (y = 0; y < h; y++) {
+		const uint16_t *src = (const uint16_t *) (fr->data[0] +
+				(size_t) y * fr->linesize[0]);
+		uint16_t *dst = (uint16_t *) (dst_y +
+				(size_t) y * stride);
+		for (x = 0; x < w; x++)
+			dst[x] = (uint16_t) (src[x] << 6);
+	}
+
+	/* Interleave Cb/Cr at half-res, also MSB-aligned. */
+	for (y = 0; y < ch; y++) {
+		const uint16_t *u = (const uint16_t *) (fr->data[1] +
+				(size_t) y * fr->linesize[1]);
+		const uint16_t *v = (const uint16_t *) (fr->data[2] +
+				(size_t) y * fr->linesize[2]);
+		uint16_t *dst = (uint16_t *) (dst_uv +
+				(size_t) y * stride);
+		for (x = 0; x < cw; x++) {
+			dst[x * 2 + 0] = (uint16_t) (u[x] << 6);
+			dst[x * 2 + 1] = (uint16_t) (v[x] << 6);
+		}
+	}
+	return 0;
+}
+
 static int pack_nv12_to_planes(struct AVFrame *fr,
 			       const AVPixFmtDescriptor *desc,
 			       const struct daedalus_capture_planes *planes)
@@ -337,16 +413,30 @@ int daedalus_decoder_run_request(struct daedalus_decoder *dec,
 	resp->fnv1a_yuv = h;
 
 	/*
-	 * Pack pixels as NV12 directly into the mapped CAPTURE
-	 * dmabuf planes. No copy into a wire buffer — pixels
-	 * land in the V4L2 client's CAPTURE buffer the moment
-	 * the write touches the mmap.
+	 * Pack pixels directly into the mapped CAPTURE dmabuf
+	 * planes. Dispatch on the V4L2 fourcc the kernel
+	 * negotiated:
+	 *   V4L2_PIX_FMT_NV12M (default, 8-bit, 2 planes)
+	 *   V4L2_PIX_FMT_P010 (10-bit HDR, 1 plane)
 	 */
-	if (planes && planes->nr >= 2) {
-		int prc = pack_nv12_to_planes(fr, desc, planes);
+	if (planes && planes->nr >= 1) {
+		int prc = 0;
+		switch (req->capture_pix_fmt) {
+		case V4L2_PIX_FMT_NV12M:
+			prc = pack_nv12_to_planes(fr, desc, planes);
+			break;
+		case V4L2_PIX_FMT_P010:
+			prc = pack_p010_to_plane(fr, desc, planes);
+			break;
+		default:
+			log_warn("decoder: unsupported capture fourcc 0x%08x",
+				 req->capture_pix_fmt);
+			prc = -EINVAL;
+			break;
+		}
 		if (prc < 0)
-			log_warn("decoder: NV12-pack-to-planes failed (pix_fmt=%d planes=%d) — kernel will see metadata only",
-				 fr->format, planes->nr);
+			log_warn("decoder: pack failed (pix_fmt=%d cap_fourcc=0x%08x) — kernel will see metadata only",
+				 fr->format, req->capture_pix_fmt);
 	}
 
 	log_info("decoder: OK %dx%d fmt=%d (%s) fnv1a=0x%08x luma=%u chroma=%u",