Commit Graph

17 Commits

Author SHA1 Message Date
claude-noether ea99dc8e27 ffmpeg-v4l2-request-fourier: preserve sl->mb for inspection callback (0017)
Companion to 0016 (PR #106).  Adds a coefficient side buffer in
H264Context, populated at the start of ff_h264_hl_decode_mb with a
single memcpy from sl->mb BEFORE IDCT-add zeros it.  The existing
post-pixel-work callback (still in 0016) can now read:
  - h->mb_inspect_coeffs  = pre-IDCT coefficients (this patch)
  - h->cur_pic.f->data    = post-pixel-work pre-deblock reconstruction

and derive P = pixels − IDCT(C) for daedalus-decoder's frame-major
dispatch in PR-A3+.

Memcpy gated on (h->mb_inspect_cb != NULL).  Zero cost when no
consumer is registered.  Side buffer = 16 * 48 int16 = 1536 bytes
(matches the 8-bit half of sl->mb's int16_t[16 * 48 * 2] declared
size; high-bit-depth uses the upper half — not preserved here since
the daedalus-decoder consumer is 8-bit-only).

Single-threaded decode assumed at the consumer side
(avctx->thread_count = 1).  Multi-slice / multi-threaded streams
would race on the single side buffer — explicit limitation of the
inspection mechanism, future extension would put per-slice buffers
in H264SliceContext.

Verified: patches 0016 + 0017 apply cleanly and build in sequence
against the Kwiboo v4l2-request-n8.1 fork at the pinned commit
b57fbbe5.  ff_h264_set_mb_inspect_cb symbol exported as before.

Wired into arch PKGBUILD + debian build-deb.sh patch sequence.
pkgrel bumped 13 → 14.

Refs reauktion/daedalus-decoder!14 (PR-A2 callback wiring complete,
PR-A3 coefficient extraction is the next consumer).
2026-05-26 09:46:10 +02:00
claude-noether 87cbb9b70a ffmpeg-v4l2-request-fourier: per-MB inspection callback for H.264 (0016)
Adds 0016-h264-mb-inspect-callback.patch to the FFmpeg fork.  Adds an
opt-in callback fired by ff_h264_hl_decode_mb after the existing
pixel work, for tools that need per-MB visibility into H.264 decode.

API:
  typedef void (*ff_h264_mb_inspect_cb)(void *opaque,
                                         const struct H264Context *h,
                                         int mb_x, int mb_y);
  void ff_h264_set_mb_inspect_cb(AVCodecContext *avctx,
                                  ff_h264_mb_inspect_cb cb, void *opaque);

Two new fields appended to H264Context (internal struct, declared in
h264dec.h not h264.h, no ABI surface to non-libavcodec callers).
Callback fires post-pixel-work for every MB in coded order; receives
const H264Context* so it can inspect any state (slice ctx via
h->slice_ctx, reconstructed pixels via h->cur_pic.f->data[plane],
etc.).

Default (cb==NULL): zero behaviour change, one load + one branch per
MB in the decoder hot path.

Shape distinction: per-MB observation, NOT per-kernel function-pointer
hijack (the 0003-0014 substitution-arc pattern that PR #105 reverted
+ daedalus-fourier PR #37's measurement-correction architecturally
retired).  Per-block synchronous Vulkan dispatch from libavcodec is
non-competitive; per-MB CPU-side observation feeding a per-frame
daedalus-decoder batch submit is the right shape (frame-major UMA
dispatch verdict, memory: dejavu).

Used by:
  - daedalus-decoder/tools/daedalus_decode_h264 (PR-A1b, follow-up)
  - future daedalus-v4l2 daemon refactor

Wired into arch PKGBUILD source[] + prepare() and debian build-deb.sh
patch sequence.  pkgrel bumped 12 → 13.

Refs reauktion/daedalus-decoder!12.
2026-05-26 05:57:49 +02:00
claude-noether f4047f3145 ffmpeg-v4l2-request-fourier: revert ctx flip — PR #36 was a measurement artifact (0015)
Reverts the no_qpu → qpu-capable ctx flip that landed via patch 0014
(marfrit-packages PR #104).

PR #104 was justified by daedalus-fourier PR #36's "QPU 4.30x faster
than CPU NEON" bench result.  That number was a measurement artifact:
v3d_runner.read_spv() did a bare cwd-relative fopen() with no path
search, so when the bench was run from the source dir (as in PR #36),
the SPVs at $builddir/v3d_*.spv were not found, every QPU dispatch
returned -1 fast, and the bench loop timed the failure path.

daedalus-fourier PR #37 fixes the SPV search + bench preflight.
Corrected numbers on hertz (Pi 5 V3D 7.1):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p sum: CPU 5.57 ms vs QPU 123.54 ms — QPU 22x SLOWER.

Until daedalus QPU dispatch overhead is actually competitive (separate
multi-task effort tracked on the daedalus-fourier side), libavcodec.so
substitution must stay on daedalus_ctx_create_no_qpu() so the host
processes (firefox-fourier RDD, mpv-fourier, daedalus_v4l2_daemon)
don't pessimize their H.264 decode path.

Adds 0015-h264-ctx-revert-to-no-qpu.patch (2-line revert of patch 0014)
to both arch PKGBUILD and debian build-deb.sh.  Both pkgrel bumped
11 → 12.  Refs reauktion/daedalus-fourier!37.
2026-05-25 21:47:33 +02:00
claude-noether 9c70ffffe7 ffmpeg-v4l2-request-fourier: flip libavcodec daedalus ctx no_qpu → qpu-capable (0014)
(Renumbered from 0013 — PR #102 landed 0013-h264-deblock-chroma-intra
while this PR was open, so the next free slot is 0014.)

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
would have meant pointless Vulkan init in every host process.

Two things changed since:

  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
     IDCT 4x4/8x8 + 8 deblock variants (luma+chroma × V+H × inter+intra)
     + 30 qpel positions.  See daedalus-fourier PRs #28-#35.

  2. Dispatch overhead has been hammered down — buffer pool in
     v3d_runner + persistent command buffer.  daedalus-fourier PR #36
     bench on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

PR #10's CPU-4x-faster-than-QPU verdict (which justified the original
no_qpu ctx choice) is reversed by ~17x.

This commit adds 0014-h264-ctx-qpu-capable.patch which flips both H.264
TUs (h264_idct_daedalus.c, h264_qpel_daedalus.c) from
daedalus_ctx_create_no_qpu() to daedalus_ctx_create().

daedalus_ctx_create() probes for a usable Vulkan device and falls back
to no_qpu mode if unavailable, so this is safe on hosts without V3D
(x86 build runners, Debian aarch64 builders without renderD, etc.).
Hosts WITH V3D (Pi 5 deployment targets) now route the H.264 hot-path
through V3D compute instead of CPU NEON.

Wired into both arch PKGBUILD (source[] + prepare()) and debian
build-deb.sh; both pkgrel bumped 10 → 11.

Refs reauktion/daedalus-fourier!36.
2026-05-25 21:18:18 +02:00
claude-noether 8f9487d355 ffmpeg-v4l2-request-fourier: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier (0013)
Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
branch.  4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
dispatch is 4:2:0-only).

The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
symbols + recipe wrappers).

Re-architects the chroma init: v_loop_filter_chroma_intra was previously
assigned unconditionally to the NEON variant (which works for both 4:2:0
and 4:2:2).  We now assign it INSIDE both branches of the chroma_format_idc
conditional — 4:2:0 picks daedalus, 4:2:2 keeps NEON.  No regression for
4:2:2 streams.

Same NEON-to-NEON via recipe shape as 0010 luma intra.

Closes the deblock substitution layer for the 4:2:0 / 8-bit hot path:
- 0005 luma_v non-intra ✓
- 0008 luma_h non-intra ✓
- 0009 chroma_v / chroma_h non-intra ✓
- 0010 luma_v / luma_h intra ✓
- 0013 chroma_v / chroma_h intra ✓

All 8 deblock variants for the common 4:2:0 path now route through
daedalus.  4:2:2 chroma + the chroma422 mbaff variants stay on in-tree
NEON.

Verified the patch applies cleanly on top of 0001-0012 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 14:22:02 +02:00
claude-noether 2732a022f8 ffmpeg-v4l2-request-fourier: route remaining H.264 qpel 8x8 positions through daedalus-fourier (0012)
Closes the H.264 qpel substitution.  Extends 0007 (which routed only
mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
tables, skipping mc00 (integer copy / pointer-only fast path).

29 substitutions total: 14 new put_ + 15 avg_.  Each wraps a single
daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY call (the dispatches
landed in daedalus-fourier PRs #15-#20).  Collapsed via a single
DEFINE_QPEL_WRAPPER macro on the libavcodec shim side so the diff
is uniform.

All recipe-table entries route AUTO to CPU NEON — no QPU shaders
for any qpel position other than mc20 yet.  Plumbing-only
NEON-to-NEON via the daedalus recipe layer; bit-exact against
the in-tree ff_*_h264_qpel8_*_neon path (each daedalus dispatch is
already bit-exact-gated by the corresponding fourier PR's test).

16x16 qpel tables ([0][...]) stay on the in-tree NEON.  daedalus
only exposes 8x8 today; 16x16 substitution can land once fourier
provides those variants.

Verified the patch applies cleanly on top of 0001-0011 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 14:05:56 +02:00
claude-noether d8aa3aae8d ffmpeg-v4l2-request-fourier: route H.264 chroma DC Hadamard through daedalus-fourier (0011)
Substitutes H264DSPContext.chroma_dc_dequant_idct in the 4:2:0 /
bit_depth=8 init path with a wrapper that composes the daedalus
chroma DC Hadamard primitive (daedalus-fourier PR #25) with the
qmul scaling FFmpeg's reference does in one fused function
(h264idct_template.c::ff_h264_chroma_dc_dequant_idct).

Algorithm per H.264 §8.5.11.1 / §8.5.11.2:
  1. Extract 4 DCs from the scattered positions in the per-MB
     coefficient buffer (stride=32, xStride=16)
  2. 2x2 Hadamard transform (daedalus primitive)
  3. qmul scale + >> 7, write back to original positions

Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c. The Hadamard
itself is gated by the fourier PR #23 7-case test suite (including
the H·H = 4·I algebraic invariant), and the public-API parity
test added in PR #25 confirms the src/ symbol matches the test ref.

4:2:2 chroma stays on the in-tree ff_h264_chroma422_dc_dequant_idct_c
path — same chroma_format_idc<=1 gating shape as 0009 chroma deblock.

Pin bump: _daedalus_fourier_commit / DAEDALUS_FOURIER_COMMIT bumped
to b9f9ff2a (post-PR #25) so the build picks up the public
daedalus_h264_chroma_dc_hadamard_2x2 symbol.

Verified the patch applies cleanly on top of 0001-0010 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:39:54 +02:00
claude-noether 45be17fbdf ffmpeg-v4l2-request-fourier: route H.264 luma intra deblock through daedalus-fourier (0010)
Adds the bS=4 intra-strength variants of the already-substituted
luma_v / luma_h deblock (0005, 0008).  Intra MBs and certain
inter-MB edges (4x4 transform boundaries inside an Intra_NxN
neighbour) force boundary strength to 4 per H.264 §8.7.2.1.

  H264DSPContext.v_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  H264DSPContext.h_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_h_intra

Both kernels landed in daedalus-fourier PR #11.  Recipe → CPU NEON
(no intra QPU shaders yet); plumbing-only NEON-to-NEON via daedalus.

Signature differs from bS<4: no tc0 argument.  Wrapper passes
daedalus_h264_deblock_meta with alpha/beta set; tc0[] is ignored by
the intra dispatch (bS=4 hardcodes the strength).

Chroma intra variants are deferred to a follow-up because the chroma
init has a 4:2:0 / 4:2:2 split (chroma_format_idc gating) — the
daedalus dispatch is 4:2:0-only and needs explicit conditional
substitution to avoid running on 4:2:2 chroma.

Verified the patch applies cleanly on top of 0001-0009 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:21:00 +02:00
claude-noether babb280410 ffmpeg-v4l2-request-fourier: route H.264 chroma v/h deblock through daedalus-fourier (0009)
Chroma siblings of 0005 (luma_v) and 0008 (luma_h).  Same
NEON-to-NEON pattern via the daedalus recipe layer:

  H264DSPContext.v_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_v
  H264DSPContext.h_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_h

Both kernels landed in daedalus-fourier PR #10.  Recipe table routes
AUTO to CPU NEON (no chroma QPU shaders yet), so this is plumbing-
only and stays bit-exact against the in-tree NEON.

Intra chroma (bS=4) loop filters remain on in-tree NEON;
daedalus_h264_deblock_meta covers the non-intra (bS<4) path.

Verified the patch applies cleanly on top of 0001-0008 against the
pinned upstream commit b57fbbe5 on hertz.  Wires the new patch into
both arch/PKGBUILD and debian/build-deb.sh.
2026-05-25 13:16:45 +02:00
claude-noether 624f83e877 ffmpeg-v4l2-request-fourier: route H.264 luma-h deblock through daedalus-fourier (0008)
Adds patch 0008 to the substitution arc, mirroring 0005's V variant
for H.264 non-intra bS<4 horizontal luma deblock.

  H264DSPContext.h_loop_filter_luma →
    daedalus_recipe_dispatch_h264_deblock_luma_h

The H kernel was added to daedalus-fourier in PR #9 (vendored
ff_h264_h_loop_filter_luma_neon, wired through the same CPU-dispatch
pattern as V).  Recipe table routes AUTO to CPU NEON (no QPU shader
for H yet), so this is a NEON-to-NEON substitution via the daedalus
recipe layer — same shape as 0005.

The libavcodec.so ctx remains no-QPU (daedalus_ctx_create_no_qpu),
matching the existing 0003/0004/0005/0007 patches.  Higher-cycle
QPU init waits for a feature-flag gating change in a separate PR.

Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
code; daedalus_h264_deblock_meta covers the non-intra path only.
A follow-up can route intra once daedalus-fourier exposes the
intra-h dispatch (the kernel already exists internally per fourier
PR #11).

Wires the new patch into both arch/PKGBUILD and debian/build-deb.sh
sequences.  Verified the patch applies cleanly on top of 0001-0007
against the pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:10:05 +02:00
claude-noether 0bfc4ab03e ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:32:29 +02:00
marfrit 5c69460722 ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder
FFmpeg 8.x dropped the H.264 decoder's low_delay code path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents h264_select_output_frame
from running the display-order DPB output queue.  The daedalus-v4l2
daemon's `ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` at
daemon/src/decoder.c:202 has been a silent no-op since the SONAME
61→62 jump landed in reauktion/daedalus-v4l2 PR #16; on Firefox
YouTube this re-introduced the 2-1-4-3 B-frame pair-swap that PR
#12's daemon flag was supposed to prevent.

Fix lives in libavcodec, not the daemon: restore the documented
LOW_DELAY semantics so the daemon (and any other V4L2-stateless-
style consumer) keeps the one-frame-per-send_packet decode-order
output contract it already declares.

## Patch

0006-h264-restore-low-delay.patch touches libavcodec/h264_slice.c:

- h264_select_output_frame: early-exit when LOW_DELAY is set.
  Emit the just-decoded picture as next_output_pic, mirror the
  corruption / recovery-point tracking the main path performs,
  skip delayed_pic[] / POC reorder machinery entirely.

- h264_field_start: suppress the SPS-driven
  `has_b_frames = sps->num_reorder_frames` clobber when LOW_DELAY
  is set.  Without this the per-slice bitstream_restriction_flag
  re-pickup would reintroduce a nonzero reorder buffer mid-stream
  even after the daemon set has_b_frames=0 at avcodec_open2.

## Why not daemon-side

A daemon SPS-rewrite (`num_reorder_frames=0`) was considered but
rejected: it works only for the daemon's reconstructed SPS NAL,
not for any in-band SPS the daemon dlopens libavformat to parse
in other code paths.  Restoring documented FFmpeg flag semantics
is the smaller, more durable change and keeps the daemon
interface stable.

## Packaging

- PKGREL/pkgrel bump to 9.
- No new build-deps, no Depends change.
- Substitution arc cycles 6/7/8 unchanged.

## Refs

- reauktion/daedalus-v4l2#11 / #12 (LOW_DELAY half-measure on
  daemon side, originally landed against FFmpeg 7.x).
- daemon/src/decoder.c:202 (`ctx->flags |= AV_CODEC_FLAG_LOW_DELAY`
  for H.264 only — unchanged, but now actually has effect again).
2026-05-22 14:20:37 +02:00
marfrit 29e0852d11 ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier
Cycle 8 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.v_loop_filter_luma — non-intra bS<4 vertical
luma deblock, called per macroblock-row edge from the slice deblock
loop in libavcodec/h264_loopfilter.c — now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.

## What

- Add 0005-h264-deblock-luma-v-daedalus-fourier.patch (in both arch/
  and debian/ ffmpeg-v4l2-request-fourier/).  Extends
  libavcodec/aarch64/h264_idct_daedalus.c with
  ff_h264_v_loop_filter_luma_daedalus (constructs a
  daedalus_h264_deblock_meta from FFmpeg's (alpha, beta, tc0[4]) and
  calls daedalus_recipe_dispatch_h264_deblock_luma_v with n_edges=1).
  Patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire
  c->v_loop_filter_luma to the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append patch + bump pkgrel/PKGREL
  to 8.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_deblock_luma_v.

## Why

Cycle 8 is marked "CPU primary; QPU opportunistic" in the daedalus-
fourier API docstring.  Per the hybrid substrate philosophy
("if there's a coprocessor, use it") we eventually want the QPU
opportunism active here.  But the libavcodec.so context is
process-global and shared with cycles 6/7 via pthread_once, and it
uses daedalus_ctx_create_no_qpu deliberately to avoid implicit
Vulkan init in arbitrary host processes (Firefox content, mpv-fourier,
ffmpeg-fourier CLI, ...).  Switching to daedalus_ctx_create here
without a feature flag would be a footgun.

So cycle 8 lands as plumbing-only NEON-by-recipe substitution for
now; opportunistic QPU enablement is a separate follow-up that adds
a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent.

## Scope NOT covered

- Intra (bS=4) loop filter c->v_loop_filter_luma_intra — daedalus's
  daedalus_h264_deblock_meta only covers the non-intra path.
- Horizontal-edge variant c->h_loop_filter_luma — separate kernel
  (not yet in daedalus-fourier API).
- Chroma loop filters — separate kernels.
- Bulk batching — single-edge dispatch wastes the kernel's n_edges>1
  amortization.  Same caveat as cycles 6/7; follow-up.
- QPU opportunism — see "Why" above.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11: reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #85 (cycle 7 IDCT 8×8)
- marfrit/daedalus-fourier cycle 8 close (deblock luma-v NEON green)
2026-05-22 12:17:14 +02:00
marfrit 493c762967 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier
Cycle 7 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct8_add — called per 8×8 block from the
High-profile intra-8×8-DCT decode path in libavcodec/h264_mb.c — now
dispatches through daedalus_recipe_dispatch_h264_idct8 instead of
ff_h264_idct8_add_neon.

## What

- Add 0004-h264-idct8-daedalus-fourier.patch (in both arch/ and debian/
  ffmpeg-v4l2-request-fourier/).  Extends libavcodec/aarch64/
  h264_idct_daedalus.c (introduced by 0003) with ff_h264_idct8_add_daedalus
  and a daedalus_recipe_dispatch_h264_idct8 call; patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct8_add to
  the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append the new patch to the
  apply list; bump pkgrel/PKGREL to 7.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_idct8.

## Why

The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8×8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of cycle 6.  Production validation of
cycle 6 on higgs Firefox YouTube: 3040 frames decoded cleanly,
avg_decode_us=3388 (no regression vs the pre-substitution ~4 ms
baseline).  Cycle 7 inherits the same shim's pthread_once context.

Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green; FFmpeg 8×8 block storage block[r + 8*c] matches daedalus
column-major convention).

## Scope NOT covered (deferred)

- Bulk c->idct8_add4 (inter 8×8-DCT macroblocks) stays on the
  in-tree NEON .S code; batched substitution with n_blocks>1 lands
  later alongside the cycle-6 bulk-paths work.
- High-bit-depth (10-bit) path untouched.
- Cycles 8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #78 (libxml2 ABI-skew workaround)
- marfrit/daedalus-fourier cycle 7 close (H.264 IDCT 8×8 NEON green)
2026-05-22 10:20:27 +02:00
marfrit e641d679d3 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 4×4 → daedalus-fourier
First cycle of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct_add — called per 4×4 block from the
intra-4×4 decode path in libavcodec/h264_mb.c — now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.

## What

- Add 0003-h264-idct4-daedalus-fourier.patch (in both arch/ and
  debian/ ffmpeg-v4l2-request-fourier/).  Creates
  libavcodec/aarch64/h264_idct_daedalus.c (ff_h264_idct_add_daedalus
  shim + lazy pthread_once context init via
  daedalus_ctx_create_no_qpu), patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct_add to
  the shim, adds the new .o to libavcodec/aarch64/Makefile.
- arch/PKGBUILD + debian/build-deb.sh: fetch + build
  daedalus-fourier (pinned at d87239d — lockstep with the
  daedalus-v4l2 daemon's inline build) with
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON into a per-build temp prefix,
  then pass --extra-cflags=-I.../include --extra-ldflags=-L.../lib
  --extra-libs="-ldaedalus_core -lvulkan -lpthread" to FFmpeg
  configure.  daedalus_core.a is static-linked into libavcodec.so.62.
- debian/control Depends gains libvulkan1 (daedalus_core PUBLIC-links
  Vulkan::Vulkan for the queryable QPU substrate; the no-QPU
  constructor still works at runtime but the loader needs
  libvulkan.so.1 present to dlopen libavcodec.so.62).
- arch/PKGBUILD depends gains vulkan-icd-loader, makedepends gains
  cmake / ninja / vulkan-headers.

## Why

The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4×4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
The point of this first cycle isn't perf wins — it's plumbing.  Once
the path is wired and stable, follow-up patches batch through the
bulk paths (idct_add16 / idct_add16intra / idct_add8) and stack
cycles 7/8/9 (IDCT 8×8, luma-v deblock, qpel mc20).

Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; FFmpeg's 4×4 block storage matches daedalus's column-major
convention).

## Scope NOT covered

- Bulk paths (idct_add16 / idct_add16intra / idct_add8) — most IDCT
  4×4 calls in real H.264 streams go through these, not the per-
  block c->idct_add path; intra-4×4-only macroblocks are a minority.
  Batched substitution lands in a follow-up.
- High-bit-depth (10-bit) path — not touched; 8-bit only.
- Cycles 7/8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.
No daedalus-v4l2-dkms or daedalus-v4l2 bump required.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit/daedalus-fourier cycle 6 close (H.264 IDCT 4×4 NEON green)
2026-05-21 21:44:35 +02:00
test0r 9e9447502e ffmpeg-v4l2-request-fourier: patch NV15→P010 unpack for Hi10P / Main10
The n8.1 pin's hwcontext_v4l2request.c deliberately blanks the
transfer-formats list for AV_PIX_FMT_YUV420P10 sw_format (the mapping
target for V4L2_PIX_FMT_NV15), so `ffmpeg -hwaccel v4l2request
-vf hwdownload,format=p010le` on a Hi10P / Main10 input failed at
filter-init with -22 EINVAL — even though kernel-side decode succeeded.

0002-nv15-to-p010-unpack.patch adds an inline NV15→P010 unpack
(5 bytes per 4 samples, little-endian → high-10-of-16) inside
v4l2request_transfer_data_from, exposes AV_PIX_FMT_P010 in
transfer_get_formats for that sw_format, and rejects non-P010
destinations explicitly with ENOSYS instead of silently corrupting
output via av_frame_copy on NV15-packed bytes.

Verified on fresnel (RK3399, linux-fresnel-fourier 7.0-14):
- 5-frame smoke test from issue #21 → exit 0, 13.8MB output
- 20-frame mid-fixture decode → bit-exact HW==SW
  sha256 7d9b66d48d8f17b2281da1881c663ecc31722bb218aba1ae23bf28d07aa66b08
- 8-bit baseline (bbb_60s_720p.h264.mp4) still bit-exact HW==SW (no
  regression in the existing NV12 path)
- Cross-device repro of original EINVAL on unpatched ampere (RK3588)
  pkgrel=4, confirming the bug is upstream-FFmpeg-side, not RK3399-specific

Patch is upstream-able to Kwiboo's v4l2-request-n8.1 branch.

Closes #21.
2026-05-18 08:35:19 +00:00
claude-noether d6c4260eb8 ffmpeg-v4l2-request-git → ffmpeg-v4l2-request-fourier: rename directory
PKGBUILD already renamed itself (pkgname=ffmpeg-v4l2-request-fourier,
replaces=(ffmpeg ffmpeg-v4l2-request-git)) but the containing
directory was never moved. This commit completes the rename to align
the path with the package identity and the rest of the -fourier
umbrella (libva, mpv, kwin, qt6-base, chromium, linux-*).

CI workflow path-trigger is wildcard (arch/**), unaffected. Step
names + cp source path updated to the new directory.
2026-05-17 01:04:37 +02:00