PARTIAL PASS — full I-frame pipeline (IDCT + deblock) running on real
H.264 streams via daedalus-decoder's frame-major dispatch. Residual
divergence vs libavcodec reference: 0.09% to 0.86% Y / 0.35% to 2.0%
UV depending on substrate + resolution. Kernel-level off-by-one
issues remain; structurally the same family as task #179.
Architecture (verified against `dejavu` memory before coding)
-------------------------------------------------------------
- NO new libavcodec patches. Uses existing 0016 + 0017 callback
infrastructure.
- daedalus-decoder is the consumer-side frame-major dispatch path;
libavcodec runs to produce the post-deblock reference. Daedalus
is NOT substituted into libavcodec's deblock path.
- Edge derivation is a one-time spec implementation in the CLI, not
a per-block function-pointer hijack.
This is a different shape from the banned per-kernel substitution arc.
It was hard re-checked against the magic word memory before any tool
call (per the user's explicit instruction "make sure no dejavu").
What changed in the CLI
-----------------------
1. avctx->skip_loop_filter dropped — libavcodec's deblock now runs
and AVFrame is post-deblock (the new reference).
2. Per-MB callback captures pre-deblock pixels for all 3 planes
(Y/Cb/Cr) at MB(N)'s own callback time. These are pure pre-deblock
pixels for MB(N) regardless of incremental deblock timing for
neighbours: filter_mb runs AFTER hl_decode_mb returns, so the
callback sees freshly decoded, not-yet-deblocked pixels.
3. Per-MB callback also captures qp_y, mb_type_intra,
transform_8x8. Slice-level: slice_alpha_c0_offset,
slice_beta_offset, slice_deblocking_filter.
4. Transcribed H.264 §8.7.2 alpha_table[156], beta_table[156],
tc0_table[156][4] from FFmpeg's h264_loopfilter.c (LGPL-2.1+
transcription; algorithm/values normative-spec, unpatented).
5. Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52] for
qP_Y → qP_C conversion (chroma_qp_index_offset assumed 0,
which matches x264 default).
6. Main loop: for each MB, build daedalus_decoder_mb_input.edges
from spec rules. 16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb +
2 V-Cr + 2 H-Cb + 2 H-Cr). bS=4 at MB boundary, bS=3 internal,
bS=0 at frame boundary. 8x8 DCT MBs skip internal edges at
col/row 4 and 12 (only the 8x8-block boundary fires).
7. Daedalus's flush_frame runs IDCT-add for real-coeffs MBs +
identity passthrough for skipped MBs, THEN dispatches the 4
deblock kernels (luma V/H + chroma V/H, plus their bS=4 intra
variants) across the frame.
8. Compare daedalus output to AVFrame (post-deblock).
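The edge rules in item 6 can be sketched in C roughly as follows. This is a minimal illustration for intra MBs in a frame picture only. `toy_edge`, `toy_intra_bs` and `toy_build_edges` are hypothetical names; the real layout of daedalus_decoder_mb_input.edges is not shown in this message. Chroma offsets assume 4:2:0 (8x8 chroma samples per MB).

```c
#include <stdint.h>

/* Hypothetical edge descriptor, for illustration only. */
typedef struct {
    uint8_t plane;     /* 0 = Y, 1 = Cb, 2 = Cr */
    uint8_t vertical;  /* 1 = vertical edge, 0 = horizontal edge */
    uint8_t offset;    /* edge position inside the MB, in plane samples */
    uint8_t bs;        /* 0 = not filtered, 3 = internal, 4 = MB boundary */
} toy_edge;

/* bS for one edge of an intra MB: offset 0 is the MB boundary, and
 * mb_pos is mb_x for vertical edges or mb_y for horizontal edges. */
static uint8_t toy_intra_bs(int offset, int mb_pos, int deblock_off)
{
    if (deblock_off)
        return 0;
    if (offset == 0)
        return mb_pos == 0 ? 0 : 4;   /* frame boundary vs MB boundary */
    return 3;                         /* internal edge */
}

/* Emits 16 edges in the order 4 V-luma, 4 H-luma, 2 V-Cb, 2 V-Cr,
 * 2 H-Cb, 2 H-Cr. Returns the number of edges written. */
static int toy_build_edges(toy_edge out[16], int mb_x, int mb_y,
                           int transform_8x8, int deblock_off)
{
    int n = 0;
    for (int vertical = 1; vertical >= 0; vertical--) {
        int mb_pos = vertical ? mb_x : mb_y;
        for (int e = 0; e < 4; e++) {
            int off = 4 * e;
            uint8_t bs = toy_intra_bs(off, mb_pos, deblock_off);
            if (transform_8x8 && (e & 1))
                bs = 0;               /* 8x8 DCT: skip luma edges at 4 and 12 */
            out[n++] = (toy_edge){ 0, (uint8_t)vertical, (uint8_t)off, bs };
        }
    }
    for (int vertical = 1; vertical >= 0; vertical--) {
        int mb_pos = vertical ? mb_x : mb_y;
        for (int plane = 1; plane <= 2; plane++) {
            for (int e = 0; e < 2; e++) {
                int off = 4 * e;      /* chroma edges at 0 and 4 */
                uint8_t bs = toy_intra_bs(off, mb_pos, deblock_off);
                out[n++] = (toy_edge){ (uint8_t)plane, (uint8_t)vertical,
                                       (uint8_t)off, bs };
            }
        }
    }
    return n;
}
```

For the top-left MB (mb_x = mb_y = 0), both offset-0 edges of every plane get bS=0. With transform_8x8 set, only the luma edge at offset 8 remains as an internal edge.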
Subtle bug hunted: sl->deblocking_filter convention inversion
-------------------------------------------------------------
FFmpeg's h264_slice.c line 1901 does `sl->deblocking_filter ^= 1`
(applied only when the value is below 2) to invert the spec's
`disable_deblocking_filter_idc` semantics.
Internal convention:
- 0 = DISABLED (was 1 in spec)
- 1 = ENABLED (was 0 in spec)
- 2 = enabled-but-not-across-slice-boundaries (unchanged)
Initial implementation treated `== 1` as "disabled" per spec
semantics, which silently skipped all edge emission (deblock_off=1)
and gave the same diff count as the no-edges baseline. Inverted
to `deblock_off = (sl->deblocking_filter == 0)`; edges then flowed
and divergence dropped 5346→438 Y diffs (92% reduction) per frame.
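The convention mapping can be sketched as below. The helper names are hypothetical; only the `^= 1` inversion and the `== 0` check come from the text above. The `< 2` guard reflects that value 2 is listed as unchanged.

```c
/* Maps the spec's disable_deblocking_filter_idc to the internal
 * convention described above: 0 and 1 swap, 2 stays as-is. */
static int toy_internal_from_idc(int disable_deblocking_filter_idc)
{
    int v = disable_deblocking_filter_idc;
    if (v < 2)
        v ^= 1;
    return v;
}

/* Corrected check: deblocking is off only when the internal value is 0. */
static int toy_deblock_off(int internal_deblocking_filter)
{
    return internal_deblocking_filter == 0;
}
```

A stream with idc = 0 (the usual "filter everything" case) maps to internal 1, so the original `== 1` test switched deblocking off for exactly the streams that needed it.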
Results on hertz (Pi 5 V3D 7.1)
-------------------------------
testsrc2 I-only via libx264 -bf 0 -g 1:
  320×240, 5 frames, substrate=cpu:
    Y diff 2009/384000 (0.52%), UV diff 3876/192000 (2.02%)
  320×240, 5 frames, substrate=qpu:
    Y diff 3288/384000 (0.86%), UV diff 3577/192000 (1.86%)
  1920×1088, 3 frames, substrate=auto:
    Y diff 5810/6266880 (0.09%), UV diff 10921/3133440 (0.35%)
The 1080p rate is lower than QVGA's, presumably because the content has
fewer filtered edges relative to total pixels at higher resolution.
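For reference, the denominators and percentages above can be reproduced with a few lines of C. This is a sketch, not the CLI's actual comparison code: the `toy_*` names are hypothetical, and the chroma count assumes 4:2:0 subsampling, which matches the 2:1 Y/UV totals reported.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of positions where two sample buffers differ. */
static size_t toy_count_diffs(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t diffs = 0;
    for (size_t i = 0; i < n; i++)
        diffs += (a[i] != b[i]);
    return diffs;
}

/* Total luma samples over a run of frames. */
static size_t toy_luma_samples(size_t w, size_t h, size_t frames)
{
    return w * h * frames;
}

/* Total Cb + Cr samples over a run of frames, assuming 4:2:0. */
static size_t toy_chroma_samples(size_t w, size_t h, size_t frames)
{
    return 2 * (w / 2) * (h / 2) * frames;
}

/* Diff rate as a percentage of total samples. */
static double toy_pct(size_t diffs, size_t total)
{
    return 100.0 * (double)diffs / (double)total;
}
```

For example, 320×240 over 5 frames gives 384000 luma and 192000 chroma samples, and 1920×1088 over 3 frames gives 6266880 and 3133440.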
Residual divergence — root cause analysis
-----------------------------------------
- CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel
libavcodec uses). Same kernel + same alpha/beta/tc0/bS → output
SHOULD be identical, yet Y still differs by 0.52%.
- Likely cause: edge dispatch ORDER mismatch. libavcodec serialises
per-MB (filter MB(N)'s edges, then MB(N+1)'s). Daedalus batches
frame-wide (all V luma across the frame, then all H luma, etc.).
For overlapping-pixel zones (e.g., MB(N)'s col 12 internal edge
+ MB(N+1)'s col 0 boundary edge both touch cols 13-15), the
order affects the final pixel.
- QPU substrate has slightly higher divergence (0.86% Y) — additional
kernel-level off-by-one between daedalus's V3D shader and the NEON
reference, in the same family as task #179's chroma divergence.
These are kernel-level / dispatch-order issues, not CLI bugs. Task #179
extended in scope (now includes luma + cross-MB edge ordering); root
cause investigation belongs in daedalus-fourier.
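Why dispatch order can change the result is easy to show with a toy 1-D filter. The sketch below is NOT the H.264 filter: `toy_filter_edge` is a hypothetical stand-in that simply pulls samples toward the average across the edge. It keeps only the relevant geometry: an internal edge at col 12 that writes cols 10-13, and a boundary edge at col 16 (col 0 of the next MB) that writes cols 13-18, so both write col 13.

```c
#include <stdint.h>

/* Toy edge filter: pulls `reach` samples on each side of edge `e`
 * halfway toward the average of the two samples adjacent to the edge. */
static void toy_filter_edge(uint8_t *row, int e, int reach)
{
    int avg = (row[e - 1] + row[e] + 1) >> 1;   /* read before any write */
    for (int k = 1; k <= reach; k++) {
        row[e - k]     = (uint8_t)((row[e - k] + avg + 1) >> 1);
        row[e + k - 1] = (uint8_t)((row[e + k - 1] + avg + 1) >> 1);
    }
}

/* Filters one 20-sample row in either order and returns the final
 * value of col 13, the sample both edges write. */
static int toy_col13_after(int internal_edge_first)
{
    uint8_t row[20];
    for (int i = 0; i < 20; i++)
        row[i] = (uint8_t)(i < 12 ? 0 : (i < 16 ? 100 : 200));

    if (internal_edge_first) {
        toy_filter_edge(row, 12, 2);   /* internal edge, writes cols 10-13 */
        toy_filter_edge(row, 16, 3);   /* next MB's boundary, writes cols 13-18 */
    } else {
        toy_filter_edge(row, 16, 3);
        toy_filter_edge(row, 12, 2);
    }
    return row[13];
}
```

With this input, filtering the internal edge first leaves col 13 at 113, while filtering the boundary edge first leaves it at 88. Any dispatch scheme that does not reproduce the reference order on overlapping samples can diverge in the same way, even with identical kernels and parameters.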
PR-A6 verifies the INFRASTRUCTURE: real coefficients flow through, real
edges are derived per spec, daedalus runs IDCT + deblock in one
frame-major dispatch, and output is within ~1% (Y) / ~2% (UV) of the
libavcodec reference on real H.264 streams. Full byte-exact closure
depends on the daedalus-fourier deblock kernel/dispatch investigation.
Followups
---------
- Extend task #179 to cover luma edges + cross-MB edge ordering on
real-stream layouts.
- PR-A4: Intra_16x16 + chroma DC Hadamard. Would also help the UV
diff rate since currently chroma is identity-passthrough (no real
chroma residual coefficients flowing through daedalus).
- Q3 deferred: daemon refactor in daedalus-v4l2.