b597fc0098
PARTIAL PASS: full I-frame pipeline (IDCT + deblock) running on real
H.264 streams via daedalus-decoder's frame-major dispatch. Residual
divergence vs the libavcodec reference: 0.09% to 0.86% Y / 0.35% to
2.0% UV depending on substrate + resolution. Kernel-level off-by-one
issues remain; structurally the same family as task #179.

Architecture (verified against `dejavu` memory before coding)
-------------------------------------------------------------
- NO new libavcodec patches. Uses the existing 0016 + 0017 callback
  infrastructure.
- daedalus-decoder is the consumer-side frame-major dispatch path;
  libavcodec runs to produce the post-deblock reference. Daedalus is
  NOT substituted into libavcodec's deblock path.
- Edge derivation is a one-time spec implementation in the CLI, not a
  per-block function-pointer hijack. Different shape from the banned
  per-kernel substitution arc.

Hard re-check vs the magic word memory before any tool call (per the
user's explicit instruction "make sure no dejavu").

What changed in the CLI
-----------------------
1. avctx->skip_loop_filter dropped: libavcodec's deblock now runs and
   AVFrame is post-deblock (the new reference).
2. Per-MB callback captures pre-deblock pixels for all 3 planes
   (Y/Cb/Cr) at MB(N)'s own callback time. The capture is pure
   pre-deblock for MB(N) regardless of incremental deblock timing for
   neighbours (filter_mb runs AFTER hl_decode_mb returns, so the
   callback sees freshly decoded, pre-deblock pixels).
3. Per-MB callback also captures qp_y, mb_type_intra, transform_8x8.
   Slice-level: slice_alpha_c0_offset, slice_beta_offset,
   slice_deblocking_filter.
4. Transcribed H.264 §8.7.2 alpha_table[156], beta_table[156],
   tc0_table[156][4] from FFmpeg's h264_loopfilter.c (LGPL-2.1+
   transcription; algorithm/values normative-spec, unpatented).
5. Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52] for
   qP_Y → qP_C conversion (chroma_qp_index_offset assumed 0, which
   matches x264 default).
6. Main loop: for each MB, build daedalus_decoder_mb_input.edges from
   spec rules. 16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb + 2 V-Cr +
   2 H-Cb + 2 H-Cr). bS=4 at MB boundary, bS=3 internal, bS=0 at frame
   boundary. 8x8 DCT MBs skip internal edges at col/row 4 and 12 (only
   the 8x8-block boundary fires).
7. Daedalus's flush_frame runs IDCT-add for real-coeffs MBs + identity
   passthrough for skipped MBs, THEN dispatches the 4 deblock kernels
   (luma V/H + chroma V/H, plus their bS=4 intra variants) across the
   frame.
8. Compare daedalus output to AVFrame (post-deblock).

Subtle bug hunted: sl->deblocking_filter convention inversion
-------------------------------------------------------------
FFmpeg's h264_slice.c line 1901 does `sl->deblocking_filter ^= 1` to
invert the spec's `disable_deblocking_filter_idc` semantics. Internal
convention:
- 0 = DISABLED (was 1 in spec)
- 1 = ENABLED (was 0 in spec)
- 2 = enabled-but-not-across-slice-boundaries (unchanged)

Initial implementation treated `== 1` as "disabled" per spec semantics,
which silently skipped all edge emission (deblock_off=1) and gave the
same diff count as the no-edges baseline. Inverted to
`deblock_off = (sl->deblocking_filter == 0)`; edges then flowed and
divergence dropped 5346→438 Y diffs (92% reduction) per frame.

Results on hertz (Pi 5 V3D 7.1)
-------------------------------
testsrc2 I-only via libx264 -bf 0 -g 1:

  320×240, 5 frames, substrate=cpu:
    Y diff 2009/384000 (0.52%), UV diff 3876/192000 (2.02%)
  320×240, 5 frames, substrate=qpu:
    Y diff 3288/384000 (0.86%), UV diff 3577/192000 (1.86%)
  1920×1088, 3 frames, substrate=auto:
    Y diff 5810/6266880 (0.09%), UV diff 10921/3133440 (0.35%)

The 1080p rate is lower than QVGA's: content has fewer edges relative
to total pixels at higher resolution.

Residual divergence: root cause analysis
----------------------------------------
- CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel
  libavcodec uses). Same kernel + same alpha/beta/tc0/bS means the
  output SHOULD be identical, yet there is still a 0.52% Y diff.
- Likely cause: edge dispatch ORDER mismatch. libavcodec serialises
  per-MB (filter MB(N)'s edges, then MB(N+1)'s). Daedalus batches
  frame-wide (all V luma across the frame, then all H luma, etc.). For
  overlapping-pixel zones (e.g., MB(N)'s col 12 internal edge +
  MB(N+1)'s col 0 boundary edge both touch cols 13-15), the order
  affects the final pixel.
- QPU substrate has slightly higher divergence (0.86% Y): an
  additional kernel-level off-by-one between daedalus's V3D shader and
  the NEON reference, in the same family as task #179's chroma
  divergence.

These are kernel-level / dispatch-order issues, not CLI bugs. Task #179
is extended in scope (now includes luma + cross-MB edge ordering); root
cause investigation belongs in daedalus-fourier.

PR-A6 verifies the INFRASTRUCTURE: real coefficients flow through, real
edges are derived per spec, daedalus runs IDCT + deblock in one
frame-major dispatch, and output is within ~1% of the libavcodec
reference on real H.264 streams. Full byte-exact closure depends on the
daedalus-fourier deblock kernel/dispatch investigation.

Followups
---------
- Extend task #179 to cover luma edges + cross-MB edge ordering on
  real-stream layouts.
- PR-A4: Intra_16x16 + chroma DC Hadamard. Would also help the UV diff
  rate since currently chroma is identity-passthrough (no real chroma
  residual coefficients flowing through daedalus).
- Q3 deferred: daemon refactor in daedalus-v4l2.
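
Sketch of the sl->deblocking_filter convention from the "Subtle bug
hunted" section, for reference. This is an illustration, not the CLI's
actual code: all three helper names are hypothetical, and the `< 2`
guard is an assumption made here so that value 2 passes through
unchanged, as the mapping in that section requires.

```c
#include <assert.h>

/* Hypothetical helper: map the spec's disable_deblocking_filter_idc
 * (0 = enabled, 1 = disabled, 2 = enabled but not across slice
 * boundaries) to the internal convention described in this message
 * (0 = disabled, 1 = enabled, 2 unchanged). */
static int internal_from_spec_idc(int idc)
{
    assert(idc >= 0 && idc <= 2);
    /* Assumption: only 0 and 1 are flipped, so 2 stays 2. */
    return idc < 2 ? (idc ^ 1) : idc;
}

/* Corrected check: applied to the INTERNAL value, 0 means disabled. */
static int deblock_off_from_internal(int internal)
{
    return internal == 0;
}

/* The original bug: the internal value read with spec semantics. */
static int deblock_off_buggy(int internal)
{
    return internal == 1;
}
```

For a stream with the filter enabled (spec idc 0, internal 1), the
buggy check reports "off" while the corrected check reports "on",
which matches the silent loss of all edges seen before the fix.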
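
Sketch of the per-edge bS rule from item 6 of "What changed in the
CLI", restricted to the vertical luma edges of an intra MB. The helper
name and signature are hypothetical, and representing a skipped 8x8-DCT
edge as bS=0 is an assumption of this sketch rather than a statement
about how the CLI encodes it. Horizontal edges would follow the same
shape with mb_y in place of mb_x.

```c
#include <assert.h>

/* Hypothetical helper: bS for one vertical luma edge of an intra MB.
 * edge is 0..3, i.e. pixel columns 0, 4, 8, 12 inside the MB. */
static int luma_v_edge_bs(int mb_x, int edge, int transform_8x8)
{
    assert(mb_x >= 0);
    assert(edge >= 0 && edge <= 3);

    if (edge == 0)
        return mb_x == 0 ? 0 : 4;  /* frame boundary vs MB boundary */

    /* With the 8x8 transform, columns 4 and 12 (edges 1 and 3) are
     * not filtered; assumption: expressed here as bS = 0. */
    if (transform_8x8 && (edge & 1))
        return 0;

    return 3;                      /* internal edge */
}
```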