PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams #16

2026-05-26T09:54:00Z

marfrit commented

2026-05-26 09:54:00 +00:00

PARTIAL PASS — full I-frame pipeline (IDCT + deblock) now runs on real H.264 streams via daedalus-decoder's frame-major dispatch. Residual divergence vs libavcodec reference: 0.09%–0.86% Y / 0.35%–2.0% UV depending on substrate + resolution. Kernel-level / dispatch-order off-by-one issues remain — same family as task #179.

Architecture (verified vs `dejavu` memory before coding)

NO new libavcodec patches. Uses existing 0016 + 0017 callback infrastructure.
daedalus-decoder is the consumer-side frame-major dispatch path; libavcodec runs to produce the post-deblock reference. Daedalus is not substituted into libavcodec's deblock path.
Edge derivation is a one-time spec implementation in the CLI, not a per-block function-pointer hijack.

Different shape from the banned per-kernel substitution arc. Re-checked the dejavu memory and frame_major_uma_verdict memory before any tool call (per user's explicit instruction).

What changed in the CLI

Dropped avctx->skip_loop_filter = AVDISCARD_ALL — libavcodec's deblock now runs; AVFrame is post-deblock.
Per-MB callback captures pre-deblock pixels for all 3 planes at MB(N)'s own callback time (pure pre-deblock for MB(N)).
Per-MB callback also captures qp_y, mb_type_intra, transform_8x8. Slice-level: slice_alpha_c0_offset, slice_beta_offset, slice_deblocking_filter.
Transcribed §8.7.2 alpha_table[156], beta_table[156], tc0_table[156][4] from FFmpeg's h264_loopfilter.c.
Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52].
Main loop: per MB, build daedalus_decoder_mb_input.edges from spec rules. 16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb + 2 V-Cr + 2 H-Cb + 2 H-Cr). bS=4 at MB boundary, bS=3 internal, bS=0 at frame boundary. 8×8 DCT MBs skip cols/rows 4 and 12 internal edges (only the 8×8-block boundary fires).
Daedalus's flush_frame runs IDCT-add for real-coeffs MBs + identity passthrough for skipped MBs, then dispatches the 4 deblock kernels (luma V/H + chroma V/H, plus their bS=4 intra variants) across the frame.

Subtle bug hunted: `sl->deblocking_filter` convention inversion

FFmpeg's h264_slice.c line 1901: sl->deblocking_filter ^= 1. Inverts the spec's disable_deblocking_filter_idc semantics. Internal:

0 = DISABLED (spec=1)
1 = ENABLED (spec=0)
2 = enabled-but-not-across-slice-boundaries (unchanged)

First implementation treated == 1 as "disabled" per spec semantics → silently skipped all edge emission → diff count identical to no-edges baseline. Inverting to deblock_off = (sl->deblocking_filter == 0) dropped diffs 5346 → 438 Y per frame (92% reduction).

Results on hertz (Pi 5 V3D 7.1)

testsrc2 I-only via libx264 -bf 0 -g 1:

dims	frames	substrate	Y diff	UV diff
320×240	5	cpu	2009/384000 (0.52%)	3876/192000 (2.02%)
320×240	5	qpu	3288/384000 (0.86%)	3577/192000 (1.86%)
320×240	5	auto	3288/384000 (0.86%)	3577/192000 (1.86%)
1920×1088	3	auto	5810/6266880 (0.09%)	10921/3133440 (0.35%)

Residual divergence — root cause analysis

CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel libavcodec uses). Same kernel + same alpha/beta/tc0/bS → output SHOULD be identical. But still 0.52% Y diff.
Most likely cause: edge dispatch order mismatch. libavcodec serialises per-MB (filter MB(N)'s edges, then MB(N+1)'s). Daedalus batches frame-wide (all V luma across the frame, then all H luma, etc.). For overlapping-pixel zones — e.g., MB(N)'s col-12 internal edge and MB(N+1)'s col-0 boundary edge both touch cols 13–15 — the order affects the final pixel.
QPU substrate has slightly higher divergence (0.86% Y) — additional kernel-level off-by-one between daedalus's V3D shader and the NEON reference. Same family as task #179.

These are kernel-level / dispatch-order issues, not CLI bugs. Task #179 extended in scope (now includes luma + cross-MB edge ordering on real-stream layouts); root cause investigation belongs in daedalus-fourier.

Honest framing

This is a partial-pass delivery. The infrastructure works (real coefficients + real edges flowing through daedalus-decoder's frame-major dispatch on real H.264 streams) and output is within ~1% of reference. Full byte-exact closure depends on the daedalus-fourier deblock kernel / dispatch-order investigation.

Per the user's "correctness before speed" principle, this is clearly called out as partial — not pretended to be byte-exact.

Followups

Extend task #179 to cover luma edges + cross-MB edge ordering on real-stream layouts.
PR-A4: Intra_16x16 + chroma DC Hadamard. Would also lift the UV diff rate (currently chroma is identity-passthrough = no real chroma residual coefficients flowing through daedalus).

**PARTIAL PASS** — full I-frame pipeline (IDCT + deblock) now runs on real H.264 streams via daedalus-decoder's frame-major dispatch. Residual divergence vs libavcodec reference: **0.09%–0.86% Y / 0.35%–2.0% UV** depending on substrate + resolution. Kernel-level / dispatch-order off-by-one issues remain — same family as task #179. ## Architecture (verified vs `dejavu` memory before coding) - **NO new libavcodec patches.** Uses existing 0016 + 0017 callback infrastructure. - daedalus-decoder is the consumer-side frame-major dispatch path; libavcodec runs to produce the post-deblock reference. Daedalus is **not** substituted into libavcodec's deblock path. - Edge derivation is a one-time spec implementation in the CLI, not a per-block function-pointer hijack. Different shape from the banned per-kernel substitution arc. Re-checked the `dejavu` memory and `frame_major_uma_verdict` memory before any tool call (per user's explicit instruction). ## What changed in the CLI 1. Dropped `avctx->skip_loop_filter = AVDISCARD_ALL` — libavcodec's deblock now runs; AVFrame is post-deblock. 2. Per-MB callback captures pre-deblock pixels for all 3 planes at MB(N)'s own callback time (pure pre-deblock for MB(N)). 3. Per-MB callback also captures `qp_y`, `mb_type_intra`, `transform_8x8`. Slice-level: `slice_alpha_c0_offset`, `slice_beta_offset`, `slice_deblocking_filter`. 4. Transcribed §8.7.2 `alpha_table[156]`, `beta_table[156]`, `tc0_table[156][4]` from FFmpeg's `h264_loopfilter.c`. 5. Transcribed §8.5.11 / Table 8-11 `chroma_qp_table[52]`. 6. Main loop: per MB, build `daedalus_decoder_mb_input.edges` from spec rules. 16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb + 2 V-Cr + 2 H-Cb + 2 H-Cr). bS=4 at MB boundary, bS=3 internal, bS=0 at frame boundary. 8×8 DCT MBs skip cols/rows 4 and 12 internal edges (only the 8×8-block boundary fires). 7. Daedalus's `flush_frame` runs IDCT-add for real-coeffs MBs + identity passthrough for skipped MBs, **then** dispatches the 4 deblock kernels (luma V/H + chroma V/H, plus their bS=4 intra variants) across the frame. ## Subtle bug hunted: `sl->deblocking_filter` convention inversion FFmpeg's `h264_slice.c` line 1901: `sl->deblocking_filter ^= 1`. Inverts the spec's `disable_deblocking_filter_idc` semantics. Internal: - `0` = DISABLED (spec=1) - `1` = ENABLED (spec=0) - `2` = enabled-but-not-across-slice-boundaries (unchanged) First implementation treated `== 1` as "disabled" per spec semantics → silently skipped all edge emission → diff count identical to no-edges baseline. Inverting to `deblock_off = (sl->deblocking_filter == 0)` dropped diffs **5346 → 438 Y per frame (92% reduction)**. ## Results on hertz (Pi 5 V3D 7.1) testsrc2 I-only via `libx264 -bf 0 -g 1`: | dims | frames | substrate | Y diff | UV diff | |---|---|---|---|---| | 320×240 | 5 | cpu | 2009/384000 (0.52%) | 3876/192000 (2.02%) | | 320×240 | 5 | qpu | 3288/384000 (0.86%) | 3577/192000 (1.86%) | | 320×240 | 5 | auto | 3288/384000 (0.86%) | 3577/192000 (1.86%) | | 1920×1088 | 3 | auto | 5810/6266880 (0.09%) | 10921/3133440 (0.35%) | ## Residual divergence — root cause analysis - **CPU substrate** uses `ff_h264_*_loop_filter_*_neon` (same kernel libavcodec uses). Same kernel + same `alpha`/`beta`/`tc0`/`bS` → output SHOULD be identical. But still 0.52% Y diff. - **Most likely cause: edge dispatch order mismatch.** libavcodec serialises per-MB (filter MB(N)'s edges, then MB(N+1)'s). Daedalus batches frame-wide (all V luma across the frame, then all H luma, etc.). For overlapping-pixel zones — e.g., MB(N)'s col-12 internal edge and MB(N+1)'s col-0 boundary edge both touch cols 13–15 — the order affects the final pixel. - **QPU substrate** has slightly higher divergence (0.86% Y) — additional kernel-level off-by-one between daedalus's V3D shader and the NEON reference. Same family as task #179. These are kernel-level / dispatch-order issues, not CLI bugs. Task #179 extended in scope (now includes luma + cross-MB edge ordering on real-stream layouts); root cause investigation belongs in daedalus-fourier. ## Honest framing This is a **partial-pass** delivery. The infrastructure works (real coefficients + real edges flowing through daedalus-decoder's frame-major dispatch on real H.264 streams) and output is within ~1% of reference. Full byte-exact closure depends on the daedalus-fourier deblock kernel / dispatch-order investigation. Per the user's "correctness before speed" principle, this is clearly called out as partial — not pretended to be byte-exact. ## Followups - Extend task #179 to cover luma edges + cross-MB edge ordering on real-stream layouts. - PR-A4: Intra_16x16 + chroma DC Hadamard. Would also lift the UV diff rate (currently chroma is identity-passthrough = no real chroma residual coefficients flowing through daedalus).

marfrit added 1 commit 2026-05-26 09:54:00 +00:00

PR-A6: enable libavcodec deblock + drive daedalus deblock on real streams b597fc0098

PARTIAL PASS — full I-frame pipeline (IDCT + deblock) running on real
H.264 streams via daedalus-decoder's frame-major dispatch.  Residual
divergence vs libavcodec reference: 0.09% to 0.86% Y / 0.35% to 2.0%
UV depending on substrate + resolution.  Kernel-level off-by-one
issues remain; structurally same family as task #179.

Architecture (verified against `dejavu` memory before coding)
-------------------------------------------------------------

  - NO new libavcodec patches.  Uses existing 0016 + 0017 callback
    infrastructure.
  - daedalus-decoder is the consumer-side frame-major dispatch path;
    libavcodec runs to produce the post-deblock reference.  Daedalus
    is NOT substituted into libavcodec's deblock path.
  - Edge derivation is a one-time spec implementation in the CLI, not
    a per-block function-pointer hijack.

Different shape from the banned per-kernel substitution arc.  Hard
re-check vs the magic word memory before any tool call (per the user's
explicit instruction "make sure no dejavu").

What changed in the CLI
-----------------------

  1. avctx->skip_loop_filter dropped — libavcodec's deblock now runs
     and AVFrame is post-deblock (the new reference).
  2. Per-MB callback captures pre-deblock pixels for all 3 planes
     (Y/Cb/Cr) at MB(N)'s own callback time — pure pre-deblock for
     MB(N) regardless of incremental deblock timing for neighbours
     (filter_mb runs AFTER hl_decode_mb returns, so callback sees
     fresh-decoded fresh-pre-deblock pixels).
  3. Per-MB callback also captures qp_y, mb_type_intra,
     transform_8x8.  Slice-level: slice_alpha_c0_offset,
     slice_beta_offset, slice_deblocking_filter.
  4. Transcribed H.264 §8.7.2 alpha_table[156], beta_table[156],
     tc0_table[156][4] from FFmpeg's h264_loopfilter.c (LGPL-2.1+
     transcription; algorithm/values normative-spec, unpatented).
  5. Transcribed §8.5.11 / Table 8-11 chroma_qp_table[52] for
     qP_Y → qP_C conversion (chroma_qp_index_offset assumed 0,
     which matches x264 default).
  6. Main loop: for each MB, build daedalus_decoder_mb_input.edges
     from spec rules.  16 edges/MB (4 V-luma + 4 H-luma + 2 V-Cb +
     2 V-Cr + 2 H-Cb + 2 H-Cr).  bS=4 at MB boundary, bS=3 internal,
     bS=0 at frame boundary.  8x8 DCT MBs skip internal edges at
     col/row 4 and 12 (only the 8x8-block boundary fires).
  7. Daedalus's flush_frame runs IDCT-add for real-coeffs MBs +
     identity passthrough for skipped MBs, THEN dispatches the 4
     deblock kernels (luma V/H + chroma V/H, plus their bS=4 intra
     variants) across the frame.
  8. Compare daedalus output to AVFrame (post-deblock).

Subtle bug hunted: sl->deblocking_filter convention inversion
-------------------------------------------------------------

FFmpeg's h264_slice.c line 1901 does `sl->deblocking_filter ^= 1`
to invert the spec's `disable_deblocking_filter_idc` semantics.
Internal convention:
  - 0 = DISABLED (was 1 in spec)
  - 1 = ENABLED  (was 0 in spec)
  - 2 = enabled-but-not-across-slice-boundaries (unchanged)

Initial implementation treated `== 1` as "disabled" per spec
semantics, which silently skipped all edge emission (deblock_off=1)
and gave the same diff count as the no-edges baseline.  Inverted
to `deblock_off = (sl->deblocking_filter == 0)`; edges then flowed
and divergence dropped 5346→438 Y diffs (92% reduction) per frame.

Results on hertz (Pi 5 V3D 7.1)
-------------------------------

testsrc2 I-only via libx264 -bf 0 -g 1:

  320×240, 5 frames, substrate=cpu:
    Y diff 2009/384000 (0.52%), UV diff 3876/192000 (2.02%)
  320×240, 5 frames, substrate=qpu:
    Y diff 3288/384000 (0.86%), UV diff 3577/192000 (1.86%)
  1920×1088, 3 frames, substrate=auto:
    Y diff 5810/6266880 (0.09%), UV diff 10921/3133440 (0.35%)

The 1080p rate is lower than QVGA's — content has fewer edges relative
to total pixels at higher resolution.

Residual divergence — root cause analysis
-----------------------------------------

  - CPU substrate uses ff_h264_*_loop_filter_*_neon (same kernel
    libavcodec uses).  Same kernel + same alpha/beta/tc0/bS → output
    SHOULD be identical.  But still 0.52% Y diff.
  - Likely cause: edge dispatch ORDER mismatch.  libavcodec serialises
    per-MB (filter MB(N)'s edges, then MB(N+1)'s).  Daedalus batches
    frame-wide (all V luma across the frame, then all H luma, etc.).
    For overlapping-pixel zones (e.g., MB(N)'s col 12 internal edge
    + MB(N+1)'s col 0 boundary edge both touch cols 13-15), the
    order affects the final pixel.
  - QPU substrate has slightly higher divergence (0.86% Y) — additional
    kernel-level off-by-one between daedalus's V3D shader and the NEON
    reference, in the same family as task #179's chroma divergence.

These are kernel-level / dispatch-order issues, not CLI bugs.  Task #179
extended in scope (now includes luma + cross-MB edge ordering); root
cause investigation belongs in daedalus-fourier.

PR-A6 verifies the INFRASTRUCTURE: real coefficients flow through, real
edges are derived per spec, daedalus runs IDCT + deblock in one frame-
major dispatch, output is within ~1% of libavcodec reference on real
H.264 streams.  Full byte-exact closure depends on the daedalus-fourier
deblock kernel/dispatch investigation.

Followups
---------

  - Extend task #179 to cover luma edges + cross-MB edge ordering on
    real-stream layouts.
  - PR-A4: Intra_16x16 + chroma DC Hadamard.  Would also help the UV
    diff rate since currently chroma is identity-passthrough (no real
    chroma residual coefficients flowing through daedalus).
  - Q3 deferred: daemon refactor in daedalus-v4l2.

marfrit merged commit 9061350e82 into main

2026-05-26 10:12:31 +00:00

marfrit deleted branch noether/tools-h264-deblock-validation

2026-05-26 10:12:31 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-decoder#16