Commit Graph

15 Commits

Author SHA1 Message Date
claude-noether 65bd5c3fe3 cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch)
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.1 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).

Remaining cycle-6/7/9 work in task #165:
  - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
    not a transform)

This commit lands the cycle-6 piece of task #165.
2026-05-23 20:06:20 +02:00
claude-noether 737e87980d QPU is default substrate: recipe table + ctx env-var override
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
   every kernel that has a shipped V3D compute shader:

    cycle 1 VP9 IDCT 8x8         QPU  (was QPU; unchanged)
    cycle 2 VP9 LPF wd=4         QPU  (was QPU; unchanged)
    cycle 3 VP9 MC 8h            QPU  (FLIPPED from CPU — v3d_mc_8h.spv)
    cycle 4 VP9 LPF wd=8         QPU  (was QPU; unchanged)
    cycle 5 AV1 CDEF 8x8         QPU  (FLIPPED from CPU — v3d_cdef.spv)
    cycle 6 H.264 IDCT 4x4       CPU  (no shader yet; task #165)
    cycle 7 H.264 IDCT 8x8       CPU  (no shader yet; task #165)
    cycle 8 H.264 deblock luma-v QPU  (FLIPPED from CPU — v3d_h264deblock.spv)
    cycle 9 H.264 qpel mc20      CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now
   superseded for substrate selection by the decree.  Where R
   stays RED, the cost is in dispatch overhead, which is a
   fixable engineering issue (tasks 160 buffer-pool, 161
   persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override:
   set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
   silently escalates to a full daedalus_ctx_create().  Lets
   the libavcodec substitution shims in marfrit-packages (which
   pthread_once a create_no_qpu ctx — see
   libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
   without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by
   default (env var unset).  The daedalus-v4l2 daemon will set
   DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
   (separate daedalus-v4l2 follow-up).

Smoke (hertz, Pi 5, kernel 6.18.29):

  === test_api_h264 ===
    H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2     ← flipped
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact

  === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
  === test_api_lpf  === all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.

Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
2026-05-23 19:59:53 +02:00
claude-noether 8fdef27a7d Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:25:24 +02:00
marfrit 0a99b16489 Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.

Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).

Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.

After this commit, the public API surface fully covers cycles 1-8:
  Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
  Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
  Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
  Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
  Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
    test — bench_v3d_cdef is the authoritative 3-way M1)
  Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
  Cycle 7 (H.264 IDCT 8x8): CPU only
  Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact

Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.

Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:50:41 +00:00
marfrit af8146a2cd Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4,
IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU
(matches per-cycle Phase 7 verdicts).

src/daedalus_core.c: NEON-path implementations + recipe routing.
daedalus_core library now links the full FFmpeg H.264 NEON
snapshot (h264idct + h264dsp) plus existing VP9 + dav1d.

tests/test_api_h264.c: smoke test covering all 3 H.264 kernels
via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact.

Public API coverage after this commit:
- Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch
  (test_api_idct, test_api_lpf, both pass)
- Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1)
- Cycle 5 CDEF: CPU only (QPU stub)
- Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired)
- Cycle 7 H.264 IDCT 8x8: CPU only
- Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired
  through API yet; bench_v3d_h264deblock exists for direct test)

Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles
3+5+8 through the API (so override-mode users can request QPU).
Then surface V4L2-wrapper architecture decisions to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:46:03 +00:00
marfrit 373f63a910 Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
  RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
  RED-2: bench-enforced m.x >= 4*stride contract

M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
    Lower than prediction; H.264 deblock has 4 early-return paths +
    2 conditional writes that hurt V3D branchy execution more than
    expected.

M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
  (neutral).

M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
  gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
  QPU contribution is essentially unchanged from isolation —
  the cross-substrate contention is gentle (consistent with
  Issue 003's V4 finding).

Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.

Cycles 1-8 deployment recipe complete:
  Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
  Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
  Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)

Phase 9 lessons added:
  - Branchy kernels underperform V3D vs straight-line ones
  - Mixed-kernel helper value scales with isolation M2, not
    same-kernel M4
  - R prediction needs branchiness weight, not just compute density

- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
  + h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)

Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:44:21 +00:00
marfrit eb5cfb34c4 Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc +
dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8).

Important caught-empirically bug: the two LPF shaders disagree
on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1,
wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1).
Initial single-struct attempt silently corrupted wd=8 output
(1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping
separate lpf4_pc and lpf8_pc struct definitions.

dst-window calc handles both kernels (same -4..+3 byte footprint
per row).

test_api_lpf exercises both kernels in CPU / QPU / AUTO modes
against the C reference. All 6 mode/kernel combinations pass
2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge).

Phase 8 status after this commit: 3 of 5 kernels wired through
API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3
QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF
still need wiring for opportunistic-override mode but aren't
needed for recipe-AUTO path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:57:25 +00:00
marfrit 1085c5699c Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The
public API's dispatch_vp9_idct8 routes QPU calls through a
new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1
v4 pipeline on first use, (2) allocates 3 host-visible SSBOs
per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch
with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host.

Per-call alloc is intentional for Phase 8 correctness-first
scope; buffer-pool perf optimization is deferred.

Added daedalus_ctx_create_no_qpu() for fast-path callers that
know they want CPU only.

test_api_idct extended to a 3-mode matrix: CPU forced, QPU
forced, AUTO recipe. All three deliver 4096/4096 bit-exact
on hertz with V3D 7.1.7.0:

  recipe substrate for VP9_IDCT8: 2 (QPU)
  [CPU] 4096/4096 bit-exact
  [QPU] 4096/4096 bit-exact (real QPU dispatch through the API)
  [AUTO] 4096/4096 bit-exact (recipe routes to QPU)

Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4
and cycle 4 LPF wd=8 (the other two recipe-QPU kernels).
Cycle 3 MC and cycle 5 CDEF only need the dispatch hook
(recipe routes to CPU; QPU stays opportunistic via explicit
override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:55:55 +00:00
marfrit 760f6a4060 Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:54:43 +00:00
marfrit 5223d3cb3f Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
  - meta layout: m.z=tmp_off, m.w=dir
  - sec_shift clamped to >=0 (NEON uqsub semantics)
  - directions table as const ivec2[14], not OR-packed

Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.

M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
  QPU 0.42 Mblock/s CDEF helper. CPU side higher than the
  Issue 003 NEON-fallback proxy suggested - cross-substrate
  contention is gentler than same-side NEON contention.

Verdict: CDEF stays on CPU; QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real-substrate-cross contention < NEON-proxy
contention.

- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00
marfrit 85feba4087 Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8
h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9
inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one
combined go (cycle compressed since incremental from cycle 2).

Phase 5 review explicitly skipped (incremental ~30-line shader
delta from cycle 2 + same geometry + cycle-2 RED-pattern checks
still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md
"Skipping phases is a deliberate choice that should be flagged."

Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536)
first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills,
27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling.

Performance:

  M3'''' NEON (single-core)  52.382 Medge/s
  M2'''' QPU isolation       17.847 Medge/s
  R''''                      0.341  → ORANGE band
  30fps floor margin         9.2x (isolation), 20.3x (mixed)

M4'''' concurrent matrix:

  NEON 4-core                37.823 Medge/s  <- baseline
  QPU only                   14.867 Medge/s
  MIXED NEON-3 + QPU         39.389 Medge/s  <- +4.1% PASS

Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside
cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete.

Cross-cycle LPF comparison:

  | | wd=4 (k2) | wd=8 (k4) |
  | M3 NEON     | 48.3      | 52.4      |
  | M2 QPU iso  | 19.6      | 17.8      |
  | R iso       | 0.41      | 0.34      |
  | M4 delta    | +6.9%     | +4.1%     |
  | 30fps mixed | 7.2x      | 20.3x     |
  | Verdict     | GO QPU    | GO QPU    |

NEW finding (Phase 9 lesson): NEON gets faster per edge as filter
width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss
grows with width. wd=16 would probably flip negative based on the
trend line.

Deployment recipe with cycle 4:

  IDCT 8x8 (k1)     -> QPU   (R=0.92, +7% mixed)
  LPF wd=4 (k2)     -> QPU   (R=0.41, +7% mixed)
  LPF wd=8 (k4)     -> QPU   (R=0.34, +4% mixed)
  MC 8h (k3)        -> CPU   (R=0.067, -19% mixed)
  Entropy           -> CPU   (structural)

VP9 inner-edge LPF coverage complete. Project continues to higgs
deployment plumbing or further kernels per user direction.

New artifacts:
- src/v3d_lpf_h_8_8.comp                 — GLSL shader
- tests/vp9_lpf8_ref.c                   — standalone C ref
- tests/bench_neon_lpf8.c                — M1+M3 bench
- tests/bench_v3d_lpf8.c                 — M1+M2 bench
- tests/bench_concurrent_lpf8.c          — M4 pthread bench
- docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:56:25 +00:00
marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00
marfrit 36eca40ff2 Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:

  M2'' = 19.645 Medge/s     (50.9 ns/edge)
  M3'' = 48.285 Medge/s     (NEON baseline from phase3)
  R''  = 0.41               (ORANGE band - doesn't auto-close per
                             cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):

  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                          (per-core 7-9; same
                                          bandwidth-saturation as
                                          cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp                — GLSL kernel
- tests/bench_v3d_lpf.c                 — M1'' + M2'' harness with
                                          contract asserts + fm/hev
                                          pass-rate instrumentation
- tests/bench_concurrent_lpf.c          — M4'' pthread bench
                                          (mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md       — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00
marfrit d66f22f333 Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.

New files:
- src/v3d_runner.{c,h}  — reusable Vulkan compute plumbing (instance,
                          V3D device picker, HOST_VISIBLE|COHERENT
                          SSBOs with mmap, compute pipeline from .spv,
                          enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp    — VP9 8x8 DCT_DCT IDCT add, v4 production:
                          256 invocations/WG, 2 blocks/subgroup
                          (no idle lanes), uint8 dst SSBO (race-free
                          per phase5 finding 5), unrolled writes
                          (no chained ternary), oob-flag pattern
                          (barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md         — full iteration journey + decision verdict

CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.

Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):

  ver  change                              R       ns/block
  v1   first-light                         0.230   533
  v2   kill ternary + 2-blocks-per-sg      0.474   258
  v3   per-pass scope oN                   0.481   254  (noise)
  v4   WG 64 -> 256 invocations            0.947   129
  v5   packed uint32 coeff reads           0.938   130  (noise, reverted)
  v4 final N=3                             0.918 +/- 0.033

Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.

Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.

Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:09:00 +00:00
marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00