5565cc2bef74f1626db9dd506e5ee794abc46d53
80 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0a99b16489 |
Phase 8b: opportunistic QPU paths through public API
Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264
deblock) through the public API. These three kernels have recipe
substrate = CPU, but per Issue 003 the mixed-kernel helper value
is real — the dispatch path must exist so override-mode callers
can request QPU on the side.
Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO
alloc + memcpy + dispatch + readback). Each kernel has its own
push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc
2-field shared with lpf).
Notable bug caught + fixed in test_api_opportunistic_qpu: the
initial dispatch_mc_8h_qpu sized src_max using CPU-side reach
(src_off + 3 + 8 + 7*stride), but the QPU shader reads src[
src_off + row*stride + 0..14] for row=0..7. Last block had 3
uninitialized bytes → 99.8% match → 100% after fix.
After this commit, the public API surface fully covers cycles 1-8:
Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact
Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact
Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact
Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact
Cycle 5 (CDEF): CPU recipe; QPU override (untested in this
test — bench_v3d_cdef is the authoritative 3-way M1)
Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe)
Cycle 7 (H.264 IDCT 8x8): CPU only
Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact
Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact
comparison for VP9 MC and H.264 deblock through the API.
test_api_idct, test_api_lpf, test_api_h264 still pass.
Per the locked Phase 8 architecture (project_phase8_architecture
memory): next session opens daedalus-v4l2 sibling repo with
Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen
FFmpeg parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fd55f5ebc1 |
Phase 8 status doc — surfacing V4L2 architecture to user
Per goal "c8p3..c8p7, then p8 — surface if user intervention is required": this is the surface point. Kernel-library work is complete (cycles 1-8 all dispatchable via public API, all CPU paths bit-exact, 3 QPU paths bit-exact, 3 opportunistic-QPU paths shader-exists-API-TODO). V4L2 wrapper architecture needs 4 user decisions: - Q1: Option A (v4l2loopback) / B (kernel V4L2 shim) / C (libva direct) - Q2: Parser source: FFmpeg-vendored / dav1d+libvpx mix / FFmpeg-dlopen - Q3: In-repo or sibling repo (daedalus-v4l2)? - Q4: End-to-end test target (tiny clips / 1080p30 / both) Recommended defaults (A / γ / sibling / both) documented; explicit confirmation requested before committing to days of work that locks in months of follow-on choices. Mechanical TODOs that can proceed in parallel without blocking V4L2 decision: cycle 3+5+8 opportunistic-QPU dispatch wiring through API, or cycle 9 (H.264 luma qpel MC, predicted CPU-only per cycle 6/7 pattern). 24 commits pushed to marfrit/daedalus-fourier this autonomous arc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
af8146a2cd |
Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
373f63a910 |
Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper
Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads,
no spills). Phase 5 REDs applied:
RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write
RED-2: bench-enforced m.x >= 4*stride contract
M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON).
M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14).
Lower than prediction; H.264 deblock has 4 early-return paths +
2 conditional writes that hurt V3D branchy execution more than
expected.
M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15
(neutral).
M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock
gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s.
QPU contribution is essentially unchanged from isolation —
the cross-substrate contention is gentle (consistent with
Issue 003's V4 finding).
Verdict: H.264 deblock = opportunistic QPU helper. Same recipe
slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core
deblock capacity, available when CPU is busy with other work.
Cycles 1-8 deployment recipe complete:
Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound)
Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON)
Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock)
Phase 9 lessons added:
- Branchy kernels underperform V3D vs straight-line ones
- Mixed-kernel helper value scales with isolation M2, not
same-kernel M4
- R prediction needs branchiness weight, not just compute density
- src/v3d_h264deblock.comp (132 inst QPU shader)
- tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification)
- tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK
- CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock
+ h264dsp linked into bench_concurrent_mixed
- docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe)
Next: Phase 8 — V4L2 wrapper / deployment infra. Public API
already exposes recipe-default substrate per kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f2ba08e1cf |
Cycle 8 Phase 4: H.264 deblock QPU shader plan
Per-column dispatch (16 cols/edge, 16 edges/WG, 256 inv/WG). Follows cycle 2 LPF wd=4 template with H.264-specific adjustments: alpha/beta gating + tc0 per-4-col-segment + ap/aq side conditions for conditional p1/q1 writes. Predicted shaderdb: ~150-200 inst, 2-3 threads. Predicted R8 = 0.09-0.14 ORANGE (per Phase 3 closure). 7 Phase 5 review items flagged for Sonnet audit: - tc0 sign-extension semantics - Multiple early-return safety (no barrier follows — safe) - abs() on int operands - clamp vs clip3 equivalence - per-segment tc0 LUT extraction tradeoff - alpha=0/beta=0 outer precondition - dst_off arithmetic Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
436a5c4f74 |
Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s
M1: 10000/10000 bit-exact (after orientation fix: ff_h264_v_loop_ filter is "vertical filtering of horizontal edges", not "vertical edge"; 16 columns process the edge horizontally with 8 rows of vertical context). M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case 1080p30 floor, 30x realistic floor. Filter triggers on 25 % of edges (random alpha/beta/tc0 covers both gating paths). Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses filter DIRECTION (vertical) not edge orientation. Edge is horizontal; filter operates vertically across it. Distinct from cycle 6's column-major-block lesson but related discovery pattern. Encoded for future cycles. R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's 0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9 LPF wd=4 (cycle 2) but H.264 deblock has more per-edge branches that hurt QPU more. Worth building anyway: - ORANGE in cycle 1's "M4 may rescue" band - Mixed-kernel deployment helper value (Issue 003) matters more than isolation R - 25%-trigger rate gives 4x effective contribution multiplier on QPU side - tests/h264_deblock_ref.c (column-walking C ref per row segment) - tests/bench_neon_h264deblock.c (M1 + M3 bench) - CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S - docs/k8_h264deblock_phase3.md (closure) Next: Phase 4 plan QPU shader, Phase 5 Sonnet review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5a085e7180 |
Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored
Targets the one H.264 kernel most likely to be QPU-worthy: in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in CPU-only because H.264 transforms are NEON-trivial. H.264 deblock has analogous structure to VP9 LPF (cycles 2+4, both GREEN) so predicted R8 = ORANGE/YELLOW. This commit: - Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S (1076 lines, includes both v/h luma + chroma + intra variants + weight/biweight) - PROVENANCE.md updated with the new vendored file - Phase 1 doc captures the full plan: start with luma vertical non-intra (most common case), defer Phase 3+ to next session H.264 deblock C ref scope is ~2 hours (per-row branching, per-4-row-segment tc0, ap/aq side conditions, alpha/beta thresholds — much more complex than VP9 LPF wd=4's single-branch filter). Deferring to fresh attention next session rather than rushing now. After cycle 8 closes, the H.264 QPU surface is well-characterised and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's substrate-routing recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
db2205d0e3 |
Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred
M1: 10000/10000 bit-exact first try (column-major-block lesson from cycle 6 carried over cleanly). M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the 1080p30 floor (0.972 Mblock/s req'd). Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264 IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster NEON): VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies) H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts) Phase 4 deferred via the cycle 6 lightweight-kernel rationale: NEON per-block << QPU dispatch floor; offload doesn't help. Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target compute-heavy H.264 kernels only (deblock = cycle 8 next; MC likely RED). Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only" result. Forces a sharper view of which kernels QPU can actually help with: deblock and possibly some VP9 cases. - tests/h264_idct8_ref.c (column-major C ref) - tests/bench_neon_h264idct8.c (M1 + M3 bench) - CMakeLists.txt: cycle 7 bench wiring - docs/k7_h264idct8_phase3_and_4.md (closure) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
480f34f6e6 |
Cycle 7 (H.264 IDCT 8x8) opened — Phase 1 goal doc
Predicted R7 = 0.5-0.9 YELLOW/GREEN. Closely analogous to cycle 1 (VP9 IDCT 8x8 R=0.92 GREEN): same block size, same lane geometry, same data shape. H.264 8x8 IT uses integer butterfly with 3 sub-stages (vs cycle 1's Q14 trig single butterfly) — more compute per pass but simpler operations. Phase 1 documents: - Spec butterfly (e/f/g stages per H.264 §8.5.13) - 30fps@1080p floor = 0.972 Mblock/s (same as cycle 1 since same block density) - NEON ref = ff_h264_idct8_add_neon (already vendored in cycle 6's h264idct_neon.S) - Cycle 8-10 preview: chroma MC, luma qpel MC, in-loop deblock Phase 3 next session: write column-major C ref + bench, capture M1 + M3. Then Phase 4 plan (likely cycle-1 v3d_idct8.comp adapted to integer butterfly), Phase 5 review, Phase 6 implement, Phase 7 measure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7288473d79 |
Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU
Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and not worth building): - NEON delivers 175 Mblock/s (5.7 ns/block) on a single core - QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022 - Mixed-kernel helper contribution would be ~1-2 Mblock/s — <1% of NEON capacity - 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core. No need for QPU help. Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage. Shapes future cycle selection: prefer compute-heavy kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock). Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe = stay CPU), Phase 9 ✓. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f92dc40f43 |
Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s
H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle). Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip). Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block). Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers interleaves loading, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+). - external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines) - external/ffmpeg-snapshot/PROVENANCE.md: updated - tests/h264_idct4_ref.c: column-major C ref - tests/bench_neon_h264idct4.c: M1 + M3 bench - CMakeLists.txt: cycle 6 NEON bench wiring - docs/k6_h264idct4_phase1.md, phase3.md Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED — kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eb5cfb34c4 |
Phase 8: wire LPF wd=4 + wd=8 QPU through public API
Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1085c5699c |
Phase 8: wire IDCT QPU dispatch through public API
daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
760f6a4060 |
Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5223d3cb3f |
Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1740e7c165 |
Cycle 5 Phase 4: QPU CDEF shader plan (predicted deep RED)
Per-block stencil: 12 constrain ops per pixel, 64 pixels per block, 4 blocks/WG, 256 invocations/WG. Predicted R5 = 0.03 (deep RED) from cycle-3 MC scaling. Plan calls out 5 Phase 5 review items, notably sentinel handling and signed/unsigned min/max distinction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9c0bd72e70 |
Cycle 5 Phase 3 closed: M1 PASS via bench pointer-convention fix
The previous "layout mismatch" deferral was a one-line bench bug: NEON expects the caller to pass tmp pointing at the 8x8 block origin (after the 2*16+2 padding skip), but the bench passed the raw padded-buffer origin. C ref does the advance internally, so it filtered the correct block; NEON filtered a (+2 rows, +2 cols) shifted region. Diagonal-shift trace in the partial doc was exactly that. Fix: tmps + i*TMP_INTS + (2*TMP_W + 2) for NEON calls. Results: M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced M3: 3.809 Mblock/s (consistent with 3.923 from longer window) Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier analysis. Will build QPU CDEF anyway for cycle-completeness + V4L2 dispatch-path existence. - tests/bench_neon_cdef.c: 3-line tmp pointer fix - docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2dd774a9ab |
Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
460a6a6d08 |
Calibration: M4 same-kernel measures worst-case contention
User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.
Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.
Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
softened from 'CPU only' to 'CPU baseline; QPU helper viable in
mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
test to close the question (4 variants: bandwidth+bandwidth,
compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
feedback_m4_same_kernel_worst_case.md added — carries the
calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated
The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
- Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
- Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
bandwidth-light CPU work running concurrently (cycles 3+5,
needs Issue 003 measurement)
Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.
No code changes. Doc + memory + issue only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
20b59cd6a5 |
Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred
CDEF is the most compute-intensive kernel measured so far —
254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x
even on single NEON core in isolation.
M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact
gate failing due to tmp-layout mismatch between my standalone C
reference and dav1d's NEON expectation. The smoking gun: NEON output
appears at (+2 rows, -2 cols) shifted positions vs C ref output —
suggests NEON's padding-function output has a different convention
than my manual tmp construction.
Untangled in setup work:
- dav1d has TWO directions tables: stride-12 in src/tables.c
(C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side).
Initially vendored the C-side; should have used the NEON-side.
- dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon
(a separate function with its own conventions), not the C-side
padding() function from cdef_tmpl.c.
- Updated cdef_ref.c to use NEON-layout (stride 16) with table
transcribed from cdef_tmpl.S. Algorithm matches — but bench's
manual tmp construction doesn't match what NEON expects.
Resolution paths for next session (documented in
docs/k5_cdef_phase3_partial.md §'Resolution paths'):
1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest)
2. Vendor dav1d's full C reference (most rigorous)
3. Reverse-engineer dav1d's padding output layout (hackiest)
Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED).
CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound
kernels don't benefit from QPU offload). 30fps floor still
passes regardless.
New artifacts:
- external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored)
- external/dav1d-snapshot/config.h — 14-define asm preamble shim
- tests/cdef_ref.c — standalone C ref (algorithmically correct,
layout mismatch with NEON known)
- tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured)
- docs/k5_cdef_phase3_partial.md — phase 3 partial closure +
resumption checklist
dav1d snapshot in PROVENANCE.md should be updated next session
with the new cdef_tmpl.S entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2cd2258a7b |
Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources
First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).
Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.
Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):
src/arm/64/cdef.S 520 lines — NEON impl
src/arm/64/util.S 278 lines — NEON helpers
src/arm/asm.S 335 lines — GAS preamble
src/cdef_tmpl.c 331 lines — C reference (templated)
include/common/intops.h 84 lines — utility helpers
src/tables_cdef_subset.c hand-extracted — dav1d_cdef_directions
only (avoids dragging full 1013-line
tables.c + transitive includes)
Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
(dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
(based on pri_strength bit), sec_tap = 2 or 1. Only
dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
diff)
Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.
What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
(cdef_tmpl.c references several dav1d private headers; cleaner to
transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure
PROVENANCE.md documents pin + per-file role + re-vendoring procedure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
20e3d004ae |
Issues 001+002: defer LPF wd=16 + LPF vertical variants
Per user direction at cycle-4 close: file wd=16 (trend prediction validation) and vertical variants (column-stride TMU behaviour unknown) as local issues for future cycles. Progress instead to CDEF (AV1) for codec breadth. docs/issues/001 — wd=16 prediction validation. Per cycle 4 lesson 4, trend says wd=16 likely flips M4 negative. Quick incremental cycle when revisited. docs/issues/002 — vertical variants. Different memory access pattern (column-strided vs row-strided). The load-bearing unknown is whether the cycle 2 +6.9% mixed gain survives the TMU coalescing shift. If positive, deployment recipe gains symmetry; if negative, must split by orientation. Both issues have acceptance criteria + expected outcomes documented. Cycle 5 next: CDEF (AV1) — codec-breadth expansion. No Gitea repo exists for daedalus-fourier yet (project is local- only). If a tracker is wanted, create the repo and migrate these .md files. For now they live in-tree as part of the project history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
85feba4087 |
Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS
Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: | | wd=4 (k2) | wd=8 (k4) | | M3 NEON | 48.3 | 52.4 | | M2 QPU iso | 19.6 | 17.8 | | R iso | 0.41 | 0.34 | | M4 delta | +6.9% | +4.1% | | 30fps mixed | 7.2x | 20.3x | | Verdict | GO QPU | GO QPU | NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
356e446a49 |
Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.
Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.
Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).
Performance:
M2''' = 1.413 Mblock/s (707.9 ns/block)
M3''' = 20.997 Mblock/s (NEON baseline phase3)
R''' = 0.067 (RED band — structural mismatch)
shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills
M4''' concurrent matrix (8s windows):
NEON 1-core 14.479 Mblock/s
NEON 4-core 15.248 Mblock/s <- baseline (compute-bound,
not bandwidth-saturated
like cycles 1+2!)
QPU only 1.380 Mblock/s
MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate)
MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3%
NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.
But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):
IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core)
LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core)
MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU)
Entropy -> CPU (structurally serial)
Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.
New artifacts:
- src/v3d_mc_8h.comp — GLSL kernel
- tests/vp9_mc_ref.c — standalone C ref (REGULAR filter
embedded; clean transcription)
- tests/bench_neon_mc.c — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c — M1''' + M2''' bench with contract
asserts + 30fps margin display
- tests/bench_concurrent_mc.c — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
(hand-extracted; provides
ff_vp9_subpel_filters symbol
without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation
Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
36eca40ff2 |
Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.
Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
masked lanes consume clock anyway (finding 3, informational)
Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.
Performance:
M2'' = 19.645 Medge/s (50.9 ns/edge)
M3'' = 48.285 Medge/s (NEON baseline from phase3)
R'' = 0.41 (ORANGE band - doesn't auto-close per
cycle-1 calibration adjustment)
shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.
M4'' concurrent matrix on hertz (8s windows):
NEON-1 LPF 41.131 Medge/s
NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling
(per-core 7-9; same
bandwidth-saturation as
cycle-1 F1)
QPU only 14.299 Medge/s
MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4
MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed
The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".
Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck
New artifacts:
- src/v3d_lpf_h_4_8.comp — GLSL kernel
- tests/bench_v3d_lpf.c — M1'' + M2'' harness with
contract asserts + fm/hev
pass-rate instrumentation
- tests/bench_concurrent_lpf.c — M4'' pthread bench
(mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md — plan + review + verification
Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
be7ff5587c |
Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline
Second kernel candidate per phase7_M4.md verdict "next-kernel cycle
authorised". VP9 4-tap inner loop filter, horizontal direction,
8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline).
Different workload shape from IDCT - boundary streaming, lighter
compute per unit, per-row conditionals - tests whether QPU win
generalises.
docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules
as cycle 1 (phase1.md), with the cycle-1 calibration adjustment:
ORANGE band is no longer auto-close because M4 showed mixed > pure
CPU even at modest R when CPU bandwidth-saturates.
docs/k2_deblock_phase2.md - situation analysis. C reference already
in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched
vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin,
SHA-256 384e49e7...). PROVENANCE.md updated.
docs/k2_deblock_phase3.md - NEON baseline:
M1''_c bit-exact 100.0000 % (10000 random edges)
M3'' throughput 48.285 Medge/s (20.7 ns/edge, single A76)
per-frame 1080p-eq 748 FPS (worst case 64 530 edges/frame)
cycles/edge ~58 (=20.7ns x 2.8GHz), ~7 cycles/row
LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the
QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9
- frame-level batching amortises the same 33us dispatch overhead;
workload becomes bandwidth-bound rather than compute-bound
(~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).
New artifacts:
- tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4
inner only; clean transcription)
- tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench
(5s window, edge-content-biased RNG for
realistic fm/hev hit rates)
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
- CMakeLists.txt updated with bench_neon_lpf target
Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md
+ phase5.md + phase7.md learnings apply directly - bake in the v4
winning patterns from the start (WG=256, edges-per-subgroup
pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled
writes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8182e43c15 |
Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues
The YELLOW-band gate test from phase1.md (concurrent CPU+QPU vs
pure-CPU baseline). tests/bench_concurrent.c is a pthread harness
that runs N NEON workers (pinned to cores 0..N-1) and optionally a
QPU dispatch loop on its own pinned thread; 8s time-based windows;
sums per-worker block counts.
Raw results on hertz (1920x1088, 32640 blocks/dispatch):
Config Mblock/s 1080p FPS-eq
NEON 1-core 12.623 389.6
NEON 4-core 7.074 218.3 <- realistic CPU ceiling
QPU only 6.890 212.7
MIXED NEON-3 + QPU 7.583 234.0 <- +7.2 % over NEON-4
MIXED NEON-4 + QPU 7.739 238.9 <- +9.4 % oversubscribed
Headline findings beyond the gate test itself:
F1 — Pi 5 LPDDR4x saturates well before 4-core CPU scaling.
NEON-1 (12.6) > NEON-4 (7.1): 4 cores deliver 0.56x the
per-core throughput, not 4x. The realistic CPU ceiling for
memory-bound IDCT work is ~7 Mblock/s aggregate, not the
~32 Mblock/s a naive 4x scaling would predict. This recasts
phase7.md's R=0.92 framing: the right baseline is "4-core
NEON saturated", which the QPU effectively matches (6.89
vs 7.07) on its own.
F2 — QPU contributes meaningfully BECAUSE it doesn't fully share
the CPU's bandwidth bottleneck (own access channel + v3d L2
cache partially insulate it). Mixed adds the QPU's 0.51
Mblock/s on top of an already-saturated CPU.
F3 — Oversubscribed mode (NEON-4 + QPU) is not harmful — per-NEON
drops slightly but QPU adds more than the loss. Net +9 %.
F4 — Freed-core story is bigger than the throughput delta. In
mixed NEON-3+QPU, the 4th core is 100 % free for entropy
decode (Bool coder, ANS) which MUST run on CPU. Pure NEON-4
has nothing left. Realistic decode pipeline gets more like
a 30-50 % effective throughput uplift, not just 7 %.
Verdict per phase1.md YELLOW-band rule (mixed > pure-CPU): PASS.
Project continues to next-kernel cycle (recommend deblocking or
CDEF — same "small parallel block-level" workload class that
amortises the same M4 wins).
docs/phase7_M4.md captures the full M4 harness design, all 5
configs raw output, and the leaves-open items: M7 wall-power via
Himbeere plug, sustained-thermal test, realistic-bitstream
coefficient distribution, multi-frame async pipelining.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d66f22f333 |
Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz
First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa
v3dv compute. Five iterations through a Phase 7→Phase 4' loopback;
production kernel is v4.
New files:
- src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance,
V3D device picker, HOST_VISIBLE|COHERENT
SSBOs with mmap, compute pipeline from .spv,
enables storageBuffer{8,16}BitAccess)
- src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production:
256 invocations/WG, 2 blocks/subgroup
(no idle lanes), uint8 dst SSBO (race-free
per phase5 finding 5), unrolled writes
(no chained ternary), oob-flag pattern
(barrier-safe per phase5 finding 7)
- tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref
- docs/phase7.md — full iteration journey + decision verdict
CMakeLists.txt updated to build the new shader, library, and bench
when DAEDALUS_BUILD_VULKAN=ON.
Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3):
ver change R ns/block
v1 first-light 0.230 533
v2 kill ternary + 2-blocks-per-sg 0.474 258
v3 per-pass scope oN 0.481 254 (noise)
v4 WG 64 -> 256 invocations 0.947 129
v5 packed uint32 coeff reads 0.938 130 (noise, reverted)
v4 final N=3 0.918 +/- 0.033
Bit-exactness 100.0000% across all iterations (10000-block sample
on 128x128, 32640-block sample on 1080p) against both the C
reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON
ff_vp9_idct_idct_8x8_add_neon.
Key learning over the Phase 5 review's prediction model: the
chained ternary was NOT a spill killer on V3D 7.1 (shaderdb
showed 0:0 spills:fills even in v1). The actual lever was
workgroup-size-driven latency hiding — going from 64 to 256
invocations doubled throughput with the same compiled code
(270 inst, 2 threads, 21 max-temps, 0 spills) because the
v3dv scheduler had 4x more in-flight work to overlap TMU
latency.
Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0)
by a wide margin, near GREEN boundary. Phase 1 YELLOW rule:
add M4 (concurrent CPU+QPU throughput) before honest-close or
continue. M4 is the next measurement, not more shader tuning —
at R = 0.92 with all 4 A76 cores still 100% free for other work,
the question is whether the system aggregate beats pure 4-core
NEON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
71db72928f |
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.
Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:
RED finding 5 (dst race condition): int32_t[] dst with 4 lanes
writing to overlapping 32-bit words = non-atomic concurrent
writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
(verified exposed). Applied to phase4.md sec 5 + GLSL declaration.
RED finding 7 (early-return before barrier): if (block_idx >=
n_blocks) return; ahead of barrier() is UB by Vulkan spec.
For 1080p (32640 blocks, /4) no partial WGs; for any frame
width not /32 there are. Fix: oob flag, gate work bodies,
barrier() unconditional. Applied to phase4.md sec 4 pseudocode.
YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
subgroup ops in v1 so unaffected, but the wrong constraint
would mislead Phase 6/7. Corrected in phase0.md sec 2,
phase2.md sec 6, phase4.md sec 1 (C4).
GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.
Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."
Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dcbbc77038 |
Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (
|