daedalus-fourier

Author	SHA1	Message	Date
claude-noether	818e71560e	gitignore: exclude .claude/ runtime files The previous commit unintentionally added .claude/scheduled_tasks.lock which is an agent-runtime artefact, not source. Untrack it and add .claude/ to .gitignore so it stays out of future commits.	2026-05-24 23:29:06 +02:00
claude-noether	9d5451e0fe	h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v. Scope: - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper. - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry. - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation). - C reference: tests/h264_h_loop_filter_luma_ref.c — the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] accesses cols instead of rows*stride. - Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0; compares NEON dispatch against reference byte-for-byte. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 5 kernels bit-exact PASS. The new H variant joins the suite with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline) a frame's worth at 1080p is ~8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the 30 fps budget. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows). Backlog addition (out of scope for this PR): - V3D shader for the H variant (mirror of v3d_h264deblock.spv). - bS=4 intra-strength filter (different algebra; both _v and _h). - Chroma deblock luma_v/_h (8-cell variants).	2026-05-24 23:28:56 +02:00
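The budget estimate in the commit above can be re-derived with a few lines of C. This is only a sketch: the function names are hypothetical, and it assumes a 1920x1088 coded frame (1080p padded to whole 16x16 macroblocks), 4 edges per macroblock, and the ~6 ns/edge figure quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 16x16 macroblocks in a frame whose dimensions are MB-aligned. */
uint32_t mb_count(uint32_t width, uint32_t height)
{
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

/* Estimated CPU time in nanoseconds to filter every H edge of one frame,
 * given a per-edge cost (an assumed constant, not a measured distribution). */
uint64_t h_edge_cost_ns(uint32_t mbs, uint32_t edges_per_mb, uint32_t ns_per_edge)
{
    return (uint64_t)mbs * edges_per_mb * ns_per_edge;
}
```

With these inputs, `mb_count(1920, 1088)` gives 8160 macroblocks and `h_edge_cost_ns(8160, 4, 6)` gives 195 840 ns, roughly 0.2 ms against a frame period of about 33.3 ms at 30 fps.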
marfrit	0d54d68f38	Merge pull request 'cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage' (#8) from noether/v3d-shader-h264-qpel-mc20 into main Reviewed-on: #8	2026-05-23 19:14:44 +00:00
claude-noether	79553c6e22	cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage Closes the QPU-default substrate campaign per the 2026-05-23 decree: every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal half-pel luma motion compensation, H.264 §8.4.2.2.1. Shader (src/v3d_h264_qpel_mc20.comp): - local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout that avoids any inter-lane communication — V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's H264QpelContext (dst and src share `stride`; src+src_off points at output col 0 with the caller-guaranteed -2/+3 padding). Dispatch wiring (src/daedalus_core.c): - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init. - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), src_max = src_off + 7*stride + 11 (covers the +3-col read footprint on the last row), dst_max = dst_off + 7*stride + 8. 1 block per WG. - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY with the substrate-switch pattern matching the other H.264 kernels. - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU. Verification (hertz, Pi 5, V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper.
The filter reads src_off + 7*stride + (7 + 3) = +10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. All 9 daedalus-fourier cycles now QPU-by-recipe: cycle 1 VP9 IDCT 8x8 QPU cycle 2 VP9 LPF wd=4 QPU cycle 3 VP9 MC 8h QPU cycle 4 VP9 LPF wd=8 QPU cycle 5 AV1 CDEF 8x8 QPU cycle 6 H.264 IDCT 4x4 QPU cycle 7 H.264 IDCT 8x8 QPU cycle 8 H.264 deblock luma-v QPU cycle 9 H.264 qpel mc20 QPU ← this commit Closes daedalus-fourier task #165. Per the decree memory [QPU is default substrate], the prototype now demonstrates GPU acceleration on every measured kernel.	2026-05-23 21:05:36 +02:00
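The per-pixel filter and the source-footprint arithmetic described in the cycle 9 commit can be sketched in plain C as follows. This is an illustrative scalar version, not the shader or the project's reference file; the function names are hypothetical, and the only facts taken from the message are the (1, -5, 20, 20, -5, 1) / 32 taps, the +16 rounding, the clip to u8, the shared stride, and the `src_off + 7*stride + 11` bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round, scale by 1/32 and clip a 6-tap sum to the 0..255 sample range.
 * Negative sums are clipped before shifting so no negative value is shifted. */
uint8_t mc20_round_clip(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* One 8x8 block of horizontal half-pel interpolation.
 * src points at output column 0; the caller must guarantee 2 readable
 * samples to the left and 3 to the right of each 8-sample row. */
void qpel8_mc20_sketch(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    assert(stride >= 8);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *p = src + r * stride + c;
            int sum = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
            dst[r * stride + c] = mc20_round_clip(sum);
        }
    }
}

/* Exclusive upper bound of the source indices touched by one block:
 * last row (7*stride), last column (7), +3 taps to the right, +1 for a length. */
size_t mc20_src_max(size_t src_off, size_t stride)
{
    return src_off + 7 * stride + (7 + 3) + 1;
}
```

Because the taps sum to 32, a flat source region passes through unchanged, which makes a convenient sanity check before comparing against a real reference.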
marfrit	a092ee34aa	Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main Reviewed-on: #7	2026-05-23 18:59:34 +00:00
marfrit	c01754e849	Merge pull request 'v3d_runner: buffer pool for QPU dispatch hot path' (#6) from noether/v3d-buffer-pool into main Reviewed-on: #6	2026-05-23 18:59:18 +00:00
claude-noether	74687d9def	cycle 7: V3D shader for H.264 IDCT 8x8 Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row pass, column pass), (val + 32) >> 6 rounded + clipped + added to dst. Bit-exact first try on hertz (Pi 5, V3D 7.1): H264_IDCT4 recipe substrate: 2 (QPU) H264_IDCT8 recipe substrate: 2 (QPU) ← flipped H264_DEBLOCK_LV recipe substrate: 2 (QPU) H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact 8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9 (H.264 luma qpel mc20) still CPU — different shape (6-tap MC filter, not a transform) so needs its own shader template; task #165 covers it as a follow-on. Same pattern as cycle 6 commit (`65bd5c3`): adds h264_idct8_pipe field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs, v3d_h264_idct8.spv install rule. Uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands).	2026-05-23 20:09:25 +02:00
claude-noether	65bd5c3fe3	cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch) Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264 IDCT 4x4 + add) was the highest-priority H.264 kernel to flip from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8 (cycle 1) — two-pass butterfly with shared-memory transpose — but at 4x4 scale: 4 lanes per block, 16 blocks per WG. What's added: - src/v3d_h264_idct4.comp: GLSL compute shader implementing the H.264 §8.5.12.1 1D butterfly twice (row pass then column pass), with (val + 32) >> 6 rounding and clip-to-u8 add to dst. Block memory layout is column-major (matches FFmpeg `ff_h264_idct_add_neon` convention). - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv. - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant (n_blocks, dst_stride), 16 blocks per WG dispatch. Matches the existing dispatch_*_qpu patterns; uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands). - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with the same CPU/QPU substrate switch the deblock dispatch uses. - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now that the shader exists. Verification on hertz (Pi 5 + V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and the output is bit-exact against the C reference (which is identical to the NEON .S code by construction — same FFmpeg upstream). 
Remaining cycle-6/7/9 work in task #165: - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per block, fewer blocks per WG) - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC not a transform) This commit lands the cycle-6 piece of task #165.	2026-05-23 20:06:20 +02:00
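The two-pass butterfly with `(val + 32) >> 6` rounding and clip-to-u8 add that the cycle 6 commit describes can be sketched in scalar C. This is a minimal illustration of the standard H.264 4x4 inverse transform, not the project's shader or reference. Note two assumptions: the function names are hypothetical, and the coefficient array is taken as row-major for readability, whereas the commit states the real block layout is column-major.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic (floor) right shift that is well defined for negative values. */
int asr(int v, int s)
{
    return v >= 0 ? v >> s : ~((~v) >> s);
}

/* 1D butterfly on four inputs, written to four outputs. */
void idct4_1d(const int in[4], int out[4])
{
    int z0 = in[0] + in[2];
    int z1 = in[0] - in[2];
    int z2 = asr(in[1], 1) - in[3];
    int z3 = in[1] + asr(in[3], 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* Row pass, then column pass, then round, add to dst and clip to 0..255.
 * coeff is assumed row-major here: coeff[4*row + col]. */
void h264_idct4_add_sketch(uint8_t *dst, const int16_t *coeff, ptrdiff_t stride)
{
    int tmp[16];
    for (int r = 0; r < 4; r++) {
        int in[4] = { coeff[4*r], coeff[4*r + 1], coeff[4*r + 2], coeff[4*r + 3] };
        idct4_1d(in, &tmp[4*r]);
    }
    for (int c = 0; c < 4; c++) {
        int in[4] = { tmp[c], tmp[4 + c], tmp[8 + c], tmp[12 + c] };
        int out[4];
        idct4_1d(in, out);
        for (int r = 0; r < 4; r++) {
            int v = dst[r * stride + c] + asr(out[r] + 32, 6);
            dst[r * stride + c] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }
}
```

A DC-only block is a quick check: a single coefficient of 64 raises every pixel of the 4x4 block by exactly 1, and -64 lowers every pixel by 1.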
claude-noether	737e87980d	QPU is default substrate: recipe table + ctx env-var override Per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU" — this lands two coupled changes that flip production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe: 1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for every kernel that has a shipped V3D compute shader: cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged) cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged) cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv) cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged) cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv) cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165) cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165) cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv) cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165) The R-band cost/benefit framework still applies but is now superseded for substrate selection by the decree. Where R stays RED, the cost is in dispatch overhead, which is a fixable engineering issue (tasks 160 buffer-pool, 161 persistent cmdbuf, 162 dmabuf import). 2) daedalus_ctx_create_no_qpu() now honours an env-var override: set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu silently escalates to a full daedalus_ctx_create(). Lets the libavcodec substitution shims in marfrit-packages (which pthread_once a create_no_qpu ctx — see libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths without rebuilding those patches. Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate daedalus-v4l2 follow-up). 
Smoke (hertz, Pi 5, kernel 6.18.29): === test_api_h264 === H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 ← flipped H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path H.264 qpel mc20: 1024/1024 bytes bit-exact === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact === test_api_lpf === all substrates bit-exact wd=4 and wd=8 The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in — it falls back to CPU silently, no regression. Closes daedalus-fourier tasks #163, #164. Refs the 2026-05-23 "QPU default substrate" decree.	2026-05-23 19:59:53 +02:00
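The fall-through rule quoted in the commit above (`eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx)` leads to `eff = SUBSTRATE_CPU`) can be illustrated with a small standalone sketch. The enum, the function name and the value chosen for AUTO are hypothetical stand-ins; only the 1=CPU / 2=QPU numbering appears in the log. The sketch also assumes that an explicit QPU request falls back the same way as a recipe-chosen one, which the message does not state for every kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the library's substrate enum.
 * 1 = CPU and 2 = QPU match the smoke-test output; AUTO = 0 is assumed. */
enum substrate { SUB_AUTO = 0, SUB_CPU = 1, SUB_QPU = 2 };

/* Pick the effective substrate for one dispatch:
 * AUTO defers to the recipe table, and a QPU choice degrades to CPU
 * when the context was created without QPU support. */
enum substrate resolve_substrate(enum substrate requested,
                                 enum substrate recipe,
                                 bool ctx_has_qpu)
{
    enum substrate eff = (requested == SUB_AUTO) ? recipe : requested;
    if (eff == SUB_QPU && !ctx_has_qpu)
        eff = SUB_CPU;
    return eff;
}
```

Under this rule a consumer holding a no-QPU context keeps getting CPU results even after the recipe table flips a kernel to QPU, which is the "no regression" property the message claims.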
claude-noether	98553278dd	v3d_runner: persistent per-pipeline command buffer Phase 2 of the QPU-default substrate campaign — eliminate vkAllocateCommandBuffers from the dispatch hot path. Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in v3d_runner_create_pipeline() and freed in destroy_pipeline(). The five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1) versus the driver-side allocation walk. Pool was already created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is permitted. Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1): before (task 160 pool only): steady-state p50: 76.44 us steady-state mean: 77.95 us after (task 160 pool + task 161 persistent cb): steady-state p50: 54.56 us steady-state mean: 56.00 us -> 28% per-dispatch reduction The remaining ~54 us steady-state is dominated by vkQueueWaitIdle + shader execution + the two memcpy(in/out) on the dst buffer — task 162 (dmabuf import for dst) targets the memcpy half. test_api_idct stays bit-exact across CPU/QPU/AUTO substrates. Refs daedalus-fourier task #161.	2026-05-23 19:56:35 +02:00
claude-noether	0a042a8e95	v3d_runner: buffer pool for QPU dispatch hot path Per the QPU-default substrate decree 2026-05-23: the per-dispatch vkAllocateMemory in dispatch_*_qpu was the biggest single fixable contributor to QPU dispatch overhead. This pools v3d_buffer allocations by power-of-2 size class so the second-and-subsequent dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7 memory-allocation cost per call. API additions (v3d_runner.h): - v3d_runner_acquire_buffer(): pulls from per-bucket freelist; falls through to v3d_runner_create_buffer() on miss. - v3d_runner_release_buffer(): pushes back onto the freelist; the backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in v3d_runner_destroy(). - v3d_runner_pool_total_bytes(): diagnostic watermark. Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall through to non-pooled (vkAllocateMemory) for both ends — pool stays correct, just degenerates to old behaviour for those calls. Migration: daedalus_core.c dispatch_*_qpu paths globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls. test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz). Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5, 6.18.29+rpt-rpi-2712, V3D 7.1): call 0: 434.89 us (cold — 3x vkAllocateMemory) call 1: 100.06 us (pool hit on all 3 buffers) steady-state: p50: 76.44 us p90: 90.52 us mean: 77.95 us first-call / steady-state ratio: 5.7x The remaining ~76us steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation — addressed in follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, eliminates memcpy in/out). Refs daedalus-fourier task #160.	2026-05-23 19:52:50 +02:00
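The power-of-2 size-class scheme described in the buffer-pool commit (classes 2^8..2^23, oversize requests bypassing the pool) might map a request to a bucket roughly as below. This is a sketch under stated assumptions: the macro and function names are hypothetical and are not the `v3d_runner` API, and rounding a request up to the next class is an assumed policy that the message implies but does not spell out.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_SHIFT 8   /* smallest class: 2^8  = 256 B */
#define POOL_MAX_SHIFT 23  /* largest class:  2^23 = 8 MiB */

/* Map a requested size to a bucket index 0..15 by rounding up to the next
 * power-of-2 class; returns -1 for oversize requests, which would take the
 * non-pooled allocation path. */
int pool_bucket_for(size_t size)
{
    size_t cap = (size_t)1 << POOL_MIN_SHIFT;
    for (int shift = POOL_MIN_SHIFT; shift <= POOL_MAX_SHIFT; shift++, cap <<= 1) {
        if (size <= cap)
            return shift - POOL_MIN_SHIFT;
    }
    return -1;
}

/* Capacity in bytes of the buffers held in a given bucket. */
size_t pool_bucket_bytes(int bucket)
{
    assert(bucket >= 0 && bucket <= POOL_MAX_SHIFT - POOL_MIN_SHIFT);
    return (size_t)1 << (POOL_MIN_SHIFT + bucket);
}
```

Rounding up trades some internal fragmentation (a 257-byte request occupies a 512-byte buffer) for a freelist lookup that needs no search within a bucket.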
marfrit	3ecfc8b0ef	Merge pull request 'docs: architecture backlog — correct fleet hardware mapping' (#4) from noether/architecture-backlog-fleet-fix into main Reviewed-on: #4	2026-05-23 15:12:29 +00:00
marfrit	c154253432	Merge pull request 'CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir}' (#5) from noether/pkgconfig-relocatable-prefix into main Reviewed-on: #5	2026-05-23 15:11:20 +00:00
claude-noether	b3de96b21c	CMakeLists: make daedalus-fourier.pc relocatable via ${pcfiledir} The pkg-config file was generated at configure time with `prefix=${CMAKE_INSTALL_PREFIX}`, which captured whatever CMAKE_INSTALL_PREFIX happened to be set to at `cmake -B build` time — typically the default `/usr/local`. `cmake --install build --prefix /foo` then put the files under /foo but the .pc still pointed pkg-config at /usr/local/include and /usr/local/lib, which broke downstream consumers configuring against the install tree. Concrete bite encountered today on hertz: the daedalus-v4l2 daemon CMake configure on a /tmp/df-prefix install tree resolved DAEDALUS_FOURIER_INCLUDE_DIRS to /usr/local/include (empty path on the test host), so main.c failed to find <daedalus.h>. Fix: write the .pc with `prefix=${pcfiledir}/<rel>` where <rel> is the configure-time-computed relative path from <prefix>/<libdir>/pkgconfig back to <prefix>. pkg-config substitutes ${pcfiledir} with the actual on-disk location of the .pc at lookup time, so the resolved prefix tracks wherever the install tree is moved to — including DESTDIR-staged builds, apt package installs, and ad-hoc `cmake --install --prefix /tmp/foo` test installs. The relative-path computation handles GNUInstallDirs layouts that add multiarch tuples (Debian's lib/aarch64-linux-gnu) without hard-coding `../..`. Tested on hertz (Debian trixie, libdir=lib): prefix=${pcfiledir}/../../ ... $ pkg-config --variable=prefix daedalus-fourier /tmp/df-prefix-test/lib/pkgconfig/../../ # mv preserves relocation: $ mv /tmp/df-prefix-test /tmp/df-prefix-moved $ pkg-config --variable=prefix daedalus-fourier /tmp/df-prefix-moved/lib/pkgconfig/../../ This unblocks the daedalus-v4l2 daemon out-of-tree builds against local daedalus-fourier installs and is a prerequisite for tidy test-rig deployments (per the hertz reload session 2026-05-23).	2026-05-23 16:55:31 +02:00
claude-noether	68dccd2911	docs: architecture backlog — correct fleet hardware mapping Original draft (PR #3) speculated wrongly on host-to-SoC mapping: - hertz and tesla were listed under RK3588. Verified via /proc/device-tree/compatible: both are raspberrypi,5-model-b / brcm,bcm2712 (tesla is an LXD container hosted on hertz, so necessarily shares the host SoC). - boltzmann (the only actual RK3588 in the fleet, 32 GB, kernel- dev / MCP hub, 8 W always-on) was omitted entirely. - noether (Pi 4 / BCM2711, the user's interactive workstation, where Firefox and mpv actually run) was omitted entirely. Corrects the per-SoC coverage table: BCM2712 Pi 5 — higgs, hertz, broglie, tesla (LXD on hertz) BCM2711 Pi 4 — noether (workstation), dcw3, dcw2 RK3588 — boltzmann Allwinner H6 — (not in fleet) Reasoning consequences: - Pi 5 row is now four hosts but one SoC. Adding a fifth Pi 5 doesn't pressure-test the architecture; substrate decisions are identical across the row. - The realistic forcing function for the Pi 4 path is "HW decode on noether matters and rpivid is still unstable upstream" — noether is a daily-driver Pi 4 workstation, so this is closer than the original draft implied. - The realistic forcing function for an RK3588 caps file is "AV1 playback on boltzmann matters" — rkvdec doesn't cover AV1, so Mali Valhall compute substrate becomes the only HW acceleration option there. "Re-read this when" list at the top + "Why deferred" section + decision log all updated. No change to the architecture sketch (caps directory, plugin layout, two-backend conclusion) — those were correct in the original; only the host-to-SoC mapping underneath them was wrong. Refs PR #3 (the merged original).	2026-05-23 15:47:55 +02:00
marfrit	7d6f106919	Merge pull request 'docs: architecture backlog for multi-SoC daedalus generalization' (#3) from noether/architecture-backlog into main Reviewed-on: #3	2026-05-23 03:31:56 +00:00
claude-noether	632dfc1e74	docs: architecture backlog for multi-SoC daedalus generalization Captures the design draft for generalizing the daedalus daemon across the fleet (Pi 5 + Pi 4 + RK3588 + Allwinner H6) while explicitly DEFERRING the work until a second SoC creates a forcing function. Key conclusions: - The recipe layer in daedalus-fourier (daedalus_recipe_dispatch_*) already abstracts substrate selection per kernel; scaling to multi-SoC is a data extension (caps/<soc>.toml), not new architecture. - libva-v4l2-request-fourier already abstracts over any V4L2 stateless decoder node; the cross-SoC seam is at the V4L2 device level, where the upstream stateless API put it. - The conceptual gap is that hardware decoders are NOT made of shaders — rkvdec on RK3588, Hantro G1/G2, VPU8, rpi-hevc-dec on Pi 5 are bitstream-in NV12-out monoliths. A generalized daemon needs TWO backends: substrate-composed (today's path) and codec-level pass-through to vendor V4L2 decoders. - On RK3588 + every codec rkvdec supports, the daedalus daemon is bypassed entirely — libva talks to rkvdec directly. The daemon is only ever in the path on SoCs where at least one codec needs substrate composition. Forcing functions for revisiting: - Pi 4 enters daily use with rpivid still unstable upstream (would require a V3D4 substrate-composed path with its own caps file and substrate verdicts). - A third-party user needs to swap shaders for V3D firmware experiments without rebuilding the daemon. - An x86 / panvk host enters the fleet needing dynamic SoC discovery rather than build-time pinning. Until then: keep daedalus daemon Pi 5 specific, push cross-SoC abstraction up to libva-v4l2-request-fourier (which already does most of it). 
Document covers: - current stack diagram (cycles 1-9 closed) - per-SoC codec coverage matrix - refined sketch: /usr/lib/daedalus/{shaders,caps,plugins} - illustrative bcm2712.toml + rk3588.toml caps files - where it gets hard (probing, fallback, stateful vs stateless, CI matrix, libva node selection) - open questions - decision log No code changes; document only. Refs reauktion/daedalus-v4l2#11 substitution arc closing; pivot to bug-fix backlog (#145 daemon SEGV, #146 D-state) is the next work block once cycle 9 deploys.	2026-05-23 05:05:31 +02:00
marfrit	209a4218bc	Merge pull request 'Phase 8c: H.264 luma qpel mc20 through public API' (#2) from noether/api-h264-qpel-mc20 into main Reviewed-on: #2	2026-05-23 01:29:24 +00:00
claude-noether	8fdef27a7d	Phase 8c: H.264 luma qpel mc20 through public API Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly. API additions (header + library): - daedalus_h264_qpel_meta { dst_off, src_off } - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) - daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper) - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9 The 6-tap horizontal half-pel filter signature matches FFmpeg's H264QpelContext convention exactly: dst and src share a single stride and src already points at output column 0 (filter reads cols -2..+3). Single-stride API to make the marfrit-packages FFmpeg shim a straight pointer-pass; no buffer rearrangement. Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Recipe table reflects that — the recipe_dispatch entry is a one-line forward to the CPU path. CMakeLists changes: - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before; now the public API needs it too) - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024 bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:25:24 +02:00
marfrit	d87239d817	Merge pull request 'CMakeLists: install rules + pkg-config for daedalus_core' (#1) from noether/installable-pkgconfig into main Reviewed-on: #1	2026-05-21 15:53:37 +00:00
claude-noether	47d0107809	CMakeLists: install rules + pkg-config for daedalus_core Make daedalus_core installable so sibling consumers (Phase 8 V4L2 daemon, future libva-v4l2-request-fourier integration tests, etc.) can `pkg_check_modules(DAEDALUS REQUIRED daedalus-fourier)` against a system-installed copy. Installs: - lib/libdaedalus_core.a - include/daedalus.h - lib/pkgconfig/daedalus-fourier.pc - share/daedalus-fourier/shaders/*.spv (only when DAEDALUS_BUILD_VULKAN is ON; consumers using daedalus_ctx_create_no_qpu() don't need them) pkg-config surfaces the static-archive transitive deps via Libs.private (-lpthread -ldl -lm) and Requires.private (vulkan), so a consumer doing `pkg-config --static --libs daedalus-fourier` gets the full link line. Non-static consumers (using the no_qpu path) get just `-ldaedalus_core`. No behaviour change to existing tests / benches. Verified on hertz (Pi 5, dev host): clean build, all 7 SPIR-V shaders + the static lib + the header + the .pc file land in the install prefix.	2026-05-21 17:49:49 +02:00
marfrit	0e4caae006	README: fix daedalus-v4l2 link (reauktion/, not marfrit/) User created the sibling repo under reauktion/ org. Updates all 5 cross-links in the README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:57:38 +00:00
marfrit	5e04b89d9d	README polish: reflect cycles 1-9 state + sibling daedalus-v4l2 The Phase-0-era README is updated to reflect the kernel-library project's actual state: - Status: 9 cycles closed (VP9 IDCT/LPF/MC, AV1 CDEF, H.264 IDCT4/IDCT8/deblock/MC) with deployment recipe table as the headline result. - Architecture: clarifies that 3 kernels deploy on QPU primary, 6 on CPU primary, 2 with opportunistic-QPU helper paths; V4L2 wrapper is the sibling daedalus-v4l2 (Option B + γ + sibling per locked Phase 8 architecture). - Layout: shows actual repo structure (include/, src/, tests/, docs/k_phase.md, external/ffmpeg-snapshot + dav1d-snapshot). - Build + run: concrete cmake commands and example bench invocations. - Consuming the kernel library: code snippet showing the public API (daedalus_ctx_create, daedalus_recipe_dispatch_*). - Conventions: updated dev process reference, current claude-noether SSH alias convention. - Sibling projects: added daedalus-v4l2 link. Old "single-kernel proof-of-concept negative result will close the project" framing replaced — the negative result test passed; project is alive and now in deployment phase. Project voice (Daedalus myth, higgs framing, honest-target posture) preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:55:40 +00:00
marfrit	5c8b09349c	Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only Last unmeasured H.264 kernel. mc20 picked as representative (horizontal half-pel, 6-tap filter; canonical for the H.264 luma qpel family). M1 PASS 10000/10000 first try, M3 = 131.477 Mblock/s on a single core (7.6 ns/block), 135x the 1080p30 floor. Per the cycles 6+7 lightweight-kernel rationale, Phase 4 deferred: QPU dispatch floor (~250 ns/block) is 33x above the NEON per-block cost; R9 ≈ 0.03 deep RED. No realistic QPU offload value. Generalization: all H.264 luma MC variants (mc02, mc11, mc22, etc.) will share this verdict. No need to measure each variant individually. H.264 NEON is dramatically faster than VP9 NEON across the board: - IDCT 4x4: 175 vs N/A (no VP9 analog) - IDCT 8x8: 151 vs 8.2 Mblock/s (18x faster) - MC 6/8-tap: 131 vs 7.0 (19x faster) - Deblock: 92 vs 48 Medge/s (2x faster) H.264 deployment recipe: all CPU NEON except deblock (opportunistic QPU). On a Pi 5 running H.264-only, the QPU is mostly idle. Cycles 1-9 complete. Public API exposes all 9. Next: daedalus-v4l2 sibling repo per locked Phase 8 architecture (B + γ + sibling), then README polish. - external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S vendored (1467 lines, all qpel variants) - tests/h264_qpel8_mc20_ref.c: 40-line C ref (clip255 of 6-tap convolution) - tests/bench_neon_h264qpel_mc20.c: M1 + M3 bench - docs/k9_h264qpel_mc20.md: cycle 9 closure with comparison matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:53:21 +00:00
marfrit	0a99b16489	Phase 8b: opportunistic QPU paths through public API Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264 deblock) through the public API. These three kernels have recipe substrate = CPU, but per Issue 003 the mixed-kernel helper value is real — the dispatch path must exist so override-mode callers can request QPU on the side. Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO alloc + memcpy + dispatch + readback). Each kernel has its own push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc 2-field shared with lpf). Notable bug caught + fixed in test_api_opportunistic_qpu: the initial dispatch_mc_8h_qpu sized src_max using CPU-side reach (src_off + 3 + 8 + 7*stride), but the QPU shader reads src[ src_off + row*stride + 0..14] for row=0..7. Last block had 3 uninitialized bytes → 99.8% match → 100% after fix. After this commit, the public API surface fully covers cycles 1-8: Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact Cycle 5 (CDEF): CPU recipe; QPU override (untested in this test — bench_v3d_cdef is the authoritative 3-way M1) Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe) Cycle 7 (H.264 IDCT 8x8): CPU only Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact comparison for VP9 MC and H.264 deblock through the API. test_api_idct, test_api_lpf, test_api_h264 still pass. Per the locked Phase 8 architecture (project_phase8_architecture memory): next session opens daedalus-v4l2 sibling repo with Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen FFmpeg parser). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:41 +00:00
marfrit	fd55f5ebc1	Phase 8 status doc — surfacing V4L2 architecture to user Per goal "c8p3..c8p7, then p8 — surface if user intervention is required": this is the surface point. Kernel-library work is complete (cycles 1-8 all dispatchable via public API, all CPU paths bit-exact, 3 QPU paths bit-exact, 3 opportunistic-QPU paths shader-exists-API-TODO). V4L2 wrapper architecture needs 4 user decisions: - Q1: Option A (v4l2loopback) / B (kernel V4L2 shim) / C (libva direct) - Q2: Parser source: FFmpeg-vendored / dav1d+libvpx mix / FFmpeg-dlopen - Q3: In-repo or sibling repo (daedalus-v4l2)? - Q4: End-to-end test target (tiny clips / 1080p30 / both) Recommended defaults (A / γ / sibling / both) documented; explicit confirmation requested before committing to days of work that locks in months of follow-on choices. Mechanical TODOs that can proceed in parallel without blocking V4L2 decision: cycle 3+5+8 opportunistic-QPU dispatch wiring through API, or cycle 9 (H.264 luma qpel MC, predicted CPU-only per cycle 6/7 pattern). 24 commits pushed to marfrit/daedalus-fourier this autonomous arc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:24 +00:00
marfrit	af8146a2cd	Phase 8a: H.264 kernels through public API Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:03 +00:00
marfrit	373f63a910	Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads, no spills). Phase 5 REDs applied: RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write RED-2: bench-enforced m.x >= 4*stride contract M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON). M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14). Lower than prediction; H.264 deblock has 4 early-return paths + 2 conditional writes that hurt V3D branchy execution more than expected. M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15 (neutral). M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s. QPU contribution is essentially unchanged from isolation — the cross-substrate contention is gentle (consistent with Issue 003's V4 finding). Verdict: H.264 deblock = opportunistic QPU helper. Same recipe slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core deblock capacity, available when CPU is busy with other work. Cycles 1-8 deployment recipe complete: Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound) Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON) Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock) Phase 9 lessons added: - Branchy kernels underperform V3D vs straight-line ones - Mixed-kernel helper value scales with isolation M2, not same-kernel M4 - R prediction needs branchiness weight, not just compute density - src/v3d_h264deblock.comp (132 inst QPU shader) - tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification) - tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK - CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock + h264dsp linked into bench_concurrent_mixed - docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe) Next: Phase 8 — V4L2 wrapper / deployment infra. Public API already exposes recipe-default substrate per kernel. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:44:21 +00:00
marfrit	f2ba08e1cf	Cycle 8 Phase 4: H.264 deblock QPU shader plan Per-column dispatch (16 cols/edge, 16 edges/WG, 256 inv/WG). Follows cycle 2 LPF wd=4 template with H.264-specific adjustments: alpha/beta gating + tc0 per-4-col-segment + ap/aq side conditions for conditional p1/q1 writes. Predicted shaderdb: ~150-200 inst, 2-3 threads. Predicted R8 = 0.09-0.14 ORANGE (per Phase 3 closure). 7 Phase 5 review items flagged for Sonnet audit: - tc0 sign-extension semantics - Multiple early-return safety (no barrier follows — safe) - abs() on int operands - clamp vs clip3 equivalence - per-segment tc0 LUT extraction tradeoff - alpha=0/beta=0 outer precondition - dst_off arithmetic Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:40:07 +00:00
marfrit	436a5c4f74	Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s M1: 10000/10000 bit-exact (after orientation fix: ff_h264_v_loop_filter is "vertical filtering of horizontal edges", not "vertical edge"; 16 columns process the edge horizontally with 8 rows of vertical context). M3: 91.947 Medge/s per core. Per-edge 10.9 ns. 11x worst-case 1080p30 floor, 30x realistic floor. Filter triggers on 25% of edges (random alpha/beta/tc0 covers both gating paths). Cycle 8 Phase 9 lesson: H.264/FFmpeg "v_loop_filter" naming uses filter DIRECTION (vertical) not edge orientation. Edge is horizontal; filter operates vertically across it. Distinct from cycle 6's column-major-block lesson but related discovery pattern. Encoded for future cycles. R8 prediction revised: 0.09-0.14 ORANGE (down from Phase 1's 0.3-0.8 estimate). H.264 deblock is 2x faster on NEON than VP9 LPF wd=4 (cycle 2) but H.264 deblock has more per-edge branches that hurt QPU more. Worth building anyway: - ORANGE in cycle 1's "M4 may rescue" band - Mixed-kernel deployment helper value (Issue 003) matters more than isolation R - 25%-trigger rate gives 4x effective contribution multiplier on QPU side - tests/h264_deblock_ref.c (column-walking C ref per row segment) - tests/bench_neon_h264deblock.c (M1 + M3 bench) - CMakeLists.txt: cycle 8 NEON bench wiring + h264dsp_neon.S - docs/k8_h264deblock_phase3.md (closure) Next: Phase 4 plan QPU shader, Phase 5 Sonnet review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:39:36 +00:00
marfrit	5a085e7180	Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored Targets the one H.264 kernel most likely to be QPU-worthy: in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in CPU-only because H.264 transforms are NEON-trivial. H.264 deblock has analogous structure to VP9 LPF (cycles 2+4, both GREEN) so predicted R8 = ORANGE/YELLOW. This commit: - Vendors ff_h264_*_loop_filter_*_neon from h264dsp_neon.S (1076 lines, includes both v/h luma + chroma + intra variants + weight/biweight) - PROVENANCE.md updated with the new vendored file - Phase 1 doc captures the full plan: start with luma vertical non-intra (most common case), defer Phase 3+ to next session H.264 deblock C ref scope is ~2 hours (per-row branching, per-4-row-segment tc0, ap/aq side conditions, alpha/beta thresholds — much more complex than VP9 LPF wd=4's single-branch filter). Deferring to fresh attention next session rather than rushing now. After cycle 8 closes, the H.264 QPU surface is well-characterised and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's substrate-routing recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:18:19 +00:00
marfrit	db2205d0e3	Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred M1: 10000/10000 bit-exact first try (column-major-block lesson from cycle 6 carried over cleanly). M3: 151.2 Mblock/s per core. Per-block 6.6 ns. 155x the 1080p30 floor (0.972 Mblock/s req'd). Phase-1 prediction of R7 = 0.5-0.9 YELLOW/GREEN was WRONG. H.264 IDCT 8x8 is dramatically lighter than VP9 IDCT 8x8 (18.5x faster NEON): VP9 IDCT 8x8: 122 ns/block (Q14 trig + COSPI multiplies) H.264 IDCT 8x8: 6.6 ns/block (pure integer butterfly + shifts) Phase 4 deferred via the cycle 6 lightweight-kernel rationale: NEON per-block << QPU dispatch floor; offload doesn't help. Phase 9 lesson updated: H.264 transforms (both 4x4 and 8x8) are NEON-trivial. Skip ALL H.264 transform cycles for QPU. Target compute-heavy H.264 kernels only (deblock = cycle 8 next; MC likely RED). Cycle 7 = 2nd consecutive "predicted GREEN, measured CPU-only" result. Forces a sharper view of which kernels QPU can actually help with: deblock and possibly some VP9 cases. - tests/h264_idct8_ref.c (column-major C ref) - tests/bench_neon_h264idct8.c (M1 + M3 bench) - CMakeLists.txt: cycle 7 bench wiring - docs/k7_h264idct8_phase3_and_4.md (closure) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:16:42 +00:00
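The per-block time and floor margin quoted in the cycle 7 closure can be reproduced with a minimal C sketch. The helper names below are hypothetical and not part of daedalus-fourier; the sketch assumes the 1080p30 floor counts one luma plane of 8x8 blocks, which is what the quoted 0.972 Mblock/s figure works out to.

```c
/* Hypothetical helper: convert throughput in Mblock/s to ns per block. */
static double ns_per_block(double mblock_per_s)
{
    return 1000.0 / mblock_per_s;
}

/* Hypothetical helper: required Mblock/s for one luma plane of
 * width x height pixels, split into block_dim x block_dim blocks, at fps.
 * Assumption: luma only, no chroma blocks counted. */
static double luma_floor_mblock_per_s(int width, int height,
                                      int block_dim, int fps)
{
    double blocks_per_frame =
        ((double)width * (double)height) / ((double)block_dim * block_dim);
    return blocks_per_frame * fps / 1e6;
}

/* Hypothetical helper: how many times the measured rate exceeds the floor. */
static double floor_margin(double measured_mblock_per_s,
                           double floor_mblock_per_s)
{
    return measured_mblock_per_s / floor_mblock_per_s;
}
```

With the figures from the log, `luma_floor_mblock_per_s(1920, 1080, 8, 30)` gives the 0.972 Mblock/s floor, and `floor_margin(151.2, 0.972)` lands in the range the commit rounds to 155x.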
marfrit	480f34f6e6	Cycle 7 (H.264 IDCT 8x8) opened — Phase 1 goal doc Predicted R7 = 0.5-0.9 YELLOW/GREEN. Closely analogous to cycle 1 (VP9 IDCT 8x8 R=0.92 GREEN): same block size, same lane geometry, same data shape. H.264 8x8 IT uses integer butterfly with 3 sub-stages (vs cycle 1's Q14 trig single butterfly) — more compute per pass but simpler operations. Phase 1 documents: - Spec butterfly (e/f/g stages per H.264 §8.5.13) - 30fps@1080p floor = 0.972 Mblock/s (same as cycle 1 since same block density) - NEON ref = ff_h264_idct8_add_neon (already vendored in cycle 6's h264idct_neon.S) - Cycle 8-10 preview: chroma MC, luma qpel MC, in-loop deblock Phase 3 next session: write column-major C ref + bench, capture M1 + M3. Then Phase 4 plan (likely cycle-1 v3d_idct8.comp adapted to integer butterfly), Phase 5 review, Phase 6 implement, Phase 7 measure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:15:37 +00:00
marfrit	7288473d79	Cycle 6 closed (deferred Phase 4): IDCT 4x4 too small for QPU Phase 4 QPU shader DEFERRED (not RED-by-build, but predicted-RED and not worth building): - NEON delivers 175 Mblock/s (5.7 ns/block) on a single core - QPU per-block floor ~250 ns (from cycle 1 scaling) → R6 = 0.022 - Mixed-kernel helper contribution would be ~1-2 Mblock/s — <1% of NEON capacity - 30fps@1080p worst case = 5.85 Mblock/s; NEON delivers 30x that on ONE core. No need for QPU help. Phase 9 lesson: for any cycle with NEON per-block < ~30ns, predict deep RED and defer Phase 4 unless there's a specific structural QPU advantage. Shapes future cycle selection: prefer compute-heavy kernels (cycle 7 H.264 IDCT 8x8 next; cycle 9 luma qpel MC; cycle 10 deblock). Cycle 6 phase tally: Phase 1 ✓, Phase 2 implicit, Phase 3 ✓ (M1 + M3), Phase 4 DEFERRED, Phase 5-7 N/A, Phase 8 trivial CPU-only (recipe = stay CPU), Phase 9 ✓. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:15:25 +00:00
marfrit	f92dc40f43	Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle). Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip). Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block). Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers interleaves loading, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+). - external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines) - external/ffmpeg-snapshot/PROVENANCE.md: updated - tests/h264_idct4_ref.c: column-major C ref - tests/bench_neon_h264idct4.c: M1 + M3 bench - CMakeLists.txt: cycle 6 NEON bench wiring - docs/k6_h264idct4_phase1.md, phase3.md Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED — kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:14:43 +00:00
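The column-major convention described in the cycle 6 commit can be illustrated with a minimal C sketch. Only the `block[c*4 + r]` layout comes from the commit message; the helper names `h264_blk4_index` and `h264_blk4_from_row_major` are hypothetical and do not exist in FFmpeg or in this repository.

```c
#include <stdint.h>

/* Hypothetical helper: index of the coefficient at (row r, col c) in a
 * 4x4 H.264 block stored column-major, i.e. block[c*4 + r]. */
static inline int h264_blk4_index(int r, int c)
{
    return c * 4 + r;
}

/* Hypothetical helper: repack a row-major 4x4 block (src[r*4 + c]) into
 * the column-major order described in the commit message. */
static void h264_blk4_from_row_major(const int16_t src[16], int16_t dst[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[h264_blk4_index(r, c)] = src[r * 4 + c];
}
```

A C reference that indexes with `r*4 + c` instead would read the transposed block, which is consistent with the low initial bit-exact rate reported before the fix.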
marfrit	eb5cfb34c4	Phase 8: wire LPF wd=4 + wd=8 QPU through public API Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:57:25 +00:00
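The push-constant mismatch described above can be sketched as two separate C structs with compile-time layout checks. This is an illustration only: the commit names `lpf4_pc`, `lpf8_pc`, `dst_stride_u8` and `blocks_per_row`, but the field in slot 0 and the 32-bit slot width are assumptions made for the sketch, not taken from the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: each push-constant slot is one 32-bit word. The commit does
 * not name slot 0, so "slot0" is a placeholder in both structs. */
struct lpf4_pc {
    uint32_t slot0;          /* placeholder, real meaning not stated */
    uint32_t dst_stride_u8;  /* slot 1 in the wd=4 shader */
};

struct lpf8_pc {
    uint32_t slot0;          /* placeholder, real meaning not stated */
    uint32_t blocks_per_row; /* slot 1, unused by the wd=8 shader */
    uint32_t dst_stride_u8;  /* slot 2 in the wd=8 shader */
};

/* Catch a slot-order mix-up at compile time instead of as corrupted output. */
_Static_assert(offsetof(struct lpf4_pc, dst_stride_u8) == 1 * sizeof(uint32_t),
               "wd=4: dst_stride_u8 must sit in slot 1");
_Static_assert(offsetof(struct lpf8_pc, dst_stride_u8) == 2 * sizeof(uint32_t),
               "wd=8: dst_stride_u8 must sit in slot 2");
```

Keeping one struct per shader, as the commit does, means a layout change in either shader only has to be mirrored in one place.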
marfrit	1085c5699c	Phase 8: wire IDCT QPU dispatch through public API daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:55:55 +00:00
marfrit	760f6a4060	Phase 8 skeleton: public C API + first end-to-end smoke test include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:54:43 +00:00
marfrit	5223d3cb3f	Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:52:46 +00:00
marfrit	1740e7c165	Cycle 5 Phase 4: QPU CDEF shader plan (predicted deep RED) Per-block stencil: 12 constrain ops per pixel, 64 pixels per block, 4 blocks/WG, 256 invocations/WG. Predicted R5 = 0.03 (deep RED) from cycle-3 MC scaling. Plan calls out 5 Phase 5 review items, notably sentinel handling and signed/unsigned min/max distinction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:47:18 +00:00
marfrit	9c0bd72e70	Cycle 5 Phase 3 closed: M1 PASS via bench pointer-convention fix The previous "layout mismatch" deferral was a one-line bench bug: NEON expects the caller to pass tmp pointing at the 8x8 block origin (after the 216+2 padding skip), but the bench passed the raw padded-buffer origin. C ref does the advance internally, so it filtered the correct block; NEON filtered a (+2 rows, +2 cols) shifted region. Diagonal-shift trace in the partial doc was exactly that. Fix: tmps + iTMP_INTS + (2*TMP_W + 2) for NEON calls. Results: M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced M3: 3.809 Mblock/s (consistent with 3.923 from longer window) Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier analysis. Will build QPU CDEF anyway for cycle-completeness + V4L2 dispatch-path existence. - tests/bench_neon_cdef.c: 3-line tmp pointer fix - docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:46:50 +00:00
marfrit	2dd774a9ab	Issue 003 closed: mixed-kernel M4 validates V4 deployment shape bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:44:08 +00:00
marfrit	460a6a6d08	Calibration: M4 same-kernel measures worst-case contention User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only' verdicts were based on M4 measuring same-kernel concurrent NEON+QPU, which is the WORST case for memory-bandwidth contention. A real decoder pipeline has CPU doing kernel A + QPU doing kernel B concurrently — different access patterns contend less. Concretely: in a real pipeline, CPU runs entropy + MC + other work while QPU is idle except for IDCT + LPF. The 'opportunistic QPU helper' for CDEF (or MC) hasn't been measured. M4 set the bar too high. Updates: - docs/k3_mc_phase7.md §'M4 methodology caveat' added with the user's contribution framing - docs/k5_cdef_phase3_partial.md §'Deployment recommendation' softened from 'CPU only' to 'CPU baseline; QPU helper viable in mixed-kernel deployment, unmeasured' - docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous test to close the question (4 variants: bandwidth+bandwidth, compute+CDEF, same-kernel control, real-pipeline mix) - ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/ feedback_m4_same_kernel_worst_case.md added — carries the calibration into future cycles + Phase 8 deployment decisions - MEMORY.md index updated The bandwidth-bound vs compute-bound classification still holds at the kernel level — Phase 9 cross-cycle lesson stays valid. But its mapping to deployment is nuanced: - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4) - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has bandwidth-light CPU work running concurrently (cycles 3+5, needs Issue 003 measurement) Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU or QPU at runtime (not hard-baked), so Issue 003's result can update the dispatch table without re-architecture. No code changes. Doc + memory + issue only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:31:27 +00:00
marfrit	20b59cd6a5	Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred CDEF is the most compute-intensive kernel measured so far — 254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x even on single NEON core in isolation. M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact gate failing due to tmp-layout mismatch between my standalone C reference and dav1d's NEON expectation. The smoking gun: NEON output appears at (+2 rows, -2 cols) shifted positions vs C ref output — suggests NEON's padding-function output has a different convention than my manual tmp construction. Untangled in setup work: - dav1d has TWO directions tables: stride-12 in src/tables.c (C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side). Initially vendored the C-side; should have used the NEON-side. - dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon (a separate function with its own conventions), not the C-side padding() function from cdef_tmpl.c. - Updated cdef_ref.c to use NEON-layout (stride 16) with table transcribed from cdef_tmpl.S. Algorithm matches — but bench's manual tmp construction doesn't match what NEON expects. Resolution paths for next session (documented in docs/k5_cdef_phase3_partial.md §'Resolution paths'): 1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest) 2. Vendor dav1d's full C reference (most rigorous) 3. Reverse-engineer dav1d's padding output layout (hackiest) Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED). CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound kernels don't benefit from QPU offload). 30fps floor still passes regardless. 
New artifacts: - external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored) - external/dav1d-snapshot/config.h — 14-define asm preamble shim - tests/cdef_ref.c — standalone C ref (algorithmically correct, layout mismatch with NEON known) - tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured) - docs/k5_cdef_phase3_partial.md — phase 3 partial closure + resumption checklist dav1d snapshot in PROVENANCE.md should be updated next session with the new cdef_tmpl.S entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:21:24 +00:00
marfrit	2cd2258a7b	Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2 docs lay out the structural complexity (CDEF needs pre-padded 12x12 working buffer + external edge context + direction lookup + constraint function — meaningfully more complex than cycles 1-4). Phase 3+ deferred to next session — CDEF is the first cycle that doesn't fit cleanly into a single autonomous run. Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than FFmpeg's LGPL-2.1+): src/arm/64/cdef.S 520 lines — NEON impl src/arm/64/util.S 278 lines — NEON helpers src/arm/asm.S 335 lines — GAS preamble src/cdef_tmpl.c 331 lines — C reference (templated) include/common/intops.h 84 lines — utility helpers src/tables_cdef_subset.c hand-extracted — dav1d_cdef_directions only (avoids dragging full 1013-line tables.c + transitive includes) Discovery from Phase 2 analysis: - Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes (dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h). The 'tmp' arg is the pre-padded 12x12 buffer constructed externally by the dav1d C-side padding() function. - Tap weights are inline-computed (not table): pri_tap = 4 or 3 (based on pri_strength bit), sec_tap = 2 or 1. Only dav1d_cdef_directions[12][2] is an external table. - Constraint function: constrain(diff, threshold, shift) = apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))), diff) Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than LPF (per-pixel min/max conditional logic), so likely worse R than cycle 2/4 but better than cycle 3 MC. M4 gate likely required. What Phase 3+ needs (next session): 1. config.h shim for dav1d's asm preamble (defines TBD on first build) 2. Standalone C reference for cdef_filter_block_8x8_c (cdef_tmpl.c references several dav1d private headers; cleaner to transcribe to a self-contained tests/cdef_ref.c) 3. tests/bench_neon_cdef.c — M1+M3 bench 4. 
Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure PROVENANCE.md documents pin + per-file role + re-vendoring procedure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:12:25 +00:00
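The constraint function quoted in the Phase 2 discovery notes can be written out as a minimal scalar C sketch. It follows the formula exactly as given in the commit message; it is not dav1d's source, and it assumes the caller passes a non-negative `shift`.

```c
#include <stdlib.h>

/* Scalar sketch of the formula from the commit message:
 *   constrain(diff, threshold, shift) =
 *     apply_sign(min(abs(diff),
 *                    max(0, threshold - (abs(diff) >> shift))), diff)
 * Assumption: shift >= 0 (the caller clamps it). */
static int constrain(int diff, int threshold, int shift)
{
    int adiff = abs(diff);
    int limit = threshold - (adiff >> shift);
    if (limit < 0)
        limit = 0;                       /* max(0, ...) */
    int mag = adiff < limit ? adiff : limit; /* min(abs(diff), limit) */
    return diff < 0 ? -mag : mag;        /* apply_sign(mag, diff) */
}
```

Small differences pass through unchanged, while large differences are attenuated toward zero: for example `constrain(1, 4, 1)` returns 1 and `constrain(20, 4, 1)` returns 0.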
marfrit	20e3d004ae	Issues 001+002: defer LPF wd=16 + LPF vertical variants Per user direction at cycle-4 close: file wd=16 (trend prediction validation) and vertical variants (column-stride TMU behaviour unknown) as local issues for future cycles. Progress instead to CDEF (AV1) for codec breadth. docs/issues/001 — wd=16 prediction validation. Per cycle 4 lesson 4, trend says wd=16 likely flips M4 negative. Quick incremental cycle when revisited. docs/issues/002 — vertical variants. Different memory access pattern (column-strided vs row-strided). The load-bearing unknown is whether the cycle 2 +6.9% mixed gain survives the TMU coalescing shift. If positive, deployment recipe gains symmetry; if negative, must split by orientation. Both issues have acceptance criteria + expected outcomes documented. Cycle 5 next: CDEF (AV1) — codec-breadth expansion. No Gitea repo exists for daedalus-fourier yet (project is local- only). If a tracker is wanted, create the repo and migrate these .md files. For now they live in-tree as part of the project history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:09:51 +00:00
marfrit	85feba4087	Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison:
| | wd=4 (k2) | wd=8 (k4) |
| M3 NEON | 48.3 | 52.4 |
| M2 QPU iso | 19.6 | 17.8 |
| R iso | 0.41 | 0.34 |
| M4 delta | +6.9% | +4.1% |
| 30fps mixed | 7.2x | 20.3x |
| Verdict | GO QPU | GO QPU |
NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. 
Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:56:25 +00:00
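The two headline metrics in the cycle 4 closure can be recomputed from the quoted throughputs with a minimal C sketch. The log does not spell out the formulas; the ratio form R = M2 / M3 and the M4 delta relative to the NEON 4-core baseline are inferred from the figures, and the helper names are hypothetical.

```c
/* Inferred definition: isolation ratio R = QPU throughput (M2)
 * divided by single-core NEON throughput (M3). */
static double r_isolation(double m2_qpu, double m3_neon)
{
    return m2_qpu / m3_neon;
}

/* Inferred definition: M4 delta in percent = gain of the mixed
 * NEON-3 + QPU configuration over the pure NEON 4-core baseline. */
static double m4_delta_pct(double mixed, double neon4_baseline)
{
    return 100.0 * (mixed - neon4_baseline) / neon4_baseline;
}
```

Feeding in the cycle 4 numbers, `r_isolation(17.847, 52.382)` is about 0.341 and `m4_delta_pct(39.389, 37.823)` is about +4.1, matching the values reported in the commit.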
marfrit	356e446a49	Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% Third daedalus-fourier kernel — VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''. Phase 5''' second-model review delivered cleanly — caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms. Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure). Performance: M2''' = 1.413 Mblock/s (707.9 ns/block) M3''' = 20.997 Mblock/s (NEON baseline phase3) R''' = 0.067 (RED band — structural mismatch) shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills M4''' concurrent matrix (8s windows): NEON 1-core 14.479 Mblock/s NEON 4-core 15.248 Mblock/s <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!) QPU only 1.380 Mblock/s MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate) MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3% NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core — near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s. But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails — daily YouTube playback works fine on any CPU/QPU split. 
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split): IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core) LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core) MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU) Entropy -> CPU (structurally serial) Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc. New artifacts: - src/v3d_mc_8h.comp — GLSL kernel - tests/vp9_mc_ref.c — standalone C ref (REGULAR filter embedded; clean transcription) - tests/bench_neon_mc.c — M1'''_c + M3''' bench - tests/bench_v3d_mc.c — M1''' + M2''' bench with contract asserts + 30fps margin display - tests/bench_concurrent_mc.c — M4''' pthread bench - external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored) - external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c) - docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:51:43 +00:00
marfrit	36eca40ff2	Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly: no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges): the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected.

Performance:
  M2'' = 19.645 Medge/s (50.9 ns/edge)
  M3'' = 48.285 Medge/s (NEON baseline from phase3)
  R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment)

shaderdb: 160 inst, 4 threads, 0 spills, 21 max-temps. The shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):
  NEON-1 LPF           41.131 Medge/s
  NEON-4 LPF           33.726 Medge/s  <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1)
  QPU only             14.299 Medge/s
  MIXED NEON-3 + QPU   36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU   31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: oversubscribed mode hurts for lighter kernels (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
  1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
  2. Phase 5 review pays off every cycle
  3. R isolation misleading on bandwidth-saturated hardware
  4. Oversubscription tax depends on kernel weight
  5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
  - src/v3d_lpf_h_4_8.comp: GLSL kernel
  - tests/bench_v3d_lpf.c: M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation
  - tests/bench_concurrent_lpf.c: M4'' pthread bench (mirrors bench_concurrent.c)
  - docs/k2_deblock_phase{4,5,7}.md: plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:39:26 +00:00
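The headline figures of the cycle-2 closure (R'' and the M4'' percentages) can be re-derived from the quoted throughputs. The sketch below assumes that R is the QPU throughput divided by the single-core NEON baseline (M2/M3) and that the M4 percentages are taken relative to the NEON-4 row, which is how the numbers in the message work out. The helper names `r_ratio`, `delta_pct` and `print_cycle2_gates` are hypothetical and do not come from the benches.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers for illustration; not taken from the benches. */

/* R = QPU throughput / single-core NEON throughput (M2 / M3). */
static double r_ratio(double m2_qpu, double m3_neon)
{
    assert(m3_neon > 0.0);
    return m2_qpu / m3_neon;
}

/* Percent change of a configuration relative to a baseline throughput. */
static double delta_pct(double config, double baseline)
{
    assert(baseline > 0.0);
    return 100.0 * (config - baseline) / baseline;
}

/* Recompute the cycle-2 figures from the throughputs in the message. */
void print_cycle2_gates(void)
{
    const double neon4 = 33.726;  /* Medge/s, NEON-4 LPF baseline */

    printf("R''              = %.2f\n", r_ratio(19.645, 48.285));
    printf("NEON-3 + QPU     = %+.1f%%\n", delta_pct(36.049, neon4));
    printf("NEON-4 + QPU     = %+.1f%%\n", delta_pct(31.892, neon4));
    printf("4-core / 1-core  = %.2fx\n", neon4 / 41.131);
}
```

The last line reproduces the bandwidth-saturation signal the message relies on: four cores deliver less than one core does alone, which is why freeing a core for the QPU helps here.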
marfrit	be7ff5587c	Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

Second kernel candidate per phase7_M4.md verdict "next-kernel cycle authorised". VP9 4-tap inner loop filter, horizontal direction, 8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline). Different workload shape from IDCT - boundary streaming, lighter compute per unit, per-row conditionals - tests whether QPU win generalises.

docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules as cycle 1 (phase1.md), with the cycle-1 calibration adjustment: ORANGE band is no longer auto-close because M4 showed mixed > pure CPU even at modest R when CPU bandwidth-saturates.

docs/k2_deblock_phase2.md - situation analysis. C reference already in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin, SHA-256 384e49e7...). PROVENANCE.md updated.

docs/k2_deblock_phase3.md - NEON baseline:
  M1''_c bit-exact       100.0000 % (10000 random edges)
  M3'' throughput        48.285 Medge/s (20.7 ns/edge, single A76)
  per-frame 1080p-eq     748 FPS (worst case 64 530 edges/frame)
  cycles/edge            ~58 (=20.7ns x 2.8GHz), ~7 cycles/row

LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9 - frame-level batching amortises the same 33us dispatch overhead; workload becomes bandwidth-bound rather than compute-bound (~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge).

New artifacts:
  - tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4 inner only; clean transcription)
  - tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench (5s window, edge-content-biased RNG for realistic fm/hev hit rates)
  - external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S
  - CMakeLists.txt updated with bench_neon_lpf target

Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md + phase5.md + phase7.md learnings apply directly - bake in the v4 winning patterns from the start (WG=256, edges-per-subgroup pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled writes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:28:57 +00:00
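The phase-3 baseline figures in this commit (ns/edge, cycles/edge, the 1080p-equivalent frame rate and the per-frame traffic estimate) are unit conversions of one another. A small C sketch of that arithmetic is below. The function names are hypothetical, and the 2.8 GHz clock, the 64 530 edges/frame worst case and the ~88 B per edge are simply the values quoted in the message, not independently measured.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers; they only restate the unit conversions. */

/* Throughput in Munit/s -> latency in ns per unit. */
static double ns_per_unit(double munits_per_s)
{
    assert(munits_per_s > 0.0);
    return 1000.0 / munits_per_s;
}

/* ns per unit at a given clock (GHz) -> CPU cycles per unit. */
static double cycles_per_unit(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}

/* Frames per second when every frame costs units_per_frame units. */
static double frames_per_second(double munits_per_s, double units_per_frame)
{
    assert(units_per_frame > 0.0);
    return munits_per_s * 1e6 / units_per_frame;
}

/* Bytes moved per frame for a given per-edge footprint. */
static uint64_t frame_traffic_bytes(uint64_t edges, uint64_t bytes_per_edge)
{
    return edges * bytes_per_edge;
}

/* Recompute the phase-3 LPF baseline figures from M3''. */
void print_lpf_baseline(void)
{
    const double m3 = 48.285;  /* Medge/s, single A76 */
    const double ns = ns_per_unit(m3);

    printf("ns/edge      = %.1f\n", ns);
    printf("cycles/edge  = %.0f\n", cycles_per_unit(ns, 2.8));
    printf("1080p-eq FPS = %.0f\n", frames_per_second(m3, 64530.0));
    printf("MB/frame     = %.1f\n",
           (double)frame_traffic_bytes(64530, 88) / 1e6);
}
```

With the quoted inputs these land on the message's rounded values (20.7 ns, ~58 cycles, 748 FPS, ~5.7 MB), which suggests the figures were derived this way.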


54 Commits