Files

T

marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)

Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 11:47:03 +00:00

11 KiB

Raw Permalink Blame History

phase, status, date_opened, parent, target_kernel

phase	status	date_opened	parent	target_kernel
2	closed 2026-05-18	2026-05-18	phase1.md	VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels)

Phase 2 — Situation analysis

Per dev_process.md:

Document current state. Identify constraints, dependencies, known failure modes. Reset context here — do not carry assumptions from prior sessions; re-read CLAUDE.md, relevant memory files, run git status, re-verify reachability.

1. Context reset

Working tree state: dirty (Phase 0/1/2 docs not yet committed). .git-broken-2026-05-18/ preserved as a forensic artifact of the 2026-05-18 10:25 working-tree wipe (cause undetermined).
CLAUDE.md re-read: no contradictions with the Path B scope set in README §"Architecture (Path B)".
hertz reachability: confirmed via SSH; vcgencmd, vulkaninfo, apt, sudo NOPASSWD all working as of 2026-05-17 inventory. Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.

2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)

The Phase 1 kernel has two canonical reference implementations in FFmpeg n7.1.3 (the version installed on hertz). The harness will link both: the C path as the bit-exact gate (M1), the NEON path as the throughput baseline (M3).

2.1 C reference

Source: libavcodec/vp9dsp_template.c, function idct_idct_8x8_add_c
Spec basis: VP9 specification §8.7 — Inverse transform process

Signature:

static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride,
                                int16_t *_block, int eob);

Algorithm (8-bit path):
1. If eob == 1 (DC-only): single (coef * 11585 * 11585) round, broadcast to 8×8 with +pred, clamp[0,255].
2. Otherwise: 8 column passes through idct8_1d → tmp[64]. Zero the input block. 8 row passes through idct8_1d → out[8]. Per-element (out + 16) >> 5, add to dst, av_clip_pixel.
idct8_1d: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (dctint).

Q14 constants (matched against VP9 spec §8.7.1.4):

symbol	value	trig identity
cospi_16_64	11585	cos(π/4) × 2^14 ≈ 0.70711
cospi_24_64	6270	cos(3π/8) × 2^14 ≈ 0.38268
cospi_8_64	15137	sin(3π/8) × 2^14 ≈ 0.92388
cospi_28_64	3196	cos(7π/16) × 2^14 ≈ 0.19509
cospi_4_64	16069	sin(7π/16) × 2^14 ≈ 0.98079
cospi_20_64	9102	cos(5π/16) × 2^14 ≈ 0.55557
cospi_12_64	13623	sin(5π/16) × 2^14 ≈ 0.83147

Rounding convention: (product + (1 << 13)) >> 14, i.e. round-half-up at bit 14.

License: LGPL-2.1-or-later (FFmpeg).
Side effect: zeroes the input block[] (idempotency requirement; matches spec).

2.2 NEON reference

Source: libavcodec/aarch64/vp9itxfm_neon.S, symbol ff_vp9_idct_idct_8x8_add_neon

Signature (same as C):

void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride,
                                   int16_t *block, int eob);

Registers: x0=dst, x1=stride, x2=block, w3=eob.

Internal dependencies (must be copied alongside the .S):

macro / symbol	location	role
`idct8`	`vp9itxfm_neon.S`	1-D 8-pt IDCT, fully unrolled with `dmbutterfly*`
`dmbutterfly0`	`vp9itxfm_neon.S`	rotation by π/4 (the `cospi_16_64` case)
`dmbutterfly`	`vp9itxfm_neon.S`	general 2-input rotation `[a,b] → [a·c1−b·c2, a·c2+b·c1]` (`Q14`)
`dmbutterfly_l`	`vp9itxfm_neon.S`	wide-form (4×i32 acc) for `dmbutterfly`
`butterfly_8h`	`vp9itxfm_neon.S`	trivial `[a+b, a−b]` on `int16x8_t`
`transpose_8x8H`	`libavcodec/aarch64/neon.S`	in-place 8×8 i16 transpose
`idct_coeffs`	`vp9itxfm_neon.S` (`const`)	Q14 trig constants table, aligned 4
`movrel`	`libavutil/aarch64/asm.S`	PIC-aware constant-pool relocation helper

License: LGPL-2.1-or-later (Google, 2016).
Performance shape: full unrolled 8-pt butterfly with NEON smull/smlsl/smlal + rshrn for the Q14 round-shift; output uses sqxtun for saturated narrow to u8. Estimated ~80 NEON instructions for the steady state (non-DC) path.

2.3 AV1 equivalence note

AV1's 8×8 DCT_DCT transform (av1_iidentity8_iidentity8_c vs av1_idct8_idct8_c family in libavcodec/av1dsp/...) shares the same 1-D 8-point structure but with different scaling: AV1 uses 12-bit fixed-point (>> 12) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. AV1 support is out of scope for Phase 1. A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.

3. Vulkan compute dispatch path

Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU, API 1.3.305, conformance 1.3.8.3. The compute-only dispatch path is:

host program
  ├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
  ├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
  ├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
  │     - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
  │     - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
  ├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
  ├─ vkCreateShaderModule (SPIR-V from glslang)
  ├─ vkCreateComputePipeline
  ├─ vkBeginCommandBuffer
  │     vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
  │     vkCmdDispatch(group_count_x, 1, 1)   # one WG per ~K blocks
  ├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
  └─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)

Per Phase 0 §2 inside-view limits, the relevant constraints for this kernel:

≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we only need 2: block[] in, dst[] in/out).
Shared mem ≤16 KiB → each 8×8 block fits trivially (256 B in i16 plus 64 B in u8). One WG can carry dozens of blocks of shared state if useful.
Subgroup size = 16 (fixed). One workgroup of 64 invocations = 4 subgroups; one block per subgroup is a natural shape (each 16-lane subgroup processes 8×8 = 64 pixels in 4 cycles of subgroup work).

4. Build path on hertz

Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian trixie default), libvulkan-dev 1.4.309, glslang-tools 15.1.0, spirv-tools 2025.1, libdrm-dev 2.4.131, vulkan-tools 1.4.304.

Missing but cheap:

libavcodec-dev — only needed if the harness wants to link against system libavcodec for cross-checks against the dynamic dispatcher. Not needed for the source-copy approach (preferred, see §5).

5. Reference-copy strategy (vs system-libavcodec link)

Decision: source-copy the 3 FFmpeg files into external/ffmpeg-snapshot/.

Rationale:

System libavcodec.so on hertz is symbol-stripped (nm returns empty for ff_vp9_idct_*). Internal NEON entry points are not reachable via dlsym.
The two reference implementations (C, NEON) plus their macro/ data dependencies total ~3 files / ~600 lines. Source-copy is smaller than the dlopen plumbing would be.
LGPL-2.1-or-later (FFmpeg license) is propagation-compatible with the harness binary if the harness binary itself is GPL or LGPL. The kernel shaders and dispatch library stay separately-licensed (BSD-2-Clause, default for this project).
Pinning to n7.1.3 matches hertz's runtime libavcodec version, so any in-session sanity cross-check against the running Mesa / video tooling stays consistent.

Files to vendor:

Source	License	Target path under `daedalus-fourier/`
`libavcodec/vp9dsp_template.c`	LGPL-2.1+	`external/ffmpeg-snapshot/vp9dsp_template.c`
`libavcodec/aarch64/vp9itxfm_neon.S`	LGPL-2.1+	`external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S`
`libavcodec/aarch64/neon.S` (for `transpose_8x8H`)	LGPL-2.1+	`external/ffmpeg-snapshot/aarch64/neon.S`
`libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`)	LGPL-2.1+	`external/ffmpeg-snapshot/aarch64/asm.S`
(whatever else `vp9dsp_template.c` transitively needs)	LGPL-2.1+	as required

A external/ffmpeg-snapshot/COPYING.LGPL and external/ffmpeg-snapshot/PROVENANCE.md document the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.

6. Known constraints / failure modes carried from Phase 0

Repeated here so Phase 4 (plan) can bind against them without re-derivation:

C1: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
C2: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
C3: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
C4: subgroupSupportedOperations = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions like subgroupAdd; but subgroupShuffle is available — corrected per phase5.md finding 6). Workaround needed for accumulators: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite. subgroupShuffle is an alternative to shared-mem transpose for Phase 7.
C5: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
C6: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
C7: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
C8: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.

7. What Phase 2 does not close

The harness architecture (single binary? Two binaries — one for bit-exact, one for throughput?). Phase 4 picks.
Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
Random-coefficient generation strategy (uniform i16 vs realistic-distribution; the latter affects DC-only path frequency). Phase 4 picks; Phase 7 may re-evaluate.
Whether NEON measurement uses clock_gettime(CLOCK_MONOTONIC_RAW) per-call (high overhead) or batched (more realistic for codec use). Phase 3 picks during baseline collection.

8. Hand-off to Phase 3

Phase 3 measures:

M3-prelim: NEON ff_vp9_idct_idct_8x8_add_neon throughput on hertz, batched over 10⁶ random blocks, single-threaded, 4-thread, sched-isolated. This is the floor.
M5-prelim: Vulkan dispatch overhead — pipeline create cost (one-time), per-vkCmdDispatch cost (per-frame-equivalent), per-vkQueueSubmit + vkQueueWaitIdle cost (per-completion). Bound below which kernel batching is mandatory.

Both are measurements on the existing substrate. Neither requires writing any shader code. Phase 3 closes before Phase 4 (plan) begins.

11 KiB Raw Permalink Blame History Unescape Escape