From 4320d7860f68e344c5ee076b1c226fcce6df171d Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Wed, 13 May 2026 11:48:41 +0000
Subject: [PATCH] =?UTF-8?q?iter8=20Phase=203:=20empirical=20Bug=204=20rede?=
 =?UTF-8?q?finition=20=E2=80=94=20partial-fill,=20not=20inter=20race?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 3 strace + byte-level analysis on fresnel rkvdec. Findings:

1. Bug 4 is NOT inter-race-loss. The IDR keyframe itself fails through
   libva (only 512 bytes of real Y data at top-left 16x32 region).
2. The 16x32 leak is structured real image content (smooth gradients,
   neutral luma ~0x80): kernel decoded one tile / one MB pair, then
   stopped.
3. VP9 via libva WORKS through the same readback path (100% non-zero,
   real image data). So the bug isn't generic DMA-BUF cache coherency.
4. HEVC fails via libva (all-zero, distinct from H.264 partial-fill).
5. OUTPUT sizeimage = 1MB (SOURCE_SIZE_MAX) is sufficient: BBB IDR is
   only 6321 bytes. Not the bug.
6. Control payload diffs: SPS.constraint_set_flags = 0 vs kdirect's 2
   (probably cosmetic); DECODE_PARAMS.dpb[0] top_field_order_cnt and
   bottom_field_order_cnt = 0 vs kdirect's 65536 (load-bearing for POC).

Refined hypothesis: a specific H.264 control field libva sends causes
rkvdec to abort after partial decode.

Phase 4 candidates: α fix POC fields, β bump OUTPUT sizeimage,
γ instrumentation dump, δ relative timestamps.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 phase3_iter8_findings.md | 170 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 170 insertions(+)
 create mode 100644 phase3_iter8_findings.md

diff --git a/phase3_iter8_findings.md b/phase3_iter8_findings.md
new file mode 100644
index 0000000..ac32261
--- /dev/null
+++ b/phase3_iter8_findings.md
+# Iteration 8: Phase 3 (baseline empirical findings)
+
+Captured 2026-05-13 on fresnel (via `fresnel.vpn`). iter7 fork tip `6df2159`, backend SHA `520507f6…`.
+Test corpus: `~/fourier-test/bbb_1080p30_h264.mp4`, 3-frame sweep.
+
+## Anchors
+
+| Path | Hash |
+|---|---|
+| libva H.264 (broken) | `71ac099b8d007836385b6776e6bbf891ddd7b79caad66775ff1fbb85657fb349` |
+| kdirect H.264 (correct) | `1e7a0bc98d5bd83f412b9fdb57fa1e30fd777775c5ab89ad5be1cc26cd61c81d` |
+
+Both runs use `/dev/video1` (rkvdec). kdirect via `-hwaccel v4l2request`; libva via `-hwaccel vaapi` + `LIBVA_DRIVER_NAME=v4l2_request`.
+
+## Big-picture finding
+
+**libva H.264 has a structured PARTIAL-FILL pattern that is highly localized.** Frame 1's Y plane is overwhelmingly zero, with **exactly 16 columns × 32 rows = 512 bytes of real-looking image content** at the top-left of the Y plane. Frames 2 and 3 have no Y-plane content at all (32 and 16 sparse bytes in the U plane, every 4th byte).
+
+Row-by-row dump (libva frame 1, first 16 columns of each Y row that has non-zero data):
+
+```
+row 0 offset 0: 81818080807f7f7f7f7f7f8080808181
+row 1 offset 1920: 81818080807f7f7f7f7f7f8080808181
+row 2 offset 3840: 81818080807f7e7e7e7e7f8080808181
+row 3 offset 5760: 82828080807f7e7e7e7e7f8080808282
+row 4 offset 7680: 828281817f7e7e7e7e7e7e7f81818282
+...
+row 31 offset 59520: 80807f7e7e7e82837d7e828282818080
+```
+
+These are not random or zero: values cluster around `0x7c..0x83` (luma neutral with subtle gradient). Adjacent rows are similar (smooth vertical/horizontal gradients). **This is real decoded image data, just confined to a 16×32 patch.**
+
+kdirect frame 1 fills the entire Y plane (1920×1088) with proper Y values; the 0x7c-0x83 cluster in libva is not in kdirect's frame 1 (kdirect has Y≈0x10 black throughout the frame, with U=V=0x80 chroma neutral).
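The "16 columns × 32 rows = 512 bytes" measurement can be reproduced mechanically. The following is a minimal sketch, not the tool used for the capture: `nonzero_bbox` and `synthetic_plane` are hypothetical helper names, and the synthetic plane uses a constant 0x80 fill rather than the real row data.

```python
from typing import Optional, Tuple


def nonzero_bbox(plane: bytes, stride: int, height: int) -> Optional[Tuple[int, int, int, int, int]]:
    """Return (top, left, rows, cols, count) for the non-zero bytes in a plane.

    Returns None when the plane is entirely zero.
    """
    top = bottom = left = right = None
    count = 0
    for row in range(height):
        line = plane[row * stride:(row + 1) * stride]
        nonzero = len(line) - line.count(0)
        if nonzero == 0:
            continue  # all-zero row, nothing decoded here
        first = len(line) - len(line.lstrip(b"\x00"))  # first non-zero column
        last = len(line.rstrip(b"\x00")) - 1           # last non-zero column
        count += nonzero
        top = row if top is None else top
        bottom = row
        left = first if left is None else min(left, first)
        right = last if right is None else max(right, last)
    if top is None:
        return None
    return (top, left, bottom - top + 1, right - left + 1, count)


def synthetic_plane(width: int, height: int, patch_cols: int, patch_rows: int, value: int = 0x80) -> bytes:
    """Build a zeroed plane with a filled top-left patch (test data only)."""
    plane = bytearray(width * height)
    for row in range(patch_rows):
        start = row * width
        plane[start:start + patch_cols] = bytes([value]) * patch_cols
    return bytes(plane)


if __name__ == "__main__":
    y_plane = synthetic_plane(1920, 1088, patch_cols=16, patch_rows=32)
    print(nonzero_bbox(y_plane, stride=1920, height=1088))
```

On a real dump, the same function would be fed the first `stride * height` bytes of the frame. A bounding box of 32 rows × 16 columns with a count of 512 corresponds to the pattern described above, while a fully decoded frame would span the whole plane.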
+
+## Per-codec comparison (same backend, same kernel, same readback path)
+
+| Codec | Profile | Resolution | NV12 sizeimage | Result | Frame 1 non-zero bytes |
+|---|---|---|---|---|---|
+| VP9 | VAProfileVP9Profile0 | 1280×768 | 1,966,080 | **WORKS** (PASS direct since iter5b-β) | 100% of frame (real image) |
+| HEVC | VAProfileHEVCMain | 1280×720 | 1,843,200 | FAILS (Bug 5; all-zero) | 0% |
+| H.264 | VAProfileH264High | 1920×1088 | 4,177,920 | FAILS (Bug 4; structured 16×32 leak) | 512 bytes structured |
+
+**H.264 and HEVC fail via libva. VP9 succeeds via libva.** All three use rkvdec on RK3399, same NV12 output, same backend code post-CreateContext.
+
+The DMA-BUF cache-coherency blocker per `reference_dmabuf_resv_blocker.md` doesn't fit cleanly: VP9 succeeds through the SAME readback path. So the issue is per-codec, not a generic backend↔kernel cache problem.
+
+## V4L2 format negotiation
+
+### libva (H.264, FAILS):
+```
+S_FMT OUTPUT_MPLANE: 1920x1088 H264_SLICE, sizeimage=1048576 (default 1MB from SOURCE_SIZE_MAX)
+G_FMT CAPTURE_MPLANE: 1920x1088 NV12, sizeimage=4177920, bytesperline=1920, planes=1
+```
+
+### kdirect (H.264, WORKS):
+```
+S_FMT OUTPUT_MPLANE: 1920x1088 H264_SLICE, sizeimage=3133440 (explicit 3MB)
+S_FMT CAPTURE_MPLANE: 1920x1088 NV12, sizeimage=4177920, bytesperline=1920, planes=1
+```
+
+OUTPUT sizeimage differs (1MB vs 3MB), but **BBB 1080p30 frame 1 IDR is only 6321 bytes** (per ffprobe). Frames 4 and 5 are larger (248KB, 144KB) but still well under 1MB. **OUTPUT sizeimage 1MB is sufficient. Not the bug.**
+
+CAPTURE format matches between libva and kdirect: both NV12, bytesperline=1920, sizeimage=4177920. kdirect explicitly S_FMTs CAPTURE; libva only G_FMTs it (kernel auto-derived from OUTPUT). Same result.
+
+`SOURCE_SIZE_MAX` is hardcoded to `(1024 * 1024)` at v4l2.h:30. Worth bumping for future-proofing 4K/8K, but not the Bug-4 root cause.
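The sizeimage figures above can be cross-checked with plain arithmetic. This is a hedged sketch: the 128 bytes per 16×16 macroblock term is an assumption that happens to reproduce all three CAPTURE values in the per-codec table; it is not read from the strace, and `capture_sizeimage` is a hypothetical helper name.

```python
def nv12_payload(width: int, height: int) -> int:
    """Bytes of visible NV12 data: full-size Y plane plus half-size interleaved UV."""
    return width * height * 3 // 2


def capture_sizeimage(width: int, height: int, per_mb_extra: int = 128) -> int:
    """NV12 payload plus an assumed per-macroblock side buffer (128 B per 16x16 MB)."""
    mbs_wide = -(-width // 16)   # ceiling division
    mbs_high = -(-height // 16)
    return nv12_payload(width, height) + per_mb_extra * mbs_wide * mbs_high


SOURCE_SIZE_MAX = 1024 * 1024    # libva OUTPUT sizeimage from the strace
BBB_IDR_BYTES = 6321             # frame 1 size per ffprobe

if __name__ == "__main__":
    for name, (w, h) in {"VP9": (1280, 768), "HEVC": (1280, 720), "H.264": (1920, 1088)}.items():
        print(name, capture_sizeimage(w, h))
    print("kdirect OUTPUT sizeimage candidate:", nv12_payload(1920, 1088))
    print("IDR fits in 1MB OUTPUT buffer:", BBB_IDR_BYTES <= SOURCE_SIZE_MAX)
```

Under that assumption the 1920×1088 case gives 3,133,440 + 1,044,480 = 4,177,920, and kdirect's explicit OUTPUT sizeimage of 3,133,440 equals the bare NV12 payload for the same resolution.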
+
+## Control payload comparison (frame 1 IDR)
+
+Both call `S_EXT_CTRLS` with count=4: SPS (0xa40902, 1048B), PPS (0xa40903, 12B), DECODE_PARAMS (0xa40907, 560B), SCALING_MATRIX (0xa40904, 480B).
+
+### SPS first 8 bytes:
+| Backend | Hex | profile_idc | constraint_set_flags | level_idc | seq_parameter_set_id | chroma_format_idc | bit_depth_luma_minus8 | bit_depth_chroma_minus8 | log2_max_frame_num_minus4 |
+|---|---|---|---|---|---|---|---|---|---|
+| libva | `4d 00 29 00 01 00 00 01` | 77 (Main) | 0 | 41 | 0 | 1 (4:2:0) | 0 | 0 | 1 |
+| kdirect | `4d 02 29 00 01 00 00 01` | 77 (Main) | 2 | 41 | 0 | 1 (4:2:0) | 0 | 0 | 1 |
+
+**Diff**: `constraint_set_flags` is 0x00 (libva) vs 0x02 (kdirect; 0x02 is the constraint_set1 bit). Probably cosmetic: informational SPS bit, not load-bearing for decode.
+
+### PPS first bytes: identical (`00 00 00 00 00 00 02 00 00 00 8a 00`).
+
+### DECODE_PARAMS (frame 1 IDR): both all-zero (correct: IDR has no references).
+
+### SCALING_MATRIX: identical (flat 16/16 default in both).
+
+### DECODE_PARAMS (frame 2 P, libva): `c8 94 2c 1b ae 1c af 18 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00`
+- bytes 0–7 (reference_ts u64): `0x18AF1CAE_1B2C94C8` = large ns from `gettimeofday`.
+- bytes 8–11 (pic_num u32): 0.
+- bytes 12–13 (frame_num u16): 0.
+- byte 14 (fields u8): 3 = V4L2_H264_FRAME_REF.
+- bytes 15–19 (reserved u8[5]): 0.
+- bytes 20–23 (top_field_order_cnt s32): 0.
+- bytes 24–27 (bottom_field_order_cnt s32): 0.
+- bytes 28–31 (flags u32): 3 = VALID|ACTIVE.
+
+### DECODE_PARAMS (frame 2 P, kdirect): `f8 2a 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01 00 03 00 00 00`
+- bytes 0–7 (reference_ts u64): `0x00000000_00002AF8` = 11000, small relative ns (kdirect uses a relative timestamp scheme).
+- bytes 12–13 (frame_num u16): 0.
+- byte 14 (fields): 3.
+- bytes 20–23 (top_field_order_cnt s32): `00 00 01 00` little-endian = 0x00010000 = 65536.
+- bytes 24–27 (bottom_field_order_cnt s32): `00 00 01 00` little-endian = 0x00010000 = 65536.
+- bytes 28–31 (flags u32): 3 = VALID|ACTIVE.
+
+Offsets follow the 32-byte `v4l2_h264_dpb_entry` layout, in which `fields` is followed by 5 reserved bytes. That entry size is consistent with the 560B DECODE_PARAMS payload: 16 DPB entries × 32B = 512B, plus 48B of trailing fields.
+
+**Diffs that may matter**:
+1. **top_field_order_cnt / bottom_field_order_cnt**: libva=0/0, kdirect=65536/65536. The kernel's reflist builder may use these for POC computation.
+2. **reference_ts**: libva sends an absolute `gettimeofday`-derived ns value, kdirect a small relative one.
+
+Per the UAPI definition of `v4l2_h264_dpb_entry.flags`:
+- bit 0: V4L2_H264_DPB_ENTRY_FLAG_VALID
+- bit 1: V4L2_H264_DPB_ENTRY_FLAG_ACTIVE
+- bit 2: V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM
+- bit 3: V4L2_H264_DPB_ENTRY_FLAG_FIELD
+
+Both backends send `flags` = 0x00000003 (VALID|ACTIVE); no undefined bits are set, so `flags` is not a diff.
+
+## Buffer plumbing
+
+- 24 CAPTURE mmaps via fd=5 (the libva video fd), each 4177920 bytes, which matches cap_pool's `24 slots ready, 1 plane(s) per slot`.
+- Each MEDIA_REQUEST_IOC_QUEUE returns 0.
+- DQBUF OUTPUT returns without FLAG_ERROR.
+- DQBUF CAPTURE returns without FLAG_ERROR.
+- No dmesg rkvdec errors during the run.
+
+## What the 16×32 leak tells us
+
+The structured 16×32 region in libva H.264 frame 1 Y plane is:
+- **Real decoded image data**: smooth gradients, neutral luma cluster around 0x80 (gray).
+- **Confined to a single block**: exactly 16 columns × 32 rows at the top-left of the Y plane. Could be one 16x16 luma macroblock × 2 (16x32 = macroblock pair) OR a single tile of a hypothetical NV12 tiled output format.
+
+Two scenarios fit:
+- **A: Kernel wrote one tile then stopped**: the kernel decoded one tile / one macroblock pair of the frame and then aborted (e.g., due to a control validation that fires per-tile). The rest is whatever the buffer held prior (mostly zero, but bytes 2073600..2073660 in frames 2/3 have stride-4 patterns suggesting some kernel scratch usage).
+- **B: Kernel wrote linear NV12 but only the top-left region**: the bitstream slice data was truncated, kernel decoded only what it could read.
+
+(B) doesn't fit: BBB frame 1 is 6321 bytes total, far smaller than the 1MB OUTPUT buffer, so there is no truncation.
+
+**(A) is the leading hypothesis**: kernel rejects/aborts after partial decode. WHY rkvdec aborts for libva's H.264 but not kdirect's is the next question.
+
+## Refined hypothesis ranking
+
+- **H-O (NEW, leading)**: A specific H.264 control field libva sends differs from kdirect in a way that causes rkvdec to reject the request after partial decode. Candidates:
+  - SPS.constraint_set_flags (libva=0, kdirect=2): probably cosmetic but easy to test.
+  - DECODE_PARAMS.dpb[].top_field_order_cnt / bottom_field_order_cnt (libva=0/0, kdirect=65536/65536): meaningful for POC-based reference resolution. Affects inter frames more than IDR; but if libva's H.264 frame 1 (IDR) already partially decodes, maybe it's a frame-2-rejection that cascades.
+  - DECODE_PARAMS.dpb[].reference_ts (libva absolute ns, kdirect small relative ns): unlikely but cheap to test.
+- **H-P (NEW)**: rkvdec on RK3399 in linux-fresnel-fourier 7.0-1 has a per-codec quirk for H.264+HEVC that VP9 doesn't trigger. The quirk may be exercised by some specific control combination or buffer alignment.
+- **H-Q (downgraded)**: SOURCE_SIZE_MAX=1MB. Empirically eliminated (BBB IDR=6321B).
+- **H-R (NEW, alternative)**: The libva backend's cap_pool slot mmap doesn't expose the same memory region rkvdec writes to (per-codec memory pool). The mmap'd region for H.264 CAPTURE may be the same virtual address but a different kernel-side dma region. Per-codec CAPTURE buffer setup may have been broken at iter5b-β CreateContext refactor for H.264, but not for VP9.
+
+## Phase 3 conclusions
+
+1. **Bug 4 is NOT primarily an inter-frame race**. Even the keyframe (frame 1, IDR) fails to decode fully via libva. The name "inter-frame race-loss" was carryover terminology and is inaccurate; the bug pattern is **partial-write at top-left + nothing elsewhere**.
+2. **The kernel is writing SOMETHING**: the 16×32 leak shows real decode content (not garbage). The decode engine starts then stops.
+3. **The same readback path works for VP9** (libva, NV12, 1280x768), so the bug isn't generic cache coherency in the libva backend.
+4. **Control payload differences are small** (constraint_set_flags + the two DPB field-order counts + reference_ts scale), but they may be load-bearing for rkvdec's H.264 path specifically.
+5. **OUTPUT sizeimage = 1MB is sufficient** for BBB frame sizes. Not the bottleneck.
+
+## Phase 4 plan candidates
+
+Phase 4 plan (to be drafted next):
+
+- **Option α**: Fix libva H.264 SPS constraint_set_flags + DECODE_PARAMS POC fields to match kdirect. Smallest possible change. ~10 LOC.
+- **Option β**: Bump SOURCE_SIZE_MAX OUTPUT size to 3MB to match kdirect (preemptive, even though not strictly Bug 4's root). ~1 LOC.
+- **Option γ**: Add an instrumentation patch that dumps the CAPTURE buffer immediately after DQBUF (in surface.c::RequestSyncSurface, before cap_pool_mark_decoded). This would distinguish whether the kernel wrote correct data and libva mis-reads it, or whether the kernel didn't write it. ~30 LOC. Diagnostic-only; not shipped.
+- **Option δ**: Adopt kdirect's relative timestamp scheme: replace `gettimeofday` with a monotonic-relative counter so reference_ts values are small integers. The kernel's vb2_find_buffer_by_timestamp may be sensitive to value range (unlikely but cheap to test). ~5 LOC.
+
+Best Phase 4 sequence: **diagnostic γ FIRST** to narrow whether the bug is on kernel write or libva read side. Then α/β/δ as targeted fixes.
+
+## Phase 4 sequence proposal
+
+Phase 4a: write α (fix the small control diffs) as a try-and-see: 10 LOC change, low risk, may not work but cheap to test. Phase 5 reviewer validates.
+
+If α fails in Phase 7: Phase 4b adds γ (diagnostic dump) to definitively distinguish "kernel didn't write" from "libva mis-reads". If kernel didn't write → focus on remaining wire-protocol differences. If libva mis-reads → focus on mmap/offset/cache.
+
+## Substrate state at Phase 3 close
+
+- Fork tip `6df2159` (iter7), backend `520507f6…`. Unchanged.
+- Kernel `linux-fresnel-fourier 7.0-1`. Unchanged.
+- Artifacts at fresnel `/tmp/iter8_p3/`: libva strace, kdirect strace, libva H.264 YUV (broken), kdirect H.264 YUV (correct), VP9/HEVC YUV for codec comparison.
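For rechecking the frame-2 DECODE_PARAMS dumps quoted in the control payload comparison, the first DPB entry can be unpacked mechanically instead of by hand. This is a sketch under a stated assumption: it uses the 32-byte `v4l2_h264_dpb_entry` layout with 5 reserved bytes after `fields`, which should be confirmed against the UAPI header of the kernel actually running on fresnel. `decode_dpb_entry` is a hypothetical helper name.

```python
import struct

# Assumed layout (little-endian, no implicit padding), 32 bytes total:
#   u64 reference_ts, u32 pic_num, u16 frame_num, u8 fields, u8 reserved[5],
#   s32 top_field_order_cnt, s32 bottom_field_order_cnt, u32 flags
DPB_ENTRY = struct.Struct("<QIHB5xiiI")
FIELD_NAMES = (
    "reference_ts", "pic_num", "frame_num", "fields",
    "top_field_order_cnt", "bottom_field_order_cnt", "flags",
)

# First 32 bytes of DECODE_PARAMS for frame 2, copied from the strace findings.
LIBVA_HEX = ("c8 94 2c 1b ae 1c af 18 00 00 00 00 00 00 03 00 "
             "00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00")
KDIRECT_HEX = ("f8 2a 00 00 00 00 00 00 00 00 00 00 00 00 03 00 "
               "00 00 00 00 00 00 01 00 00 00 01 00 03 00 00 00")


def decode_dpb_entry(hex_dump: str) -> dict:
    """Decode one DPB entry from a space-separated hex dump."""
    raw = bytes.fromhex(hex_dump)
    if len(raw) != DPB_ENTRY.size:
        raise ValueError(f"expected {DPB_ENTRY.size} bytes, got {len(raw)}")
    return dict(zip(FIELD_NAMES, DPB_ENTRY.unpack(raw)))


if __name__ == "__main__":
    for name, dump in (("libva", LIBVA_HEX), ("kdirect", KDIRECT_HEX)):
        print(name, decode_dpb_entry(dump))
```

Under that layout both entries carry `flags` = 3, and kdirect's two field-order counts come out as 65536 against libva's 0. Running the same decoder over all 16 DPB slots of each captured payload would show whether any other entry differs.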