fresnel-fourier/phase3_iter8_findings.md
marfrit 4320d7860f iter8 Phase 3: empirical Bug 4 redefinition — partial-fill, not inter race
Phase 3 strace + byte-level analysis on fresnel rkvdec. Findings:

1. Bug 4 is NOT inter-race-loss. The IDR keyframe itself fails through
   libva (only 512 bytes of real Y data at top-left 16x32 region).
2. The 16x32 leak is structured real image content (smooth gradients,
   neutral luma ~0x80) — kernel decoded one tile / one MB pair, then
   stopped.
3. VP9 via libva WORKS through the same readback path (100% non-zero,
   real image data). So the bug isn't generic DMA-BUF cache coherency.
4. HEVC fails via libva (all-zero, distinct from H.264 partial-fill).
5. OUTPUT sizeimage = 1MB (SOURCE_SIZE_MAX) is sufficient — BBB IDR is
   only 6321 bytes. Not the bug.
6. Control payload diffs: SPS.constraint_set_flags = 0 vs kdirect's 2
   (probably cosmetic); DECODE_PARAMS.dpb[0].bottom_field_order_cnt = 0
   vs kdirect's 1 (load-bearing for POC).

Refined hypothesis: a specific H.264 control field libva sends causes
rkvdec to abort after partial decode. Phase 4 candidates: α fix POC
fields, β bump OUTPUT sizeimage, γ instrumentation dump, δ relative
timestamps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:48:41 +00:00

# Iteration 8 — Phase 3 (baseline empirical findings)
Captured 2026-05-13 on fresnel (via `fresnel.vpn`). iter7 fork tip `6df2159`, backend SHA `520507f6…`. Test corpus: `~/fourier-test/bbb_1080p30_h264.mp4`, 3-frame sweep.
## Anchors
| Path | Hash |
|---|---|
| libva H.264 (broken) | `71ac099b8d007836385b6776e6bbf891ddd7b79caad66775ff1fbb85657fb349` |
| kdirect H.264 (correct) | `1e7a0bc98d5bd83f412b9fdb57fa1e30fd777775c5ab89ad5be1cc26cd61c81d` |
Both runs use `/dev/video1` (rkvdec). kdirect via `-hwaccel v4l2request`; libva via `-hwaccel vaapi` + `LIBVA_DRIVER_NAME=v4l2_request`.
## Big-picture finding
**libva H.264 has a structured PARTIAL-FILL pattern that is highly localized.** Frame 1's Y plane is overwhelmingly zero, with **exactly 16 columns × 32 rows = 512 bytes of real-looking image content** at the top-left of the Y plane. Frames 2 and 3 have no Y-plane content at all; their only non-zero data is 32 and 16 sparse bytes, respectively, in the U plane (every 4th byte).
Row-by-row dump (libva frame 1, first 16 columns of each Y row that has non-zero data):
```
row 0 offset 0: 81818080807f7f7f7f7f7f8080808181
row 1 offset 1920: 81818080807f7f7f7f7f7f8080808181
row 2 offset 3840: 81818080807f7e7e7e7e7f8080808181
row 3 offset 5760: 82828080807f7e7e7e7e7f8080808282
row 4 offset 7680: 828281817f7e7e7e7e7e7e7f81818282
...
row 31 offset 59520: 80807f7e7e7e82837d7e828282818080
```
These are not random or zero — values cluster around `0x7c..0x83` (luma neutral with subtle gradient). Adjacent rows are similar (smooth vertical/horizontal gradients). **This is real decoded image data, just confined to a 16×32 patch.** kdirect frame 1 fills the entire Y plane (1920×1088) with proper Y values; the 0x7c-0x83 cluster in libva is not in kdirect's frame 1 (kdirect has Y≈0x10 black throughout the frame, with U=V=0x80 chroma neutral).
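The "exactly 16 × 32" claim can be re-checked mechanically. The following is a minimal sketch, assuming the Y plane is available as raw bytes with a stride equal to the width; `nonzero_bbox` is a hypothetical helper name, and the synthetic frame only stands in for the real dump under `/tmp/iter8_p3/`.

```python
def nonzero_bbox(plane: bytes, stride: int, height: int):
    """Bounding box and count of non-zero bytes in a plane, or None if all zero."""
    top = bottom = None
    left, right, count = stride, -1, 0
    for r in range(height):
        row = plane[r * stride:(r + 1) * stride]
        nz = len(row) - row.count(0)          # non-zero bytes in this row
        if nz == 0:
            continue
        if top is None:
            top = r
        bottom = r
        # First/last non-zero column via stripping leading/trailing zero bytes.
        left = min(left, len(row) - len(row.lstrip(b"\x00")))
        right = max(right, len(row.rstrip(b"\x00")) - 1)
        count += nz
    if top is None:
        return None
    return {"top": top, "bottom": bottom, "left": left, "right": right, "count": count}


# Synthetic stand-in: zeroed 1920x1080 luma plane with a 16x32 patch of 0x80.
WIDTH, HEIGHT = 1920, 1080
frame = bytearray(WIDTH * HEIGHT)
for r in range(32):
    frame[r * WIDTH:r * WIDTH + 16] = b"\x80" * 16

bbox = nonzero_bbox(bytes(frame), WIDTH, HEIGHT)
print(bbox)
```

Run against the real libva frame 1, a box of rows 0..31 and columns 0..15 with a count of 512 would confirm the patch has no holes and no stray bytes elsewhere in the plane.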
## Per-codec comparison (same backend, same kernel, same readback path)
| Codec | Profile | Resolution | NV12 sizeimage | Result | Frame 1 non-zero bytes |
|---|---|---|---|---|---|
| VP9 | VAProfileVP9Profile0 | 1280×768 | 1,966,080 | **WORKS** (PASS direct since iter5b-β) | 100% of frame (real image) |
| HEVC | VAProfileHEVCMain | 1280×720 | 1,843,200 | FAILS (Bug 5; all-zero) | 0% |
| H.264 | VAProfileH264High | 1920×1088 | 4,177,920 | FAILS (Bug 4; structured 16×32 leak) | 512 bytes structured |
**H.264 and HEVC fail via libva. VP9 succeeds via libva.** All three use rkvdec on RK3399, same NV12 output, same backend code post-CreateContext.
The DMA-BUF cache-coherency blocker per `reference_dmabuf_resv_blocker.md` doesn't fit cleanly — VP9 succeeds through the SAME readback path. So the issue is per-codec, not a generic backend↔kernel cache problem.
## V4L2 format negotiation
### libva (H.264, FAILS):
```
S_FMT OUTPUT_MPLANE: 1920x1088 H264_SLICE, sizeimage=1048576 (default 1MB from SOURCE_SIZE_MAX)
G_FMT CAPTURE_MPLANE: 1920x1088 NV12, sizeimage=4177920, bytesperline=1920, planes=1
```
### kdirect (H.264, WORKS):
```
S_FMT OUTPUT_MPLANE: 1920x1088 H264_SLICE, sizeimage=3133440 (explicit 3MB)
S_FMT CAPTURE_MPLANE: 1920x1088 NV12, sizeimage=4177920, bytesperline=1920, planes=1
```
OUTPUT sizeimage differs (1MB vs 3MB), but **BBB 1080p30 frame 1 IDR is only 6321 bytes** (per ffprobe). Frames 4 and 5 are larger (248KB, 144KB) but still well under 1MB. **OUTPUT sizeimage 1MB is sufficient. Not the bug.**
CAPTURE format matches between libva and kdirect: both NV12, bytesperline=1920, sizeimage=4177920. kdirect explicitly S_FMTs CAPTURE; libva only G_FMTs it (kernel auto-derived from OUTPUT). Same result.
`SOURCE_SIZE_MAX` is hardcoded to `(1024 * 1024)` at v4l2.h:30. Worth bumping for future-proofing 4K/8K, but not the Bug-4 root cause.
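The buffer sizes quoted above can be cross-checked with a little arithmetic. This is a sketch only: the NV12 payload formula is standard, but the extra 128 bytes per 16×16 macroblock in `capture_sizeimage_guess` is an assumption that happens to reproduce the observed CAPTURE sizes; it was not confirmed against the driver in this phase.

```python
SOURCE_SIZE_MAX = 1024 * 1024        # from the backend, per the text above
BBB_IDR_BYTES = 6321                 # frame 1 size per ffprobe


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def nv12_bytes(width: int, height: int) -> int:
    """Pure NV12 payload: full-size Y plane plus half-size interleaved UV."""
    return width * height * 3 // 2


def capture_sizeimage_guess(width: int, height: int, extra_per_mb: int = 128) -> int:
    """ASSUMPTION: NV12 payload plus a fixed per-macroblock tail."""
    mbs = ceil_div(width, 16) * ceil_div(height, 16)
    return nv12_bytes(width, height) + extra_per_mb * mbs


# CAPTURE sizeimage values observed in this document.
OBSERVED = {(1280, 768): 1_966_080, (1280, 720): 1_843_200, (1920, 1088): 4_177_920}

for (w, h), seen in OBSERVED.items():
    print(f"{w}x{h}: observed={seen} guess={capture_sizeimage_guess(w, h)}")

print("kdirect OUTPUT sizeimage:", nv12_bytes(1920, 1088))
print("IDR frames that fit in SOURCE_SIZE_MAX:", SOURCE_SIZE_MAX // BBB_IDR_BYTES)
```

Under these assumptions, kdirect's explicit 3MB OUTPUT size equals one uncompressed NV12 frame, which suggests it is a generous upper bound rather than a tuned value.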
## Control payload comparison (frame 1 IDR)
Both call `S_EXT_CTRLS` with count=4: SPS (0xa40902, 1048B), PPS (0xa40903, 12B), DECODE_PARAMS (0xa40907, 560B), SCALING_MATRIX (0xa40904, 480B).
### SPS first 8 bytes:
| Backend | Hex | profile_idc | constraint_set_flags | level_idc | seq_parameter_set_id | chroma_format_idc | bit_depth_luma_minus8 | bit_depth_chroma_minus8 | log2_max_frame_num_minus4 |
|---|---|---|---|---|---|---|---|---|---|
| libva | `4d 00 29 00 01 00 00 01` | 77 (Main) | 0 | 41 | 0 | 1 (4:2:0) | 0 | 0 | 1 |
| kdirect | `4d 02 29 00 01 00 00 01` | 77 (Main) | 2 | 41 | 0 | 1 (4:2:0) | 0 | 0 | 1 |
**Diff**: `constraint_set_flags` is 0x00 (libva) vs 0x02 (kdirect). Probably cosmetic — informational SPS bit, not load-bearing for decode.
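To make the one-byte difference explicit, the eight bytes can be unpacked by name. This sketch assumes the leading fields of `struct v4l2_ctrl_h264_sps` are eight consecutive `u8` values in the order listed in `SPS_HEAD_FIELDS`, and that bit *i* of `constraint_set_flags` maps to `constraint_set<i>_flag`; both should be checked against the installed `<linux/v4l2-controls.h>`.

```python
import struct

# Assumed order of the first eight u8 members of struct v4l2_ctrl_h264_sps.
SPS_HEAD_FIELDS = (
    "profile_idc", "constraint_set_flags", "level_idc", "seq_parameter_set_id",
    "chroma_format_idc", "bit_depth_luma_minus8", "bit_depth_chroma_minus8",
    "log2_max_frame_num_minus4",
)


def decode_sps_head(hex_dump: str) -> dict:
    """Unpack the first 8 bytes of the SPS control payload into named fields."""
    raw = bytes.fromhex(hex_dump)
    return dict(zip(SPS_HEAD_FIELDS, struct.unpack("<8B", raw[:8])))


def constraint_set_names(flags: int) -> list:
    """List the constraint_setN_flag bits that are set (N = 0..5)."""
    return [f"constraint_set{i}_flag" for i in range(6) if flags & (1 << i)]


libva_sps = decode_sps_head("4d 00 29 00 01 00 00 01")
kdirect_sps = decode_sps_head("4d 02 29 00 01 00 00 01")

differing = [k for k in SPS_HEAD_FIELDS if libva_sps[k] != kdirect_sps[k]]
print("differing fields:", differing)
print("kdirect constraint bits:", constraint_set_names(kdirect_sps["constraint_set_flags"]))
```

If the mapping holds, kdirect's 0x02 is `constraint_set1_flag`, a conformance hint from the bitstream that libva simply does not forward.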
### PPS first bytes: identical (`00 00 00 00 00 00 02 00 00 00 8a 00`).
### DECODE_PARAMS (frame 1 IDR): both all-zero (correct — IDR has no references).
### SCALING_MATRIX: identical (flat 16/16 default in both).
### DECODE_PARAMS (frame 2 P, libva): `c8 94 2c 1b ae 1c af 18 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00`
These 32 bytes are `dpb[0]`. Decoded against `struct v4l2_h264_dpb_entry` (`reference_ts` u64, `pic_num` u32, `frame_num` u16, `fields` u8, `reserved[5]`, `top_field_order_cnt` s32, `bottom_field_order_cnt` s32, `flags` u32):
- bytes 0-7 (reference_ts u64): `0x18AF1CAE_1B2C94C8` = large ns from `gettimeofday`.
- bytes 8-11 (pic_num u32): 0.
- bytes 12-13 (frame_num u16): 0.
- byte 14 (fields u8): 3 = V4L2_H264_FRAME_REF.
- bytes 15-19 (reserved): 0.
- bytes 20-23 (top_field_order_cnt s32): 0.
- bytes 24-27 (bottom_field_order_cnt s32): 0.
- bytes 28-31 (flags u32): 3 = VALID|ACTIVE.
### DECODE_PARAMS (frame 2 P, kdirect): `f8 2a 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01 00 03 00 00 00`
- bytes 0-7 (reference_ts u64): `0x00000000_00002AF8` = 11000, a small relative ns value (kdirect uses a relative timestamp scheme).
- bytes 8-11 (pic_num u32): 0.
- bytes 12-13 (frame_num u16): 0.
- byte 14 (fields u8): 3.
- bytes 15-19 (reserved): 0.
- bytes 20-23 (top_field_order_cnt s32): `00 00 01 00` little-endian = 0x00010000 = 65536.
- bytes 24-27 (bottom_field_order_cnt s32): `00 00 01 00` little-endian = 0x00010000 = 65536.
- bytes 28-31 (flags u32): 3 = VALID|ACTIVE.
The five reserved bytes after `fields` push the POC fields to bytes 20-27 and `flags` to bytes 28-31. Reading `flags` four bytes early (at bytes 24-27) picks up part of `bottom_field_order_cnt` and produces a spurious value with an apparently undefined bit 16 set.
**Diffs that may matter**:
1. **top_field_order_cnt / bottom_field_order_cnt**: libva=0 for both, kdirect=65536 for both. The kernel's reflist builder may use these for POC computation. The kdirect value may be a constant POC bias (1<<16) rather than a meaningful delta; that is an assumption and was not checked against kdirect's source.
2. **flags**: identical, 0x00000003 in both.
Per `<linux/v4l2-controls.h>` `v4l2_h264_dpb_entry.flags`:
- bit 0: V4L2_H264_DPB_ENTRY_FLAG_VALID
- bit 1: V4L2_H264_DPB_ENTRY_FLAG_ACTIVE
- bit 2: V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM
- bit 3: V4L2_H264_DPB_ENTRY_FLAG_FIELD
No bit above bit 3 is defined, and with the layout above neither backend sets one.
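Hand-counting byte offsets in a 32-byte dump is error-prone, so it may be worth unpacking both dumps against the struct layout directly. The sketch below assumes the `struct v4l2_h264_dpb_entry` layout is `u64, u32, u16, u8, u8[5] reserved, s32, s32, u32`, little-endian and unpadded; the format string should be re-verified against the kernel headers actually installed on fresnel.

```python
import struct

# Assumed layout of struct v4l2_h264_dpb_entry (32 bytes, little-endian):
# reference_ts u64, pic_num u32, frame_num u16, fields u8, reserved[5],
# top_field_order_cnt s32, bottom_field_order_cnt s32, flags u32.
DPB_ENTRY = struct.Struct("<QIHB5xiiI")
FIELDS = ("reference_ts", "pic_num", "frame_num", "fields",
          "top_field_order_cnt", "bottom_field_order_cnt", "flags")

LIBVA_HEX = ("c8 94 2c 1b ae 1c af 18 00 00 00 00 00 00 03 00 "
             "00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00")
KDIRECT_HEX = ("f8 2a 00 00 00 00 00 00 00 00 00 00 00 00 03 00 "
               "00 00 00 00 00 00 01 00 00 00 01 00 03 00 00 00")


def decode_dpb_entry(hex_dump: str) -> dict:
    """Unpack one DPB entry from a whitespace-separated hex dump."""
    raw = bytes.fromhex(hex_dump)
    if len(raw) != DPB_ENTRY.size:
        raise ValueError(f"expected {DPB_ENTRY.size} bytes, got {len(raw)}")
    return dict(zip(FIELDS, DPB_ENTRY.unpack(raw)))


libva = decode_dpb_entry(LIBVA_HEX)
kdirect = decode_dpb_entry(KDIRECT_HEX)

for name in FIELDS:
    marker = "  <-- differs" if libva[name] != kdirect[name] else ""
    print(f"{name:24s} libva={libva[name]:#x}  kdirect={kdirect[name]:#x}{marker}")
```

With this layout, the fields that differ between the two backends are `reference_ts` and the two field order counts; `flags` decodes to VALID|ACTIVE on both sides.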
## Buffer plumbing
- 24 CAPTURE mmaps via fd=5 (the libva video fd), each 4177920 bytes — matches cap_pool's `24 slots ready, 1 plane(s) per slot`.
- Each MEDIA_REQUEST_IOC_QUEUE returns 0.
- DQBUF OUTPUT returns without FLAG_ERROR.
- DQBUF CAPTURE returns without FLAG_ERROR.
- No dmesg rkvdec errors during the run.
## What the 16×32 leak tells us
The structured 16×32 region in libva H.264 frame 1 Y plane is:
- **Real decoded image data**: smooth gradients, neutral luma cluster around 0x80 (gray).
- **Confined to a single block**: exactly 16 columns × 32 rows at the top-left of the Y plane. Could be one 16x16 luma macroblock × 2 (16x32 = macroblock pair) OR a single tile of a hypothetical NV12 tiled output format.
Two scenarios fit:
- **A — Kernel wrote one tile then stopped**: the kernel decoded one tile / one macroblock pair of the frame and then aborted (e.g., due to a control validation that fires per-tile). The rest is whatever the buffer held prior (mostly zero, but bytes 2073600..2073660 in frames 2/3 have stride-4 patterns suggesting some kernel scratch usage).
- **B — Kernel wrote linear NV12 but only the top-left region**: the bitstream slice data was truncated, kernel decoded only what it could read.
(B) doesn't fit — BBB frame 1 is 6321 bytes total, far smaller than the 1MB OUTPUT buffer; no truncation.
**(A) is the leading hypothesis**: kernel rejects/aborts after partial decode. WHY rkvdec aborts for libva's H.264 but not kdirect's is the next question.
## Refined hypothesis ranking
- **H-O (NEW, leading)**: A specific H.264 control field libva sends differs from kdirect in a way that causes rkvdec to reject the request after partial decode. Candidates:
  - SPS.constraint_set_flags (libva=0, kdirect=2): probably cosmetic but easy to test.
  - DECODE_PARAMS.dpb[] POC fields, notably bottom_field_order_cnt (libva=0, kdirect non-zero): meaningful for POC-based reference resolution. This affects inter frames more than the IDR, so on its own it does not explain frame 1 failing; it fits only if a frame-2 rejection somehow cascades back.
  - DECODE_PARAMS.dpb[].flags upper bits (kdirect bit 16 apparently set in frame 2): needs a UAPI reread. The bit disappears if `flags` is read at bytes 28-31 of the 32-byte entry, after `reserved[5]` and the two POC fields.
- **H-P (NEW)**: rkvdec on RK3399 in linux-fresnel-fourier 7.0-1 has a per-codec quirk for H.264+HEVC that VP9 doesn't trigger. The quirk may be exercised by some specific control combination or buffer alignment.
- **H-Q (downgraded)**: SOURCE_SIZE_MAX=1MB. Empirically eliminated (BBB IDR=6321B).
- **H-R (NEW, alternative)**: The libva backend's cap_pool slot mmap doesn't expose the same memory region rkvdec writes to (per-codec memory pool). The mmap'd region for H.264 CAPTURE may be the same virtual address but a different kernel-side dma region. Per-codec CAPTURE buffer setup may have been broken at iter5b-β CreateContext refactor for H.264, but not for VP9.
## Phase 3 conclusions
1. **Bug 4 is NOT primarily an inter-frame race**. Even the keyframe (frame 1, IDR) fails to decode fully via libva. Calling it "inter-frame race-loss" was inaccurate carryover terminology; the bug pattern is **partial-write at top-left + nothing elsewhere**.
2. **The kernel is writing SOMETHING** — the 16×32 leak shows real decode content (not garbage). The decode engine starts then stops.
3. **The same readback path works for VP9** (libva, NV12, 1280x768) — so the bug isn't generic cache coherency in the libva backend.
4. **Control payload differences are small** (constraint_set_flags plus the dpb[0] POC data), but they may be load-bearing for rkvdec's H.264 path specifically.
5. **OUTPUT sizeimage = 1MB is sufficient** for BBB frame sizes. Not the bottleneck.
## Phase 4 plan candidates
Phase 4 plan (to be drafted next):
- **Option α**: Fix libva H.264 SPS constraint_set_flags + DECODE_PARAMS POC fields to match kdirect. Smallest possible change. ~10 LOC.
- **Option β**: Bump SOURCE_SIZE_MAX OUTPUT size to 3MB to match kdirect (preemptive, even though not strictly Bug 4's root). ~1 LOC.
- **Option γ**: Add an instrumentation patch that dumps the CAPTURE buffer immediately after DQBUF (in surface.c::RequestSyncSurface, before cap_pool_mark_decoded). This would distinguish whether the kernel wrote correct data and libva mis-reads it, or whether the kernel didn't write it. ~30 LOC. Diagnostic-only; not shipped.
- **Option δ**: kdirect's relative timestamp scheme. Replace `gettimeofday` with a monotonic-relative counter so reference_ts values are small integers. The kernel's vb2_find_buffer_by_timestamp may be sensitive to value range (unlikely but cheap to test). ~5 LOC.
Best Phase 4 sequence: **diagnostic γ FIRST** to narrow whether the bug is on kernel write or libva read side. Then α/β/δ as targeted fixes.
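Option δ amounts to replacing wall-clock timestamps with a small counter. A minimal sketch of the idea follows, assuming the OUTPUT buffer timestamp is passed as a `(tv_sec, tv_usec)` pair and that the kernel derives `reference_ts` as `tv_sec * 1e9 + tv_usec * 1000`; `RelativeClock` is a hypothetical name, not something in the backend today.

```python
NSEC_PER_SEC = 1_000_000_000
USEC_PER_SEC = 1_000_000


def timeval_to_ns(tv_sec: int, tv_usec: int) -> int:
    """Assumed kernel-side conversion used to match reference_ts to a buffer."""
    return tv_sec * NSEC_PER_SEC + tv_usec * 1000


class RelativeClock:
    """Hypothetical: hand out one microsecond tick per submitted OUTPUT buffer."""

    def __init__(self) -> None:
        self._tick = 0

    def next_timeval(self) -> tuple:
        self._tick += 1
        return divmod(self._tick, USEC_PER_SEC)   # (tv_sec, tv_usec)


clock = RelativeClock()
stamps = [clock.next_timeval() for _ in range(3)]
refs = [timeval_to_ns(sec, usec) for sec, usec in stamps]
print(stamps, refs)

# For scale: kdirect's observed reference_ts of 0x2AF8 ns corresponds to tv_usec = 11.
print(timeval_to_ns(0, 11) == 0x2AF8)
```

The property that matters is that each in-flight buffer gets a unique timestamp and that the value written into `dpb[].reference_ts` is the converted form of exactly the timeval that was queued; the absolute magnitude should be irrelevant if lookup is by equality.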
## Phase 4 sequence proposal
Phase 4a: write α (fix the small control diffs) as a try-and-see — 10 LOC change, low risk, may not work but cheap to test. Phase 5 reviewer validates.
If α fails Phase 7: Phase 4b adds γ (diagnostic dump) to definitively distinguish "kernel didn't write" from "libva mis-reads". If kernel didn't write → focus on remaining wire-protocol differences. If libva mis-reads → focus on mmap/offset/cache.
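The γ decision above can be reduced to comparing two dumps of the same frame: one taken right after DQBUF and one from the normal libva readback. The sketch below is illustrative only; `classify`, the 0.5 threshold, and the assumption that stride equals width are all hypothetical choices, and a legitimately all-zero luma plane would defeat the non-zero heuristic.

```python
def luma_fill_ratio(buf: bytes, width: int, height: int) -> float:
    """Fraction of non-zero bytes in the Y plane (assumes stride == width)."""
    y = buf[:width * height]
    return (len(y) - y.count(0)) / len(y)


def classify(dqbuf_dump: bytes, libva_readback: bytes,
             width: int, height: int, threshold: float = 0.5) -> str:
    """Decide which side of the kernel/libva boundary lost the frame."""
    kernel_fill = luma_fill_ratio(dqbuf_dump, width, height)
    reader_fill = luma_fill_ratio(libva_readback, width, height)
    if kernel_fill < threshold:
        return "kernel-did-not-write"   # look at wire-protocol differences
    if reader_fill < threshold:
        return "libva-mis-reads"        # look at mmap/offset/cache
    return "both-filled"


# Synthetic stand-ins for a tiny 64x32 frame: black video luma (0x10) vs zeros.
W, H = 64, 32
filled = bytes([0x10]) * (W * H)
empty = bytes(W * H)

print(classify(empty, empty, W, H))
print(classify(filled, empty, W, H))
print(classify(filled, filled, W, H))
```

Video-range black is 0x10 rather than 0x00, which is why counting non-zero luma bytes works for this corpus even though kdirect's frame 1 is black.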
## Substrate state at Phase 3 close
- Fork tip `6df2159` (iter7), backend `520507f6…`. Unchanged.
- Kernel `linux-fresnel-fourier 7.0-1`. Unchanged.
- Artifacts at fresnel `/tmp/iter8_p3/`: libva strace, kdirect strace, libva H.264 YUV (broken), kdirect H.264 YUV (correct), VP9/HEVC YUV for codec comparison.