Phase 0 amendment: hantro writes zeros, sentinel test cache-buggy

Re-baselined libva-v4l2-request decode path with kernel-side
observability (ftrace v4l2/vb2/dma_fence + dmesg + dynamic_debug)
and visual disambiguator (mpv --vo=gpu in operator's live Plasma
session).

Findings:

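1. Kernel reports successful CAPTURE buffer write every frame:
   ftrace vb2_buf_done shows bytesused=3655712 (full NV12 1920x1088
   = 3133440 bytes, plus 522272 bytes of extra per-buffer space
   that matches hantro's H.264 motion-vector area, 64 bytes per
   macroblock + 32, rather than tile padding). dmesg completely
   silent: no hantro/vpu/decode/error/warn messages.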

2. Visual disambiguator: mpv --hwdec=vaapi-copy --vo=gpu shows a
   solid GREEN frame; --hwdec=vaapi --vo=gpu shows solid BLUE.
   Neither shows the sentinel mid-beige (NV12 Y=0xab,UV=0xab would
   render cream). Both colors are consistent with the kernel
   writing all-zero NV12 (Y=0,UV=0 → green via BT.709 limited; same
   buffer GL-imported as DMA-BUF with different colorspace → blue).

3. Patch 0011 sentinel test has a cache-coherency bug: writes
   0xab via cached surface_object->destination_map[0] mmap, never
   invalidates cache before readback. So the readback always
   shows the stale sentinel even when kernel DMA-overwrote it
   with zeros. vaapi-copy and Mesa DMA-BUF GL import correctly
   invalidate cache and see the real (zero) contents.

This is the second correction to the Phase 0 verdict in one day:
- Original commit f15ba8b ("the 2026-04-26 picture holds") was
  wrong: clean contract trace, never checked pixel content.
- Revised commit e892cea ("kernel produces no decoded pixel
  output, sentinel survives") was half right: kernel does write,
  writes zeros, and the sentinel test was reading stale cache.
- Now: kernel writes ALL ZEROS to the CAPTURE buffer. Hantro is
  silently failing the bitstream parse or some control validation.

This is consistent with patch 0011's own commit message hypothesis:
"All zeros → kernel did write 0x00s (overwriting our sentinel),
and the apparent 'no picture' output is the kernel-side decode
actually producing zeros (e.g. parser rejected the bitstream)."
That hypothesis was right; we just couldn't confirm it via the
sentinel test (cache bug) and went down the wrong rabbit hole.

Phase 6 direction sharpens substantially. Bug isn't "we can't
engage hantro" — it's "hantro engages but its parser produces
zeros." Bisect the control submission: VIDIOC_G_EXT_CTRLS
readback to verify writes stick, diff against FFmpeg's
v4l2_request_h264.c (proven working on hantro), verify SPS
completeness, resolve patch 0008's slice_header bit_size open
question, dyndbg the hantro module, etc. Phase 1 boolean-
correctness criterion needs a working pixel-content check before
lock; fix patch 0011's cache sync first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 11:39:42 +00:00
parent e892cea858
commit 365764fffb
6 changed files with 487 additions and 8 deletions
@@ -0,0 +1,115 @@
# Phase 0 deliverable #1 + #4 amendment — kernel-side observability (2026-05-04)
In-session re-baseline of the libva-v4l2-request decode path with kernel-side observability (ftrace v4l2/vb2/dma_fence events + dmesg + dynamic_debug on v4l2-core/videobuf2 + a real-VO visual disambiguator). Supersedes the userspace-only baseline from `phase0_evidence/2026-05-04/findings.md`.
## Verdict
**Hantro accepts the request, reports successful buffer write, and produces all-zero output.** The patch-0011 sentinel test that we previously read as "kernel never wrote the buffer" was misleading due to a cache-coherency bug in the patch's mmap read — kernel DMA-writes the buffer (with zeros), and the cached userspace mmap continues to show the stale sentinel `0xab` until something invalidates the cache. Tools that DO invalidate (vaapi-copy via libva's vaMapBuffer, GL DMA-BUF import via Mesa) see the real contents (zeros).
**This is the second correction to the Phase 0 verdict in one day**:
- Original commit `f15ba8b`: "the 2026-04-26 picture holds". Wrong: the contract trace was clean, but pixel content wasn't checked.
- Revised commit `e892cea`: "kernel produces no decoded pixel output, sentinel survives" — half right; kernel does write, but writes zeros, and the sentinel test was reading stale cache.
- Now (this finding): **kernel writes, writes ALL ZEROS, hantro is silently failing the bitstream parse or some control validation.**
The actual bug surface narrows substantially. Phase 6 work is now bisecting the control-submission / bitstream-parse path, not investigating "why doesn't kernel get called."
## Evidence
### ftrace v4l2/vb2 events (132 lines, full mpv vaapi-copy --frames=2 lifecycle)
For every CAPTURE_MPLANE QBUF/DQBUF cycle, hantro reports:
```
vb2_buf_done: ... index = 0, type = 9, bytesused = 3655712, timestamp = 1777893356955599000
```
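`type=9` is `V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE`, and `bytesused=3655712` is the 1920×1088 NV12 frame (3133440 bytes) plus 522272 bytes of extra per-buffer space. That remainder equals 64 bytes per macroblock plus 32, which matches hantro's appended H.264 motion-vector area rather than tile padding. **Kernel signals a successful full-buffer write.** Identical signature for every one of 10 CAPTURE buffers in the trace.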
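The `bytesused` value can be decomposed by hand. The sketch below assumes 16×16 macroblocks and an extra area of 64 bytes per macroblock plus 32 bytes, which is how I recall hantro sizing its H.264 CAPTURE buffers; both function names are hypothetical.

```python
def nv12_size(width: int, height: int) -> int:
    """Bytes for one 8-bit NV12 frame: full-res Y plane + half-size interleaved UV."""
    return width * height * 3 // 2


def h264_extra_size(width: int, height: int) -> int:
    """Assumed extra area: 64 bytes per 16x16 macroblock, plus 32 bytes."""
    mbs = (width // 16) * (height // 16)
    return 64 * mbs + 32


frame = nv12_size(1920, 1088)        # 3133440
extra = h264_extra_size(1920, 1088)  # 522272
print(frame, extra, frame + extra)   # 3133440 522272 3655712
```

If the sum matches the traced `bytesused`, the kernel is reporting the full `sizeimage` as payload, which says nothing about whether the picture area holds decoded pixels.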
OUTPUT_MPLANE QBUF for each frame shows realistic per-slice sizes: 6272 (IDR), 108 (small P), 109, 144299 (B-frame size), 117, 122, etc. Real H.264 bitstream, real per-frame sizes.
`v4l2_dqbuf` reports `bytesused=0` to userspace — that's the multi-plane top-level field which is always 0; the per-plane bytesused lives in `m.planes[]` and isn't traced by these events.
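The claim that every CAPTURE buffer carries the same signature can be checked mechanically instead of by eye. This is a minimal sketch that assumes every `vb2_buf_done` line has its fields laid out like the line quoted above; `capture_bytesused` is a hypothetical helper name.

```python
import re

# Matches the vb2_buf_done payload format quoted above.
BUF_DONE = re.compile(
    r"vb2_buf_done:.*?index = (\d+), type = (\d+), bytesused = (\d+)"
)

V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE = 9


def capture_bytesused(trace_text: str) -> list[tuple[int, int]]:
    """Return (index, bytesused) for each CAPTURE_MPLANE vb2_buf_done event."""
    out = []
    for line in trace_text.splitlines():
        m = BUF_DONE.search(line)
        if m and int(m.group(2)) == V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE:
            out.append((int(m.group(1)), int(m.group(3))))
    return out


sample = (
    "vb2_buf_done: ... index = 0, type = 9, bytesused = 3655712, "
    "timestamp = 1777893356955599000"
)
print(capture_bytesused(sample))  # [(0, 3655712)]
```

Running it over `ftrace.txt` and collecting the distinct `bytesused` values into a set would confirm (or refute) that all CAPTURE completions report the same size.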
### dmesg — completely silent
```
$ grep -iE "hantro|vpu|decode|fail|error|reject|einval|warn" dmesg.txt | wc -l
0
```
No kernel-side error indication. If hantro's bitstream handling rejects something, it doesn't log it at the default log level. (Worth turning on full dynamic_debug for the verisilicon driver in a future probe.)
### Visual disambiguator — the load-bearing data point
After the prior commit (`e892cea`) declared the sentinel test conclusive, the operator ran two real-VO tests in their live Plasma 6 Wayland session:
| Variant | Decode path | Render path | Result |
|---|---|---|---|
| 1 | `--hwdec=vaapi-copy` (vaMapBuffer copies CAPTURE → system memory, then GL upload) | `--vo=gpu` | **Solid GREEN frame** |
| 2 | `--hwdec=vaapi` (CAPTURE stays as DMA-BUF) | `--vo=gpu` (Mesa-imports DMA-BUF as GL texture) | **Solid BLUE frame** |
Solid green for variant 1 means: NV12 luma=0, chroma=0, BT.709 limited-range expansion → green is the canonical "all-zeros NV12 displayed as RGB" color. Solid blue for variant 2 means same buffer, different colorspace/format interpretation by Mesa's DMA-BUF importer. **Neither shows the sentinel mid-beige (`0xab` on both Y and UV would render as cream/beige), and neither shows real bunny pixels.**
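The green reading can be reproduced numerically. The sketch below applies the standard 8-bit BT.709 limited-range matrix with coefficients rounded to four decimals; it assumes that is the conversion mpv used for this 1080p stream, which was not verified in this run.

```python
def bt709_limited_to_rgb(y: int, cb: int, cr: int) -> tuple[int, int, int]:
    """Convert one 8-bit limited-range BT.709 YCbCr sample to 8-bit RGB."""
    luma = 1.1644 * (y - 16)
    r = luma + 1.7927 * (cr - 128)
    g = luma - 0.2132 * (cb - 128) - 0.5329 * (cr - 128)
    b = luma + 2.1124 * (cb - 128)
    # Clamp to the displayable 0..255 range after rounding.
    return tuple(max(0, min(255, round(v))) for v in (r, g, b))


print(bt709_limited_to_rgb(0, 0, 0))  # (0, 77, 0)
```

All-zero input lands on a pure mid-green, consistent with variant 1. The same function gives (255, 148, 255) for `0xab` on all three components, which is closer to pink than to beige, so the sentinel color named above is best read as approximate. It is still clearly distinct from both observed colors.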
This rules out "decode produces real pixels" (would be bunny). It rules out "kernel never wrote buffer" (would be sentinel beige in variant 1, since vaapi-copy does its own cache invalidation and would see whatever's in physical memory — and the strace already showed the sentinel was the last thing written there pre-DMA).
The only remaining explanation: **kernel DMA-writes zeros to the buffer**, which overwrites the sentinel in physical memory. vaapi-copy and Mesa DMA-BUF import both correctly cache-invalidate and see the zeros. Patch 0011's sentinel-readback in libva-v4l2-request's `picture.c::RequestEndPicture` reads from a stale cache view — sees the sentinel that's no longer in physical memory.
## Cache-coherency bug in patch 0011 — root cause
Patch 0011 instruments `src/picture.c::RequestEndPicture` to write `0xab × 32` into `surface_object->destination_map[0]` immediately before VIDIOC_QBUF. The mmap pointer was set up by libva-v4l2-request's `surface.c` as a `MAP_SHARED` `mmap()` of the `V4L2_MEMORY_MMAP` buffer on the video device fd. On hantro/RK3568 (CMA-backed, dma-contig allocator), this mmap appears to be **CACHED by default unless the queue requests V4L2_MEMORY_DMABUF or the driver applies non-cached attrs** (an inference from the stale readback; not yet checked against the allocator's mmap attributes). After QBUF → kernel DMA write → DQBUF, the userspace cached view is stale relative to physical memory.
Patch 0010's hex-dump (which reads back the buffer for the DEBUG output) reads from the same cached mmap — which is why the dump consistently shows `ab ab ab ab ...`: the cache line containing the first 32 bytes was filled by the userspace 0xab write and never invalidated.
The fix for the test (so it stops misleading future probes):
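- `msync(p, N, MS_SYNC | MS_INVALIDATE)` after DQBUF before reading (untested; on Linux `msync` operates on the page cache, so it may not invalidate CPU caches for a device mapping)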
- OR call DMA-BUF cache sync via `DMA_BUF_IOCTL_SYNC` if using EXPBUF'd fd
- OR allocate via V4L2_MEMORY_DMABUF instead of MMAP (changes the ownership model — different fix)
- OR use VIDIOC_PREPARE_BUF with cache-flush flags (not supported by hantro M2M IIUC)
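The `DMA_BUF_IOCTL_SYNC` option can be sketched as follows. The real fix would go into the C code in `picture.c`; this Python version only shows the ioctl encoding and the start/end bracketing around a CPU read. The flag values are from `<linux/dma-buf.h>`, and the ioctl number assumes the generic `_IOW` encoding used on x86 and arm64. `read_coherent` is a hypothetical helper, and `dmabuf_fd` is assumed to come from `VIDIOC_EXPBUF`.

```python
import fcntl
import struct

# Flag values from <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2


def _iow(type_char: str, nr: int, size: int) -> int:
    """Generic Linux _IOW(type, nr, size) encoding (assumed x86/arm64 layout)."""
    return (1 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# struct dma_buf_sync { __u64 flags; } is 8 bytes.
DMA_BUF_IOCTL_SYNC = _iow("b", 0, 8)


def read_coherent(dmabuf_fd: int, view, nbytes: int) -> bytes:
    """Bracket a CPU read of a mapped DMA-BUF with SYNC_START / SYNC_END.

    view is assumed to be an mmap of dmabuf_fd covering at least nbytes.
    """
    fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC,
                struct.pack("Q", DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ))
    try:
        return bytes(view[:nbytes])
    finally:
        # Always close the access window, even if the copy raises.
        fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC,
                    struct.pack("Q", DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ))
```

The C equivalent is two `ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync)` calls around the readback, placed after DQBUF. Whether this changes what the sentinel readback shows on hantro still has to be tested.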
The fix doesn't matter for production — patches 0010/0011 are explicitly DEBUG-only (PKGBUILD comments say "Removed before upstream submission"). What matters is **the test was load-bearing for diagnosis and was wrong**. The whole cascade of "decode is broken" → "we need kernel observability" → this finding came from trusting it.
## What hantro is actually doing (and not doing)
Hantro's H.264 frontend likely does this for our requests:
1. Accepts the V4L2_BUF + REQUEST_FD pair
2. Reads the V4L2_CTRL_TYPE_H264_* controls from the request
3. Validates them; if validation fails silently, falls through
4. Parses the OUTPUT slice
5. If parse fails (e.g. SPS/PPS mismatch with slice data, scaling matrix shape wrong, decode_mode wrong), the hardware pipeline still runs but produces zeros — because the decode "kernel" (the actual silicon pipeline) gets garbage matrix coefficients or stops at slice header
6. `vb2_set_plane_payload(buf, 0, full_buffer_size)` is called regardless — explaining the bytesused=3655712 signal even though no real pixels were written
This is consistent with what patch 0011's commit message anticipated:
> All zeros → kernel did write 0x00s (overwriting our sentinel), and the apparent "no picture" output is the kernel-side decode actually producing zeros (e.g. parser rejected the bitstream).
That hypothesis was **right**; we just couldn't confirm it via the sentinel test (cache bug) and went down the wrong rabbit hole.
## Phase 6 priority list (substantially narrowed)
The bug isn't "we can't engage hantro." The bug is "hantro engages but its parser produces zeros." Bisect the control submission:
1. **Use `VIDIOC_G_EXT_CTRLS` to read controls back from the request fd before QUEUE.** Compare against what `0008-h264-fill-decode-params-from-vaapi.patch` populated. Any discrepancy = our writes aren't sticking.
2. **Compare the per-frame control set against FFmpeg's `v4l2_request_h264.c`** (proven working on hantro, per Bootlin's downstream branch `v4l2-request-n8.1`). Diff: which fields does FFmpeg set that we don't?
3. **Verify SPS submission:** the V4L2_CTRL_TYPE_H264_SPS control needs the full SPS struct. Patch 0013/0018 set level_idc; verify other SPS fields (`profile_idc`, `seq_parameter_set_id`, `log2_max_frame_num_minus4`, `pic_order_cnt_type`, `log2_max_pic_order_cnt_lsb_minus4`, `max_num_ref_frames`, etc.). Not all of these are carried directly in VAAPI's `VAPictureParameterBufferH264`, so we need to derive them from the bitstream or rely on VAAPI passing them.
4. **DECODE_PARAMS bit_size fields** (patch 0008's open question, never resolved): if hantro uses these for slice header parsing in FRAME_BASED mode, our zeros could trigger silent reject. FFmpeg sets these by parsing the slice header bit-precisely.
5. **POC sentinel** — re-run with VIDIOC_G_EXT_CTRLS readback after patch-0015's strip to confirm the values reach V4L2 as 0/0, not 65536/65536.
6. **`reference_ts` for the IDR frame**: even though IDRs have no references, hantro may still validate the field. Should be 0 / monotonic from frame 0.
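Items 1 and 2 both reduce to a field-by-field comparison of two control sets. Once the values have been dumped (from a `VIDIOC_G_EXT_CTRLS` readback or from an FFmpeg trace), the comparison itself could look like the sketch below. `diff_controls` is a hypothetical helper, and the sample values are placeholders for illustration, not taken from any trace.

```python
def diff_controls(ours: dict, reference: dict) -> dict:
    """Return {field: (our_value, reference_value)} for every mismatch.

    A field missing on one side is reported with None on that side.
    """
    diffs = {}
    for field in sorted(set(ours) | set(reference)):
        a, b = ours.get(field), reference.get(field)
        if a != b:
            diffs[field] = (a, b)
    return diffs


# Placeholder values for illustration only; not taken from any trace.
ours = {"profile_idc": 100, "level_idc": 40, "max_num_ref_frames": 0}
ref = {"profile_idc": 100, "level_idc": 40, "max_num_ref_frames": 4,
       "pic_order_cnt_type": 0}
for field, (a, b) in diff_controls(ours, ref).items():
    print(f"{field}: ours={a} reference={b}")
```

Reporting missing fields as `None` keeps "we set it to the wrong value" separate from "we never set it", which is the distinction item 2 asks about.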
## Next observable I should add to the rig
For the next decode probe:
- Enable `dyndbg="file drivers/media/platform/verisilicon/* +pmflt"` on the hantro module to surface any compiled-in `dev_dbg` calls.
- (If module reload is acceptable) reload hantro with `dyndbg` set at module-insert time.
- Add ftrace function-graph for `hantro_*` symbols to see the kernel decode path.
- Add a v4l2-ctl based reproducer (smaller and easier to instrument than mpv): feed bbb_h264.h264 (Annex B raw) directly via `v4l2-ctl --stream-out-mmap` — but the request-API shape makes this awkward. May need to write a small C reproducer.
## Phase 0 status (re-revised)
- **#1 Re-verify failure-mode finding** — ✗ kernel side completes successfully but produces zero-pixel output. Bitstream parser silently fails. Original 2026-04-26 STUDY claim was wrong at the pixel-content layer.
- **#2 Step 1 reconciliation** — ✓ done in commit `74b3793`.
- **#3 Firefox engagement** — ✓ engages libva in live session; same zero-pixel decode failure as mpv. Falls back to SW after frame 0.
- **#4 Phase 0 baseline anchor** — partial: userspace contract trace ✓, kernel-side ftrace ✓ (this run), but the patch-0011 sentinel test we relied on for pixel-content verification is buggy. **Need to fix or replace the pixel-content check** before Phase 1 lock.
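A replacement pixel-content check could classify the readback rather than hex-dump it. This is a minimal sketch: it assumes the bytes were obtained through a coherent read path and cover the region that was sentinel-filled. `classify_capture` and its return labels are hypothetical.

```python
SENTINEL = 0xAB


def classify_capture(data: bytes, sentinel: int = SENTINEL) -> str:
    """Classify a CAPTURE buffer readback.

    Returns "empty", "sentinel" (kernel did not overwrite), "zeros"
    (kernel wrote, but no picture), or "content" (anything else).
    """
    if not data:
        return "empty"
    if data.count(sentinel) == len(data):
        return "sentinel"
    if data.count(0) == len(data):
        return "zeros"
    return "content"


print(classify_capture(bytes([0xAB]) * 32))  # sentinel
print(classify_capture(bytes(32)))           # zeros
```

Only a `"content"` result would count toward the Phase 1 boolean-correctness criterion. `"zeros"` and `"sentinel"` are both failures, but they point at different layers.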
## Artifacts
- `ftrace.txt` — 132 lines, v4l2/vb2/dma_fence events for the mpv vaapi-copy --frames=2 run
- `dmesg.txt` — 63 lines, no hantro/vpu/decode/error/warn messages
- `mpv.stderr` — 40 lines, the .so DEBUG dumps (10 sentinel-survivor CAPTURE reads, all of which we now know are cache-stale)
- `mpv.stdout` — 107 lines, normal mpv playback log (`Using hardware decoding (vaapi-copy).`)