Phase 0 amendment: hantro writes zeros, sentinel test cache-buggy

Re-baselined libva-v4l2-request decode path with kernel-side
observability (ftrace v4l2/vb2/dma_fence + dmesg + dynamic_debug) and
visual disambiguator (mpv --vo=gpu in operator's live Plasma session).

Findings:

1. Kernel reports successful CAPTURE buffer write every frame: ftrace
   vb2_buf_done shows bytesused=3655712 (full NV12 1920x1088 + hantro
   tile padding). dmesg completely silent: no
   hantro/vpu/decode/error/warn messages.

2. Visual disambiguator: mpv --hwdec=vaapi-copy --vo=gpu shows a solid
   GREEN frame; --hwdec=vaapi --vo=gpu shows solid BLUE. Neither shows
   the sentinel mid-beige (NV12 Y=0xab,UV=0xab would render cream).
   Both colors are consistent with the kernel writing all-zero NV12
   (Y=0,UV=0 → green via BT.709 limited; same buffer GL-imported as
   DMA-BUF with different colorspace → blue).

3. Patch 0011 sentinel test has a cache-coherency bug: writes 0xab via
   cached surface_object->destination_map[0] mmap, never invalidates
   cache before readback. So the readback always shows the stale
   sentinel even when kernel DMA-overwrote it with zeros. vaapi-copy
   and Mesa DMA-BUF GL import correctly invalidate cache and see the
   real (zero) contents.

This corrects the previous Phase 0 verdicts twice in one day:

- Original commit f15ba8b ("the 2026-04-26 picture holds") was wrong:
  clean contract trace, never checked pixel content.
- Revised commit e892cea ("kernel produces no decoded pixel output,
  sentinel survives") was half right: kernel does write, writes zeros,
  and the sentinel test was reading stale cache.
- Now: kernel writes ALL ZEROS to the CAPTURE buffer. Hantro is
  silently failing the bitstream parse or some control validation.

This is consistent with patch 0011's own commit message hypothesis:

  "All zeros → kernel did write 0x00s (overwriting our sentinel), and
  the apparent 'no picture' output is the kernel-side decode actually
  producing zeros (e.g. parser rejected the bitstream)."

That hypothesis was right; we just couldn't confirm it via the sentinel
test (cache bug) and went down the wrong rabbit hole.

Phase 6 direction sharpens substantially. Bug isn't "we can't engage
hantro"; it's "hantro engages but its parser produces zeros." Bisect
the control submission: VIDIOC_G_EXT_CTRLS readback to verify writes
stick, diff against FFmpeg's v4l2_request_h264.c (proven working on
hantro), verify SPS completeness, resolve patch 0008's slice_header
bit_size open question, dyndbg the hantro module, etc.

Phase 1 boolean-correctness criterion needs a working pixel-content
check before lock; fix patch 0011's cache sync first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+30 -8
@@ -146,7 +146,7 @@ Stock Firefox 150.0.1 + `media.ffmpeg.vaapi.enabled=true` + `LIBVA_DRIVER_NAME=v
**Result**: Firefox's RDD process dlopens libva.so.2 + libva-drm.so.2 + libva-x11.so.2 for capability probe then immediately closes them; never reaches `vaInitialize`. Gfx-environment platform-fitness check rejects VAAPI under Xvfb's software-framebuffer-with-no-DRI rig. Not a libva-side fault. Re-test in live session needed.
### Live Plasma Wayland session run — INVERTS PRIOR PHASE 0 VERDICT
### Live Plasma Wayland session run — and follow-up kernel-side disambiguation
Same Firefox profile + LIBVA env, executed inside the operator's active Plasma 6 Wayland session (XDG_SESSION_TYPE=wayland, XDG_RUNTIME_DIR=/run/user/1001). Full write-up: [`phase0_evidence/2026-05-04-firefox-live/findings.md`](phase0_evidence/2026-05-04-firefox-live/findings.md).
@@ -169,14 +169,36 @@ Phase 0 deliverable status corrections:
- **#3** (Firefox configuration end-to-end) — ✓ engagement confirmed in live Plasma session; pixel-content failure mode identical to mpv.
- **#4** (Phase 0 baseline anchor) — ✗ **AMENDED**: captured trace describes Step 1's userspace behaviour, not the kernel-side spec Phase 6 must reproduce.
**Phase 1 lock should be deferred** until: (a) the boolean-correctness criterion is sharpened to require pixel-content verification (sentinel-overwrite check, NV12 luma min/max sanity, etc.), and (b) Phase 0 includes a kernel-side observability layer (ftrace `events/v4l2/`, `dmesg` for silent decode errors) so we can characterize *why* hantro is silent. The Step 1 18-patch series engages libva but doesn't make hantro decode — Phase 6 has substantive work.
### Kernel-side re-baseline (2026-05-04) — corrects the prior verdict AGAIN
Likely failure-mode candidates (priority order, from patch comments):
1. `reference_ts` not propagated (per patch-0017 commit body: "hantro doesn't read pic_num, uses reference_ts")
2. DECODE_PARAMS slice_header bit_size fields all zero (patch 0008's open question, never resolved)
3. POC sentinel still leaking past patch-0015's strip (DEBUG dump runs *before* the strip; need post-strip verification via VIDIOC_G_EXT_CTRLS)
4. level_idc over-allocation interaction (patch 0013 → 0018 transition)
5. `V4L2_EVENT_SOURCE_CHANGE` not handled (open Q #5)
ftrace v4l2/vb2/dma_fence + dmesg + dynamic_debug enabled while running mpv `--hwdec=vaapi-copy --frames=2`. Full write-up: [`phase0_evidence/2026-05-04-kernel-trace/findings.md`](phase0_evidence/2026-05-04-kernel-trace/findings.md).
| Layer | Result |
|---|---|
| ftrace `vb2_buf_done` for CAPTURE_MPLANE | **`bytesused=3655712`** (full NV12 + hantro tile padding) reported every frame. **Kernel signals successful full-buffer write.** |
| dmesg | Completely silent. No hantro/vpu/decode/fail/error/reject/einval/warn. |
| Real-VO disambiguator (operator inspection in live session) | `--hwdec=vaapi-copy --vo=gpu`: **solid GREEN frame**. `--hwdec=vaapi --vo=gpu`: **solid BLUE frame**. NV12-with-Y=0,UV=0 BT.709-converted = green; same buffer via DMA-BUF GL import with different colorspace = blue. **Neither shows the sentinel mid-beige pattern; neither shows real bunny pixels.** |
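
The two numeric claims in the table above can be re-derived by hand. The sketch below recomputes the NV12 payload for 1920x1088 and converts an all-zero NV12 sample to RGB. Assumptions that do not come from the trace: 16x16 macroblocks, BT.709 limited-range coefficients rounded to three decimals, and the helper names `nv12_size` and `yuv_to_rgb_bt709_limited`, which are hypothetical.

```python
def nv12_size(width: int, height: int) -> int:
    """Bytes in a tightly packed NV12 frame: full-res Y plus half-res interleaved UV."""
    return width * height * 3 // 2


def yuv_to_rgb_bt709_limited(y: int, cb: int, cr: int) -> tuple[int, int, int]:
    """Convert one 8-bit limited-range BT.709 sample to 8-bit RGB, clamped to 0..255."""
    def clamp(v: float) -> int:
        return max(0, min(255, round(v)))

    luma = 1.164 * (y - 16)
    r = luma + 1.793 * (cr - 128)
    g = luma - 0.213 * (cb - 128) - 0.533 * (cr - 128)
    b = luma + 2.112 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


BYTESUSED = 3655712                        # from the ftrace vb2_buf_done line
payload = nv12_size(1920, 1088)            # plain NV12 pixels
extra = BYTESUSED - payload                # whatever the driver appends
macroblocks = (1920 // 16) * (1088 // 16)  # assumption: 16x16 macroblocks

print(payload, extra, macroblocks)         # 3133440 522272 8160
print(extra == 64 * macroblocks + 32)      # True
print(yuv_to_rgb_bt709_limited(0, 0, 0))   # (0, 77, 0)
```

Under these assumptions the extra 522272 bytes come to exactly 64 bytes per macroblock plus 32, so the reported size is consistent with a full frame plus per-macroblock driver data, and an all-zero NV12 sample lands on a dark green, which matches the `vaapi-copy` observation. The same function can be fed the `0xab` sentinel to see which colour an untouched buffer should produce.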
**Corrected verdict**: hantro accepts the request, returns success, **and writes ALL ZEROS to the CAPTURE buffer**. The patch-0011 sentinel test we relied on is misleading — it has a **cache-coherency bug**. Patch 0011 writes `0xab` via cached `surface_object->destination_map[0]` mmap, but neither `0010-DEBUG-hex-dump` nor any other read path in libva-v4l2-request invalidates the cache after DQBUF. So the readback always shows the stale sentinel, hiding the fact that the kernel DMA-overwrote it with zeros. vaapi-copy and Mesa DMA-BUF GL import correctly invalidate cache and see the real (zero) contents.
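
For reference, the usual way to avoid this kind of stale readback on a DMA-BUF is to bracket CPU access with `DMA_BUF_IOCTL_SYNC`. The sketch below shows that bracket as a context manager. It assumes the CAPTURE buffer has been exported as a DMA-BUF (for example via `VIDIOC_EXPBUF`) and that its fd is at hand; the source does not establish that this is how patch 0011 should be fixed. The flag values come from `<linux/dma-buf.h>`, the ioctl number is computed for the common Linux ioctl encoding (x86, arm, arm64), and `cpu_read_access` is a hypothetical name.

```python
import fcntl
import struct
from contextlib import contextmanager

# Flag values from <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2

# _IOW('b', 0, struct dma_buf_sync): write direction, 8-byte argument.
# Assumption: the generic ioctl bit layout used on x86/arm/arm64.
DMA_BUF_IOCTL_SYNC = (1 << 30) | (8 << 16) | (ord("b") << 8) | 0


def _sync(dmabuf_fd: int, flags: int) -> None:
    # struct dma_buf_sync { __u64 flags; }
    fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, struct.pack("=Q", flags))


@contextmanager
def cpu_read_access(dmabuf_fd: int):
    """Bracket CPU reads of a DMA-BUF so the kernel can do cache maintenance."""
    _sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
    try:
        yield
    finally:
        _sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)


print(hex(DMA_BUF_IOCTL_SYNC))  # 0x40086200
```

A readback would then sit inside `with cpu_read_access(fd): ...`, after DQBUF and before any byte of the mapping is inspected.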
**Bug surface narrows substantially.** The path is:
- libva engagement: ✓
- Contract trace: ✓ no EINVAL, all ioctls succeed
- Hantro request acceptance: ✓ kernel reports success
- **Hantro produces meaningful pixel output: ✗ writes ALL ZEROS** — almost certainly the bitstream parser silently rejects something (per patch-0011's own commit-message hypothesis: "the apparent 'no picture' output is the kernel-side decode actually producing zeros, e.g. parser rejected the bitstream")
This is consistent with a control-submission bug (something in SPS/PPS/DECODE_PARAMS is off), not a fundamental "we can't drive hantro" problem. Phase 6 work direction sharpens accordingly.
### Phase 6 priority list (revised after kernel-side baseline)
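1. **Fix the patch-0011 sentinel test** (or replace it). Add a DMA-BUF cache sync (`DMA_BUF_IOCTL_SYNC` on the exported buffer) before the readback; `msync(MS_SYNC|MS_INVALIDATE)` is a weaker candidate, since it is specified for file-backed mappings and is not documented to perform CPU cache maintenance for DMA buffers. Without this, future debugging is unreliable in exactly the same way.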
2. **VIDIOC_G_EXT_CTRLS readback** of the request fd before QUEUE — confirms our writes actually stick at the V4L2 layer (e.g. POC sentinel actually stripped to 0 by patch-0015, level_idc actually set, etc.).
3. **Diff our per-frame control set against FFmpeg's `v4l2_request_h264.c`** (proven working on hantro, downstream branch `code.ffmpeg.org/Kwiboo/FFmpeg.git v4l2-request-n8.1`). Identify any field FFmpeg sets that we don't.
4. **Verify SPS submission completeness**: VAAPI's `VAPictureParameterBufferH264` doesn't carry the full SPS — we may need to derive `profile_idc` / `seq_parameter_set_id` / `log2_max_frame_num_minus4` / `pic_order_cnt_type` / `log2_max_pic_order_cnt_lsb_minus4` / `max_num_ref_frames` from VAAPI fields or by parsing the slice header.
5. **DECODE_PARAMS slice_header bit_size fields** (patch 0008's never-resolved question): if hantro requires them for parse, our zeros could be the silent-reject trigger.
6. **dyndbg on hantro module**: reload with `dyndbg="file drivers/media/platform/verisilicon/* +pmflt"` to surface compiled-in `dev_dbg` calls for the next probe.
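
Items 2 and 3 above both reduce to comparing two per-frame control dumps: what was intended against what reads back, or ours against FFmpeg's. A minimal sketch of that comparison follows, assuming each dump has already been flattened into a `{field: value}` mapping. The function name `diff_controls` and all values are hypothetical; the field names follow the V4L2 stateless H.264 control structs.

```python
def diff_controls(ours: dict[str, int], reference: dict[str, int]) -> dict[str, list]:
    """Compare two flat {field: value} dumps of one frame's controls."""
    missing = sorted(k for k in reference if k not in ours)
    extra = sorted(k for k in ours if k not in reference)
    differing = sorted(
        (k, ours[k], reference[k])
        for k in ours.keys() & reference.keys()
        if ours[k] != reference[k]
    )
    return {"missing": missing, "extra": extra, "differing": differing}


# Hypothetical dumps: values are invented for illustration only.
ours = {
    "sps.level_idc": 40,
    "decode_params.dec_ref_pic_marking_bit_size": 0,
    "decode_params.pic_order_cnt_bit_size": 0,
}
reference = {
    "sps.level_idc": 40,
    "sps.max_num_ref_frames": 4,
    "decode_params.dec_ref_pic_marking_bit_size": 2,
    "decode_params.pic_order_cnt_bit_size": 12,
}

report = diff_controls(ours, reference)
print(report["missing"])    # ['sps.max_num_ref_frames']
print(report["differing"])  # two bit_size fields, ours=0 in both
```

Fields that are zero on our side but non-zero in the reference are the first candidates for a silent-reject trigger, which is why the `differing` entries carry both values.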
Phase 1 boolean-correctness criterion now must include pixel-content verification, but the verification can't rely on patch 0011 in its current form. Either fix patch 0011's cache sync, or use a different check: e.g. mpv `--vo=image` (which writes each frame to an image file) and inspect the dumped frame, or a small C reproducer that maps the buffer with proper cache flags and computes a luma histogram.
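
The luma-histogram check mentioned above can be prototyped offline on a dumped buffer before committing to a C reproducer. The sketch below classifies the Y plane of an NV12 buffer as all-zero, sentinel-intact, flat, or varied. It assumes the bytes were read through a coherent path, that the Y plane starts at offset 0 with the given stride, and that `0xab` is the sentinel; `classify_luma` and its labels are hypothetical.

```python
from collections import Counter


def classify_luma(frame: bytes, width: int, height: int, stride: int,
                  sentinel: int = 0xAB) -> str:
    """Classify the Y plane of an NV12 buffer by its value distribution."""
    hist = Counter()
    for row in range(height):
        start = row * stride
        # Only the visible width counts; stride padding is skipped.
        hist.update(frame[start:start + width])
    lo, hi = min(hist), max(hist)
    if lo == hi == 0:
        return "all-zero"
    if lo == hi == sentinel:
        return "sentinel-intact"
    if lo == hi:
        return f"flat-0x{lo:02x}"
    return f"varied(min={lo}, max={hi}, distinct={len(hist)})"


print(classify_luma(bytes(64), width=8, height=4, stride=16))       # all-zero
print(classify_luma(b"\xab" * 64, width=8, height=4, stride=16))    # sentinel-intact
print(classify_luma(bytes(range(64)), width=8, height=4, stride=16))
```

Only the `varied` outcome is compatible with a real decoded picture, so a boolean pass criterion could be as simple as requiring it on at least one frame. That threshold is a suggestion, not something the trace establishes.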
## Source-read references (carry-over from STUDY.md)