iter5 close — HW reports DEC_RDY but CAPTURE is uniform black

Post-iter3+iter4 patches: per-frame S_EXT_CTRLS succeeds, OUTPUT
bitstream is byte-identical to raw HEVC, hardware IRQ reports
STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0 on every decode, zero
IOMMU faults. But γ-dump shows CAPTURE plane[0]=uniform 0x10 (Y),
plane[1]=uniform 0x80 (CbCr) — video black.

Leading hypothesis for iter6: cache coherency between hardware-
written DMA buffer and userspace cached mmap — same pattern as
RK3399 documented in feedback_rockchip_pixel_verify_path. Iter6
falsifier: VAExportSurfaceHandle → DMA-BUF → DMA_BUF_IOCTL_SYNC,
read. If real content visible, coherency confirmed.

Three open kernel-agent issues: #14 (iter3, verified), #15 (iter4,
verified), #16 TBD (iter5 finding).

Substrate: ampere kernel carries iter3 + iter4 + iter5 IRQ pr_warn.
Backend .so unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Markus Fritsche
2026-05-16 11:43:59 +00:00
parent 46c956bd51
commit 1594def84e
+53
@@ -0,0 +1,53 @@
# Iter5 close — hardware reports DEC_RDY but CAPTURE buffer is uniform black
Date: 2026-05-16 (afternoon, immediately following iter4 close)
Branch: `master`
Substrate: ampere `7.0.0-rc3-devices+` with iter3 + iter4 kernel patches applied. Backend iter3 instrumented build (md5 `404041ea2dcc03c769e0ab8c43ddadd6`).
## Bottom line
After iter3 (ext_sps NULL init) + iter4 (HEVC_SLICE_PARAMS registration), the entire HEVC submission pipeline is structurally clean: all 5 controls commit per frame, hardware reports `VDPU381_STA_INT_DEC_RDY_STA=1` on every IRQ, no timeouts, no IOMMU faults, no kernel errors. **And yet** every CAPTURE buffer plane comes back uniform: Y=`0x10` everywhere, CbCr=`0x80` everywhere — i.e. "video black" in NV12 studio-range.
The decoder claims success, the buffer is the right size, the right thing was fed in — but the content is empty.
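The "uniform plane" observation can be checked mechanically on any dumped buffer. The following is a minimal sketch, not the backend's actual γ-dump code: the function names `plane_summary` and `looks_like_nv12_black` are hypothetical, and the sample sizes assume the 20480-byte Y plane reported by the dump plus the standard NV12 layout, where the interleaved CbCr plane is half the Y plane.

```python
def plane_summary(plane: bytes) -> dict:
    """Summarise one dumped plane: size, non-zero count, and uniformity."""
    if not plane:
        raise ValueError("empty plane")
    distinct = set(plane)
    return {
        "size": len(plane),
        "nonzero": len(plane) - plane.count(0),
        "distinct_values": len(distinct),
        # The single byte value if every byte is identical, else None.
        "uniform_value": plane[0] if len(distinct) == 1 else None,
    }


def looks_like_nv12_black(y_plane: bytes, cbcr_plane: bytes) -> bool:
    """True when Y is uniformly 0x10 and CbCr uniformly 0x80 (studio-range black)."""
    y = plane_summary(y_plane)
    c = plane_summary(cbcr_plane)
    return y["uniform_value"] == 0x10 and c["uniform_value"] == 0x80


# Synthetic stand-ins for the dumped planes (assumed sizes, see lead-in).
y = bytes([0x10]) * 20480
cbcr = bytes([0x80]) * 10240
print(plane_summary(y))
print(looks_like_nv12_black(y, cbcr))
```

A non-zero count alone is a weak signal here: a uniformly `0x10` plane is 100% non-zero, so the number of distinct values is the more useful metric.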
## Falsifier outcome
F1 (per-frame S_EXT_CTRLS still rejected post-iter4): **FALSE**. Confirmed via dmesg with `dev_debug=0x3f`. Lines like `VIDIOC_S_EXT_CTRLS: which=0xf010000, count=5, error_idx=4, request_fd=NN` now appear without an `error -22` prefix, i.e. the batch was accepted. `error_idx=4` here is the last-processed index (success), not a failure indicator (which would set error_idx=count=5).
F2 (backend feeds wrong OUTPUT bitstream): **FALSE**. `LIBVA_V4L2_DUMP_OUTPUT=/tmp/iter5_out` dumped per-frame OUTPUT payloads. The frame-1 dump is 280 bytes (`00 00 01 28 01 af 1d 18 68 17 59 55 54 51 34 d2 ...`). Compared against the raw stream extracted with `ffmpeg -c copy -f hevc /tmp/raw.hevc` (mp4 uses a 4-byte length prefix; the first NAL is 277 bytes starting at offset 4 with NAL header `28` = IDR_N_LP), the backend's dump is the Annex-B 3-byte start code (`00 00 01`) prepended to the same 277 NAL bytes: 3 + 277 = 280. The NAL payload is **byte-identical**.
F3 (hardware times out / errors / IOMMU faults): **FALSE** — diagnostic `pr_warn` added to `vdpu381_irq_handler` logged `STA_INT=0x00000107 DEC_RDY=1 TIMEOUT=0 ERROR=0 SOFTRESET=0` for every IRQ across all 15 attempted decodes. Zero `iommu`/`smmu`/`fault` lines in dmesg. Hardware itself reports successful decode.
F4 (CAPTURE buffer reaches userspace with hardware-decoded data): **FALSE**. `LIBVA_V4L2_DUMP_CAPTURE=1` (γ-dump), run immediately after DQBUF + mark_decoded, scans the CPU-visible mmap and finds plane[0] Y = uniform `0x10` (20480/20480 bytes non-zero, but ALL bytes are `0x10`) and plane[1] CbCr = uniform `0x80`. The hardware "wrote successfully" but the CPU side reads back video-black.
F5 (different decode-path verifier sees real content): **inconclusive** — ffmpeg-v4l2request direct path (bypasses libva) fails with EINVAL even post-iter4 patches (separate path with its own control-shape mismatch — out of scope for iter5). mpv `--vo=drm` was blocked by DRM master held by SDDM (couldn't switch consoles inside ssh). DRM_PRIME export path verification pending.
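The F2 comparison above (length-prefixed NAL versus the backend's Annex-B dump) can be sketched as follows. This is an illustrative reconstruction, not the tooling actually used: the helper names are hypothetical, the 4-byte big-endian length prefix is taken from the F2 description, and the NAL payload below is a synthetic stand-in with only the first two header bytes (`28 01`) copied from the dump.

```python
START_CODE_3 = b"\x00\x00\x01"


def split_length_prefixed(sample: bytes, length_size: int = 4) -> list:
    """Split a sample with a big-endian length prefix per NAL into raw NAL units."""
    nals = []
    pos = 0
    while pos < len(sample):
        if pos + length_size > len(sample):
            raise ValueError("truncated length prefix")
        n = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + n > len(sample):
            raise ValueError("NAL length runs past end of sample")
        nals.append(sample[pos:pos + n])
        pos += n
    return nals


def hevc_nal_type(nal: bytes) -> int:
    """HEVC NAL header byte 0: forbidden_zero_bit(1) | nal_unit_type(6) | layer id MSB(1)."""
    return (nal[0] >> 1) & 0x3F


def to_annex_b(nals) -> bytes:
    """Prepend a 3-byte Annex-B start code to each NAL unit."""
    return b"".join(START_CODE_3 + nal for nal in nals)


# Synthetic 277-byte NAL; real payload bytes are not reproduced here.
nal = bytes([0x28, 0x01]) + bytes(275)
sample = len(nal).to_bytes(4, "big") + nal

nals = split_length_prefixed(sample)
dump = to_annex_b(nals)
print(len(dump))               # 280 = 3-byte start code + 277 NAL bytes
print(hevc_nal_type(nals[0]))  # 20 = IDR_N_LP
```

In a real check, `sample` would be the first sample read from the length-prefixed source and `dump` would be compared with `==` against the contents of the frame-1 file written by `LIBVA_V4L2_DUMP_OUTPUT`.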
## Hypotheses for iter6 (ranked)
1. **Cache coherency between hardware-written DMA buffer and userspace cached mmap.** Leading hypothesis. RK3399 has the same pattern documented in `feedback_rockchip_pixel_verify_path.md`: vaDeriveImage / cached-mmap returns all-zero on RK3399 because the userspace mmap is CPU-cached and hardware DMA writes don't invalidate the cache. RK3588 could have the same issue with vb2_dma_contig + cached mmap. The `0x10`/`0x80` content might be stale cache lines left from V4L2 clearing the buffer to the NV12 black baseline (a plain page-allocator zero fill would read `0x00`, not `0x10`/`0x80`). Iter6 falsifier: export the CAPTURE buffer as DMA-BUF (DRM_PRIME via VAExportSurfaceHandle), import into a separate fd, and bracket the read with DMA_BUF_IOCTL_SYNC; if that path shows real decoded content, coherency is confirmed. The fix would be either (a) use coherent mmap mode in the V4L2 driver, or (b) have the backend issue dma-buf sync ioctls before reading the mmap.
2. **Hardware writes to a different physical address than what's mmapped.** Possible if there's a stale dst_addr cached somewhere, or if the IOMMU translation differs between hardware and CPU. Less likely given DEC_RDY=1 and no IOMMU faults, but worth a sanity check. Iter6 falsifier: log `vb2_dma_contig_plane_dma_addr(&dst_buf->vb2_buf, 0)` vs the CAPTURE buffer's vma->vm_pgoff*4096; they should match.
3. **Hardware "successful decode" is actually a no-op.** The rkvdec sees the SPS bit_depth/chroma/dimensions, allocates the right output sizes, and asserts DEC_RDY because the pipeline registers look valid, but the actual entropy decode loop never runs because some other register is mis-programmed (e.g., a "decode enable" beyond `VDPU381_DEC_E_BIT`). Low priority: a decode that never runs would normally show a timeout or error.
4. **The Casanova v7.0 series has a third bug.** Maybe the SLICE_PARAMS layout my iter4 patch registered with `cfg.dims = { 600 }` and `DYNAMIC_ARRAY` is wrong, and the hardware silently processes garbage slice headers. Counter-evidence: visl uses identical `dims = { 600 }`. Iter6 falsifier: capture iter5-DIAG `validate_sps` per-frame logs and compare them with what should fire; if the SPS reaches the kernel with correct dims, this is ruled out.
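The hypothesis-1 falsifier amounts to reading the exported buffer inside a DMA-BUF CPU-access bracket. Below is a minimal sketch under stated assumptions: the flag values and `struct dma_buf_sync { __u64 flags; }` come from `<linux/dma-buf.h>`, the ioctl number is computed with the asm-generic `_IOW` encoding (as used on arm64 and x86), and `read_plane_synced` is a hypothetical helper that takes an already-exported dma-buf fd. Obtaining that fd from VAExportSurfaceHandle is outside the sketch, and it has not been run against the RK3588 setup described here.

```python
import fcntl
import mmap
import struct

# Flag values from <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_WRITE = 2 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2


def _iow(type_char: str, nr: int, size: int) -> int:
    """asm-generic _IOW(type, nr, size): dir=write in bits 30-31, size in bits 16-29."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# _IOW('b', 0, struct dma_buf_sync); the struct is a single __u64.
DMA_BUF_IOCTL_SYNC = _iow("b", 0, 8)


def dma_buf_sync(fd: int, flags: int) -> None:
    """Issue DMA_BUF_IOCTL_SYNC with the given flags on a dma-buf fd."""
    fcntl.ioctl(fd, DMA_BUF_IOCTL_SYNC, struct.pack("=Q", flags))


def read_plane_synced(dmabuf_fd: int, offset: int, length: int) -> bytes:
    """Map the dma-buf read-only and copy one plane out inside a sync bracket."""
    gran = mmap.ALLOCATIONGRANULARITY
    base = offset - (offset % gran)   # mmap offsets must be granularity-aligned
    delta = offset - base
    with mmap.mmap(dmabuf_fd, delta + length, mmap.MAP_SHARED,
                   mmap.PROT_READ, offset=base) as m:
        dma_buf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
        try:
            return bytes(m[delta:delta + length])
        finally:
            dma_buf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)


print(hex(DMA_BUF_IOCTL_SYNC))
```

If the bytes returned by `read_plane_synced` differ from the uniform `0x10`/`0x80` content seen through the backend's cached mmap, hypothesis 1 gains support; if they are also uniform, attention shifts to hypotheses 2 through 4.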
## Substrate state at close
- Kernel: ampere `7.0.0-rc3-devices+` carries iter3 fix (rkvdec-hevc-common.c preamble NULL init) + iter4 fix (rkvdec.c SLICE_PARAMS registration in vdpu38x_hevc_ctrl_descs) + iter5 diagnostic `pr_warn` in `vdpu381_irq_handler` (cheap, fires per IRQ).
- Backend `.so`: unchanged md5 `404041ea2dcc03c769e0ab8c43ddadd6` on `/usr/lib/dri/`.
- `/sys/.../dev_debug` left at `0x3f`. Reset to `0` for production.
- Three pending kernel-agent issues: #14 (iter3, filed + verified), #15 (iter4, filed + verified), #16 (iter5, **TBD** — file with the empirical "DEC_RDY but black" finding and the 4-hypothesis ladder).
## Phase 6 question completion (iter5)
| Q | Answer |
|---|--------|
| Q1 — is OUTPUT bitstream correct? | YES — byte-identical to raw HEVC NAL (with Annex-B start code prepended as required). |
| Q2 — does HW IRQ report success? | YES — `STA_INT=0x107 DEC_RDY=1` on every IRQ. |
| Q3 — is CAPTURE buffer being read from the right place? | TBD (iter6 H1) — coherency or address-mismatch hypothesis. |
| Q4 — fallback decode path (v4l2request direct) | Fails at EINVAL too — separate control-shape bug, not in iter5 scope. |
## Iter5 takeaway
iter5 narrowed the "all-black output" bug from a giant unknown to a precise hand-off: the kernel/HW pipeline succeeds AND the userspace sees uniform NV12-black. The single most likely cause is the well-known RK-family pattern of vb2_dma_contig + cached mmap NOT being coherent with hardware writes — which is exactly what the fresnel-fourier campaign already documented and worked around. Iter6 starts with the DMA-BUF sync verifier.