Markus Fritsche 1594def84e iter5 close — HW reports DEC_RDY but CAPTURE is uniform black
Post-iter3+iter4 patches: per-frame S_EXT_CTRLS succeeds, OUTPUT
bitstream is byte-identical to raw HEVC, hardware IRQ reports
STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0 on every decode, zero
IOMMU faults. But γ-dump shows CAPTURE plane[0]=uniform 0x10 (Y),
plane[1]=uniform 0x80 (CbCr) — video black.

Leading hypothesis for iter6: cache coherency between hardware-
written DMA buffer and userspace cached mmap — same pattern as
RK3399 documented in feedback_rockchip_pixel_verify_path. Iter6
falsifier: VAExportSurfaceHandle → DMA-BUF → DMA_BUF_IOCTL_SYNC,
read. If real content visible, coherency confirmed.

Three open kernel-agent issues: #14 (iter3, verified), #15 (iter4,
verified), #16 TBD (iter5 finding).

Substrate: ampere kernel carries iter3 + iter4 + iter5 IRQ pr_warn.
Backend .so unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:43:59 +00:00

# Iter5 close — hardware reports DEC_RDY but CAPTURE buffer is uniform black
Date: 2026-05-16 (afternoon, immediately following iter4 close)
Branch: `master`
Substrate: ampere `7.0.0-rc3-devices+` with iter3 + iter4 kernel patches applied. Backend iter3 instrumented build (md5 `404041ea2dcc03c769e0ab8c43ddadd6`).
## Bottom line
After iter3 (ext_sps NULL init) + iter4 (HEVC_SLICE_PARAMS registration), the entire HEVC submission pipeline is structurally clean: all 5 controls commit per frame, hardware reports `VDPU381_STA_INT_DEC_RDY_STA=1` on every IRQ, no timeouts, no IOMMU faults, no kernel errors. **And yet** every CAPTURE buffer plane comes back uniform: Y=`0x10` everywhere, CbCr=`0x80` everywhere — i.e. "video black" in NV12 studio-range.
The decoder claims success, the buffer is the right size, the right thing was fed in — but the content is empty.
## Falsifier outcome
F1 (per-frame S_EXT_CTRLS still rejected post-iter4): **FALSE**, confirmed via dmesg with `dev_debug=0x3f`. Lines like `VIDIOC_S_EXT_CTRLS: which=0xf010000, count=5, error_idx=4, request_fd=NN` now appear without an `error -22` prefix, so the batch is accepted. Here `error_idx=4` is the last-processed index (success), not a failure indicator (a failure would set `error_idx=count=5`).
F2 (backend feeds wrong OUTPUT bitstream): **FALSE**. `LIBVA_V4L2_DUMP_OUTPUT=/tmp/iter5_out` dumped per-frame OUTPUT payloads. The frame-1 dump is 280 bytes (`00 00 01 28 01 af 1d 18 68 17 59 55 54 51 34 d2 ...`). Compared against the raw stream extracted with `ffmpeg -c copy -f hevc /tmp/raw.hevc` (mp4 uses a 4-byte length prefix; the first NAL is 277 bytes starting at offset 4, with NAL header byte `28` = IDR_N_LP), the backend's dump is the Annex-B 3-byte start code (`00 00 01`) prepended to the same 277 NAL bytes, and 3 + 277 = 280. **Byte-identical**.
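The F2 comparison can be reproduced with a few lines of Python. This is a minimal sketch, not the tooling actually used: the helper names are hypothetical, and because only the first 16 bytes of the dump are quoted above, the length-prefixed reference is synthesized from those same bytes for illustration.

```python
ANNEXB_START_CODE = b"\x00\x00\x01"


def hevc_nal_unit_type(nal: bytes) -> int:
    """Return nal_unit_type from the 2-byte HEVC NAL unit header.

    Header layout: forbidden_zero_bit(1) | nal_unit_type(6) |
    nuh_layer_id(6) | nuh_temporal_id_plus1(3).
    """
    return (nal[0] >> 1) & 0x3F


def nal_from_annexb(dump: bytes) -> bytes:
    """Strip the 3-byte Annex-B start code from a single-NAL dump."""
    if not dump.startswith(ANNEXB_START_CODE):
        raise ValueError("dump does not start with 00 00 01")
    return dump[len(ANNEXB_START_CODE):]


def first_nal_from_length_prefixed(stream: bytes) -> bytes:
    """Return the first NAL of a stream that uses 4-byte big-endian lengths."""
    size = int.from_bytes(stream[:4], "big")
    return stream[4:4 + size]


# First bytes of the frame-1 dump quoted above (the real dump is 280 bytes).
dump_head = bytes.fromhex("00 00 01 28 01 af 1d 18 68 17 59 55 54 51 34 d2")
nal = nal_from_annexb(dump_head)

# Synthetic stand-in for the reference stream: same NAL, length-prefixed.
reference = len(nal).to_bytes(4, "big") + nal

print(hevc_nal_unit_type(nal))                         # 20, i.e. IDR_N_LP
print(first_nal_from_length_prefixed(reference) == nal)
```

With the real files, `dump_head` and `reference` would be replaced by the full contents of the frame-1 dump and the reference stream.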
F3 (hardware times out / errors / IOMMU faults): **FALSE** — diagnostic `pr_warn` added to `vdpu381_irq_handler` logged `STA_INT=0x00000107 DEC_RDY=1 TIMEOUT=0 ERROR=0 SOFTRESET=0` for every IRQ across all 15 attempted decodes. Zero `iommu`/`smmu`/`fault` lines in dmesg. Hardware itself reports successful decode.
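The F3 check is easy to automate over a saved dmesg capture. The sketch below assumes the `pr_warn` line has exactly the field format quoted above; the function name and the timestamp/driver prefix in the sample are hypothetical.

```python
import re

# Field layout taken from the pr_warn line quoted in F3.
IRQ_RE = re.compile(
    r"STA_INT=0x([0-9a-fA-F]+) DEC_RDY=(\d) TIMEOUT=(\d) ERROR=(\d) SOFTRESET=(\d)"
)
# Same keywords as the manual dmesg search; note "fault" also matches "default".
FAULT_RE = re.compile(r"iommu|smmu|fault", re.IGNORECASE)


def summarize_irqs(dmesg_text: str) -> dict:
    """Count IRQ status lines, non-clean IRQs, and fault-looking lines."""
    irqs = bad_irqs = fault_lines = 0
    for line in dmesg_text.splitlines():
        match = IRQ_RE.search(line)
        if match:
            irqs += 1
            dec_rdy, timeout, error, softreset = (int(g) for g in match.groups()[1:])
            if not (dec_rdy == 1 and timeout == 0 and error == 0 and softreset == 0):
                bad_irqs += 1
        elif FAULT_RE.search(line):
            fault_lines += 1
    return {"irqs": irqs, "bad_irqs": bad_irqs, "fault_lines": fault_lines}


# Hypothetical sample: the prefix is invented, the status fields are from F3.
sample = "\n".join(
    ["[  100.000000] rkvdec: STA_INT=0x00000107 DEC_RDY=1 TIMEOUT=0 ERROR=0 SOFTRESET=0"] * 15
)
print(summarize_irqs(sample))
```

A clean run in the F3 sense is one where `irqs` equals the number of attempted decodes and both `bad_irqs` and `fault_lines` are zero.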
F4 (CAPTURE buffer reaches userspace with hardware-decoded data): **FALSE**. `LIBVA_V4L2_DUMP_CAPTURE=1` (γ-dump), run immediately after DQBUF + mark_decoded, scans the CPU-visible mmap and finds plane[0] Y = uniform `0x10` (20480/20480 bytes non-zero, but ALL of them are `0x10`) and plane[1] CbCr = uniform `0x80`. The hardware "wrote successfully", but the CPU side reads back video-black.
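F4 shows why a non-zero byte count alone is a misleading health check: a black frame is 100% non-zero. A small sketch of a stricter check follows, counting distinct byte values per plane. The function names are hypothetical, and the CbCr size of 10240 bytes is an assumption derived from 4:2:0 subsampling of a 20480-byte Y plane, ignoring any stride padding.

```python
from collections import Counter


def plane_summary(plane: bytes) -> dict:
    """Summarize a plane: size, non-zero count, distinct values, dominant value."""
    if not plane:
        raise ValueError("empty plane")
    counts = Counter(plane)
    dominant, _ = counts.most_common(1)[0]
    return {
        "size": len(plane),
        "nonzero": len(plane) - counts.get(0, 0),
        "distinct": len(counts),
        "dominant": dominant,
    }


def looks_video_black(y_plane: bytes, cbcr_plane: bytes) -> bool:
    """True if both planes are uniform studio-range black (Y=0x10, CbCr=0x80)."""
    return set(y_plane) == {0x10} and set(cbcr_plane) == {0x80}


# Synthetic planes matching the F4 observation.
y = bytes([0x10]) * 20480
cbcr = bytes([0x80]) * 10240  # assumed size, see lead-in

print(plane_summary(y))
print(looks_video_black(y, cbcr))
```

A real decoded frame would be expected to report many distinct values per plane, so `distinct == 1` is the signal worth alerting on.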
F5 (different decode-path verifier sees real content): **inconclusive** — ffmpeg-v4l2request direct path (bypasses libva) fails with EINVAL even post-iter4 patches (separate path with its own control-shape mismatch — out of scope for iter5). mpv `--vo=drm` was blocked by DRM master held by SDDM (couldn't switch consoles inside ssh). DRM_PRIME export path verification pending.
## Hypotheses for iter6 (ranked)
1. **Cache coherency between hardware-written DMA buffer and userspace cached mmap** (leading hypothesis). RK3399 has the same pattern documented in `feedback_rockchip_pixel_verify_path.md`: vaDeriveImage / cached-mmap returns all-zero on RK3399 because the userspace mmap is CPU-cached and hardware DMA writes don't invalidate the cache. RK3588 could have the same issue with vb2_dma_contig + cached mmap. The `0x10`/`0x80` content might then be a stale baseline still held in the CPU cache, for example from the buffer having been cleared to NV12 black before the decode. Iter6 falsifier: export the CAPTURE buffer as DMA-BUF (DRM_PRIME via VAExportSurfaceHandle), map it through its own fd, and read it inside a DMA_BUF_IOCTL_SYNC bracket. If THAT path shows real decoded content, coherency is confirmed. The fix would be either (a) a coherent mmap mode in the V4L2 driver, or (b) having the backend issue dma-buf sync ioctls before reading the mmap.
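2. **Hardware writes to a different physical address than what's mmapped**: possible if a stale dst_addr is cached somewhere, or if the IOMMU translation differs between hardware and CPU. Less likely given DEC_RDY=1 and no IOMMU faults, but worth a sanity check. Iter6 falsifier: log `vb2_dma_contig_plane_dma_addr(&dst_buf->vb2_buf, 0)` and compare it against the CAPTURE buffer's `vma->vm_pgoff << PAGE_SHIFT`; they should refer to the same buffer. Caveat: vb2 normally treats the mmap offset as a per-plane lookup cookie rather than an address, so the comparison may need to go through the buffer index instead of a raw numeric match.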
3. **Hardware "successful decode" is actually a no-op** — the rkvdec sees the SPS bit_depth/chroma/dimensions, allocates the right output sizes, asserts DEC_RDY because the pipeline registers look valid, but the actual entropy decode loop never runs because some other register is mis-programmed (e.g., a "decode enable" beyond `VDPU381_DEC_E_BIT`). Lowest priority — would normally show timeout/error.
4. **The Casanova v7.0 series has a third bug**: maybe the SLICE_PARAMS layout my iter4 patch registered with `cfg.dims = { 600 }` and `DYNAMIC_ARRAY` is wrong, and the hardware silently processes garbage slice headers. Counter-evidence: visl uses identical `dims = { 600 }`. Iter6 falsifier: capture iter5-DIAG `validate_sps` per-frame logs vs what should fire — if SPS reaches kernel with correct dims, this is ruled out.
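For the hypothesis 1 falsifier, the read side of the DMA-BUF path can be sketched as below. This is an illustration under stated assumptions rather than the planned iter6 tool: `read_dmabuf_synced` is a hypothetical helper, the fd is assumed to come from the DRM_PRIME descriptor returned by VAExportSurfaceHandle, and the ioctl number is built with the generic Linux `_IOW` encoding (as used on arm64). The flag values are those of the `<linux/dma-buf.h>` uapi header.

```python
import fcntl
import mmap
import struct

# Flag values from the Linux uapi header <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2


def _iow(type_char: str, nr: int, size: int) -> int:
    """Build an _IOW ioctl number using the generic Linux encoding."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# struct dma_buf_sync { __u64 flags; } is 8 bytes.
DMA_BUF_IOCTL_SYNC = _iow("b", 0, 8)


def read_dmabuf_synced(dmabuf_fd: int, length: int) -> bytes:
    """Read `length` bytes from a dma-buf fd inside a SYNC_START/SYNC_END bracket."""
    buf = mmap.mmap(dmabuf_fd, length, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    try:
        # Tell the exporter the CPU is about to read, so caches can be invalidated.
        fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC,
                    struct.pack("=Q", DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ))
        try:
            return bytes(buf[:length])
        finally:
            # Always close the bracket, even if the copy fails.
            fcntl.ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC,
                        struct.pack("=Q", DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ))
    finally:
        buf.close()


print(hex(DMA_BUF_IOCTL_SYNC))
```

If the bytes returned through this path differ from the uniform `0x10`/`0x80` seen through the cached V4L2 mmap for the same frame, that would support hypothesis 1; identical uniform content would push attention toward hypotheses 2 to 4.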
## Substrate state at close
- Kernel: ampere `7.0.0-rc3-devices+` carries iter3 fix (rkvdec-hevc-common.c preamble NULL init) + iter4 fix (rkvdec.c SLICE_PARAMS registration in vdpu38x_hevc_ctrl_descs) + iter5 diagnostic `pr_warn` in `vdpu381_irq_handler` (cheap, fires per IRQ).
- Backend `.so`: unchanged md5 `404041ea2dcc03c769e0ab8c43ddadd6` on `/usr/lib/dri/`.
- `/sys/.../dev_debug` left at `0x3f`. Reset to `0` for production.
- Three pending kernel-agent issues: #14 (iter3, filed + verified), #15 (iter4, filed + verified), #16 (iter5, **TBD** — file with the empirical "DEC_RDY but black" finding and the 4-hypothesis ladder).
## Phase 6 question completion (iter5)
| Q | Answer |
|---|--------|
| Q1 — is OUTPUT bitstream correct? | YES — byte-identical to raw HEVC NAL (with Annex-B start code prepended as required). |
| Q2 — does HW IRQ report success? | YES — `STA_INT=0x107 DEC_RDY=1` on every IRQ. |
| Q3 — is CAPTURE buffer being read from the right place? | TBD (iter6 H1) — coherency or address-mismatch hypothesis. |
| Q4 — fallback decode path (v4l2request direct) | Fails at EINVAL too — separate control-shape bug, not in iter5 scope. |
## Iter5 takeaway
Iter5 narrowed the "all-black output" bug from a giant unknown to a precise hand-off: the kernel/HW pipeline succeeds AND userspace sees uniform NV12-black. The single most likely cause is the well-known RK-family pattern of vb2_dma_contig + cached mmap NOT being coherent with hardware writes, which is exactly what the fresnel-fourier campaign already documented and worked around. Iter6 starts with the DMA-BUF sync verifier.