Post-iter3+iter4 patches: per-frame S_EXT_CTRLS succeeds, OUTPUT bitstream is byte-identical to raw HEVC, hardware IRQ reports STA_INT=0x107 DEC_RDY=1 TIMEOUT=0 ERROR=0 on every decode, zero IOMMU faults. But γ-dump shows CAPTURE plane[0]=uniform 0x10 (Y), plane[1]=uniform 0x80 (CbCr), i.e. video black. Leading hypothesis for iter6: cache coherency between hardware-written DMA buffer and userspace cached mmap, the same pattern as RK3399 documented in feedback_rockchip_pixel_verify_path. Iter6 falsifier: VAExportSurfaceHandle → DMA-BUF → DMA_BUF_IOCTL_SYNC, read. If real content is visible, coherency is confirmed. Three open kernel-agent issues: #14 (iter3, verified), #15 (iter4, verified), #16 TBD (iter5 finding). Substrate: ampere kernel carries iter3 + iter4 + iter5 IRQ pr_warn. Backend .so unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iter5 close — hardware reports DEC_RDY but CAPTURE buffer is uniform black
Date: 2026-05-16 (afternoon, immediately following iter4 close)
Branch: master
Substrate: ampere 7.0.0-rc3-devices+ with iter3 + iter4 kernel patches applied. Backend iter3 instrumented build (md5 404041ea2dcc03c769e0ab8c43ddadd6).
Bottom line
After iter3 (ext_sps NULL init) + iter4 (HEVC_SLICE_PARAMS registration), the entire HEVC submission pipeline is structurally clean: all 5 controls commit per frame, hardware reports VDPU381_STA_INT_DEC_RDY_STA=1 on every IRQ, no timeouts, no IOMMU faults, no kernel errors. And yet every CAPTURE buffer plane comes back uniform: Y=0x10 everywhere, CbCr=0x80 everywhere — i.e. "video black" in NV12 studio-range.
The decoder claims success, the buffer is the right size, the right thing was fed in — but the content is empty.
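The "uniform plane" observation can be checked mechanically. The following is a minimal sketch, not the backend's actual γ-dump code: `plane_summary` and `looks_video_black` are hypothetical helper names, the buffers are synthetic stand-ins, and the 10240-byte CbCr size is an assumption derived from NV12 4:2:0 layout (half the 20480-byte Y plane). It illustrates why a non-zero byte count alone cannot tell decoded content from a constant fill.

```python
from collections import Counter


def plane_summary(plane: bytes) -> dict:
    """Summarise a raw plane: size, non-zero byte count, distinct byte values."""
    counts = Counter(plane)
    return {
        "size": len(plane),
        "nonzero": len(plane) - counts.get(0, 0),
        "distinct": sorted(counts),
    }


def looks_video_black(y_plane: bytes, cbcr_plane: bytes) -> bool:
    """True when Y is uniformly 0x10 and CbCr is uniformly 0x80 (studio-range black)."""
    return (
        plane_summary(y_plane)["distinct"] == [0x10]
        and plane_summary(cbcr_plane)["distinct"] == [0x80]
    )


# Synthetic stand-ins for the dumped planes (sizes are assumptions, see above).
y = bytes([0x10]) * 20480
cbcr = bytes([0x80]) * 10240

summary = plane_summary(y)
print(summary["nonzero"], "of", summary["size"], "non-zero; distinct:", summary["distinct"])
print("video black:", looks_video_black(y, cbcr))
```

Every Y byte counts as non-zero here, yet the plane holds a single distinct value, so the distinct-value set is the more useful signal.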
Falsifier outcome
F1 (per-frame S_EXT_CTRLS still rejected post-iter4): FALSE — confirmed via dmesg with dev_debug=0x3f. Lines like VIDIOC_S_EXT_CTRLS: which=0xf010000, count=5, error_idx=4, request_fd=NN now appear WITHOUT an error -22 prefix → batch accepted. error_idx=4 here is the last-processed index (success), not a failure indicator (which would set error_idx=count=5).
F2 (backend feeds wrong OUTPUT bitstream): FALSE — LIBVA_V4L2_DUMP_OUTPUT=/tmp/iter5_out dumped per-frame OUTPUT payloads. Frame-1 dump is 280 bytes (00 00 01 28 01 af 1d 18 68 17 59 55 54 51 34 d2 ...). The reference is the raw stream extracted with ffmpeg -c copy -f hevc /tmp/raw.hevc (mp4 uses a 4-byte length prefix; the first NAL is 277 bytes starting at offset 4, with NAL header 28 = IDR_N_LP). The backend's dump is an Annex-B 3-byte start code (00 00 01) prepended to the same 277 NAL bytes: 3 + 277 = 280. Byte-identical.
F3 (hardware times out / errors / IOMMU faults): FALSE — diagnostic pr_warn added to vdpu381_irq_handler logged STA_INT=0x00000107 DEC_RDY=1 TIMEOUT=0 ERROR=0 SOFTRESET=0 for every IRQ across all 15 attempted decodes. Zero iommu/smmu/fault lines in dmesg. Hardware itself reports successful decode.
F4 (CAPTURE buffer reaches userspace with hardware-decoded data): FALSE — LIBVA_V4L2_DUMP_CAPTURE=1 (γ-dump) immediately after DQBUF + mark_decoded scans the CPU-visible mmap and finds plane[0] Y = uniform 0x10 (20480/20480 non-zero, but ALL bytes are 0x10), plane[1] CbCr = uniform 0x80. The hardware "wrote successfully" but the CPU side reads back video-black.
F5 (different decode-path verifier sees real content): inconclusive — ffmpeg-v4l2request direct path (bypasses libva) fails with EINVAL even post-iter4 patches (separate path with its own control-shape mismatch — out of scope for iter5). mpv --vo=drm was blocked by DRM master held by SDDM (couldn't switch consoles inside ssh). DRM_PRIME export path verification pending.
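The F2 comparison can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script used in iter5: `length_prefixed_to_annexb` and `hevc_nal_type` are hypothetical names, and the sample NAL is synthetic (only the 277-byte length and the leading 28 01 header bytes come from the findings above). It converts 4-byte length-prefixed NAL units to 3-byte Annex-B framing, so the result can be compared byte-for-byte with an OUTPUT dump, and extracts the NAL unit type from the first header byte.

```python
def length_prefixed_to_annexb(data: bytes, prefix_len: int = 4) -> list:
    """Split length-prefixed NAL units and re-frame each with a 00 00 01 start code."""
    units = []
    pos = 0
    # Trailing bytes shorter than one length prefix are ignored.
    while pos + prefix_len <= len(data):
        size = int.from_bytes(data[pos:pos + prefix_len], "big")
        pos += prefix_len
        nal = data[pos:pos + size]
        if len(nal) != size:
            raise ValueError("truncated NAL unit")
        units.append(b"\x00\x00\x01" + nal)
        pos += size
    return units


def hevc_nal_type(first_header_byte: int) -> int:
    """HEVC nal_unit_type is the 6 bits following the forbidden_zero_bit."""
    return (first_header_byte >> 1) & 0x3F


# Synthetic 277-byte NAL: real header bytes 28 01, zero-filled payload (assumption).
nal = bytes([0x28, 0x01]) + bytes(275)
sample = len(nal).to_bytes(4, "big") + nal

units = length_prefixed_to_annexb(sample)
print(len(units[0]), "bytes after re-framing")
print("nal_unit_type:", hevc_nal_type(units[0][3]))
```

With real data, `units[0]` would be compared against the frame-1 dump; a match means the backend only changed the framing, not the payload.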
Hypotheses for iter6 (ranked)
- Cache coherency between hardware-written DMA buffer and userspace cached mmap (leading hypothesis). RK3399 has the same pattern documented in feedback_rockchip_pixel_verify_path.md: vaDeriveImage / cached-mmap returns all-zero on RK3399 because the userspace mmap is CPU-cached and hardware DMA writes don't invalidate the cache. RK3588 could have the same issue with vb2_dma_contig + cached mmap. The 0x10/0x80 content might be page-allocator's pre-zero pattern after V4L2 cleared the buffer to NV12 black baseline. Iter6 falsifier: export the CAPTURE buffer as DMA-BUF (DRM_PRIME via VAExportSurfaceHandle), import into a separate fd, use DMA_BUF_IOCTL_SYNC + read. If THAT path shows real decoded content, coherency is confirmed. Fix would be to either (a) use coherent mmap mode on the v4l2 driver, OR (b) backend uses dma-buf-sync ioctls before reading mmap.
- Hardware writes to a different physical address than what's mmapped. Possible if there's a stale dst_addr cached somewhere, or if the iommu translation differs between hardware and CPU. Less likely given DEC_RDY=1 and no IOMMU faults, but worth a sanity check. Iter6 falsifier: log vb2_dma_contig_plane_dma_addr(&dst_buf->vb2_buf, 0) vs the CAPTURE buffer's vma->vm_pgoff*4096; they should match.
- Hardware "successful decode" is actually a no-op. The rkvdec sees the SPS bit_depth/chroma/dimensions, allocates the right output sizes, asserts DEC_RDY because the pipeline registers look valid, but the actual entropy decode loop never runs because some other register is mis-programmed (e.g., a "decode enable" beyond VDPU381_DEC_E_BIT). Lowest priority: would normally show timeout/error.
- The Casanova v7.0 series has a third bug: maybe the SLICE_PARAMS layout my iter4 patch registered with cfg.dims = { 600 } and DYNAMIC_ARRAY is wrong, and the hardware silently processes garbage slice headers. Counter-evidence: visl uses identical dims = { 600 }. Iter6 falsifier: capture iter5-DIAG validate_sps per-frame logs vs what should fire. If SPS reaches kernel with correct dims, this is ruled out.
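The read side of the leading hypothesis's falsifier could look roughly like the sketch below. It is a sketch under assumptions, not the backend's code: `dma_buf_sync` and `read_synced` are hypothetical helpers, `fd` is assumed to be a DMA-BUF file descriptor already obtained from the export step, and the ioctl number is computed with the asm-generic encoding (which arm64 uses). The flag values and the `struct dma_buf_sync { __u64 flags; }` layout are taken from `<linux/dma-buf.h>`.

```python
import fcntl
import mmap
import struct

# Flag values from <linux/dma-buf.h>.
DMA_BUF_SYNC_READ = 1 << 0
DMA_BUF_SYNC_WRITE = 2 << 0
DMA_BUF_SYNC_START = 0 << 2
DMA_BUF_SYNC_END = 1 << 2


def _iow(type_char: str, nr: int, size: int) -> int:
    """_IOW() in the asm-generic encoding: dir<<30 | size<<16 | type<<8 | nr."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# DMA_BUF_BASE is 'b'; the argument is struct dma_buf_sync, a single __u64.
DMA_BUF_IOCTL_SYNC = _iow("b", 0, 8)


def dma_buf_sync(fd: int, flags: int) -> None:
    """Issue DMA_BUF_IOCTL_SYNC on a DMA-BUF fd with the given flags."""
    fcntl.ioctl(fd, DMA_BUF_IOCTL_SYNC, struct.pack("=Q", flags))


def read_synced(fd: int, size: int) -> bytes:
    """Map the DMA-BUF read-only and copy it out inside a START/END sync bracket."""
    with mmap.mmap(fd, size, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as view:
        dma_buf_sync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ)
        try:
            return view[:size]
        finally:
            dma_buf_sync(fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ)


print(hex(DMA_BUF_IOCTL_SYNC))
```

If bytes returned by `read_synced` differ from the uniform 0x10/0x80 pattern seen through the cached mmap, that would support the coherency hypothesis; identical uniform content would point toward the remaining hypotheses instead.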
Substrate state at close
- Kernel: ampere 7.0.0-rc3-devices+ carries iter3 fix (rkvdec-hevc-common.c preamble NULL init) + iter4 fix (rkvdec.c SLICE_PARAMS registration in vdpu38x_hevc_ctrl_descs) + iter5 diagnostic pr_warn in vdpu381_irq_handler (cheap, fires per IRQ).
- Backend .so: unchanged md5 404041ea2dcc03c769e0ab8c43ddadd6 on /usr/lib/dri/.
- /sys/.../dev_debug left at 0x3f. Reset to 0 for production.
- Three pending kernel-agent issues: #14 (iter3, filed + verified), #15 (iter4, filed + verified), #16 (iter5, TBD; file with the empirical "DEC_RDY but black" finding and the 4-hypothesis ladder).
Phase 6 question completion (iter5)
| Q | Answer |
|---|---|
| Q1 — is OUTPUT bitstream correct? | YES — byte-identical to raw HEVC NAL (with Annex-B start code prepended as required). |
| Q2 — does HW IRQ report success? | YES — STA_INT=0x107 DEC_RDY=1 on every IRQ. |
| Q3 — is CAPTURE buffer being read from the right place? | TBD (iter6 H1) — coherency or address-mismatch hypothesis. |
| Q4 — fallback decode path (v4l2request direct) | Fails at EINVAL too — separate control-shape bug, not in iter5 scope. |
Iter5 takeaway
iter5 narrowed the "all-black output" bug from a giant unknown to a precise hand-off: the kernel/HW pipeline succeeds AND the userspace sees uniform NV12-black. The single most likely cause is the well-known RK-family pattern of vb2_dma_contig + cached mmap NOT being coherent with hardware writes — which is exactly what the fresnel-fourier campaign already documented and worked around. Iter6 starts with the DMA-BUF sync verifier.