Files
ampere-vp9-enablement/phase3_close.md
T
claude-noether 53df917afb Phase 3 close: first-light FAILED across 6 iterations — HW stalls
Module loads + format enumerates (kernel-side wiring works) but every
VP9 decode attempt stalls the HW: STA_INT=0x23 (IRQ_RAW + TIMEOUT) when
HW timeout enabled; NO IRQ at all when timeout disabled (genuine stall).

6 iterations of register-tuning all failed identically. Hypothesis
space narrowed to 3 structural-level possibilities:

1. Probability buffer format mismatch (legacy struct rkvdec_vp9_probs
   vs BSP hal_vp9d_prob_default format)
2. Missing kernel-side init (cache config, SRAM/QoS, AXI setup that
   BSP mpp_rkvdec2 does but mainline HEVC happens not to need)
3. vdpu381 register layout doesn't fully expose VP9 control surface
   (BSP MPP may rely on mpp_dev_set_reg_offset translations we don't
   replicate)

Recommended next path: Sonnet/Janet architect review BEFORE more
iteration. The 6 single-bit tuning rounds established that no bit-flip
fix is sufficient — structural rethink required.

Sibling-campaign close state (HEVC bit-perfect) recoverable on ampere
via backup at ~/vp9-iter1-backup/rockchip-vdec.ko.sibling-campaign-close
+ depmod cycle.

Phase 0-2.1 work preserved at boltzmann:~/src/linux-rockchip:
vp9-enablement-iter1 (6 commits, 1517 LoC). Format enumeration on
/dev/video1 confirms kernel-side wiring is correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 06:14:06 +00:00

96 lines
6.7 KiB
Markdown

# Phase 3 close — first-light HW STALL across 6 iterations
Date: 2026-05-17 ~08:15. Phase 3 (install + first decode) did NOT achieve first-light. 6 iterations of register-tuning all fail with the same pattern.
## What works
- Module builds + loads + format enumerates: `v4l2-ctl --device=/dev/video1 --list-formats-out` shows `VP9F` alongside `S265 (HEVC)` and `S264 (H264)`. Kernel-side wiring (vdpu381_coded_fmts[] entry, ops table, sysfs/uapi exposure) is correct.
- libva backend (iter38b) accepts VP9 frames and pushes to V4L2 OUTPUT.
- DMA addresses look reasonable (IOMMU-virtual high addrs).
- HEVC + H.264 unchanged from sibling-campaign-close (bit-perfect).
## What fails
Every VP9 decode attempt times out at the hardware level. With `reg011.dec_timeout_e = 1` (HW timeout enabled):
```
rkvdec vdpu381 IRQ #0: STA_INT=0x00000023 DEC_RDY=0 ERROR=0 TIMEOUT=1 SOFTRST=0
```
0x23 = `IRQ_RAW (BIT 1) | TIMEOUT (BIT 5) | IRQ (BIT 0)`. HW engaged, ran its internal timer to threshold, fired IRQ with TIMEOUT bit.
With `dec_timeout_e = 0` (HW timeout disabled): **HW never fires any IRQ**. Kernel software watchdog reports `Frame processing timed out!` every 40ms but no hardware IRQ. Confirms HW is genuinely stuck, not just slow.
## Iterations attempted (all on `boltzmann:~/src/linux-rockchip:vp9-enablement-iter1`)
| Iter | Change | Result |
|---|---|---|
| 1 | Baseline (Phase 2.1 complete: 4-segment writes, RCB, block-gating, deltas, prob aliasing) | TIMEOUT |
| 2 | +12 BSP fields (reg010.dec_e, reg011 err_fills, reg012.colmv_compress, reg013.cur_pic_is_idr, reg026.busifd_auto_gating, reg028.poc_arb, full reg103 flag set, lasttile = stream_len - first_part_size, stream_len padding to +0x80) | TIMEOUT (same bit pattern) |
| 3 | iter2 + `prob_update_en = 0` (rule out priv_tbl.probs format mismatch) | TIMEOUT (no change) |
| 3b | iter3 + `cprhdr_offset = 0` (BSP semantic) | TIMEOUT (no change) |
| 4 | HEVC-aligned (slice_num=1, pix_range_det, raw stream_len no padding, bytesperline-based strides) | TIMEOUT (no change) |
| 5 | iter4 + reg160/162/172 = 0 (NULL all prob bases — rule out prob buffer entirely) | TIMEOUT (no change) |
| 6 | iter5 + dec_timeout_e = 0 | **NO IRQ AT ALL** (genuine HW stall) |
## Diagnostic data captured
```
vp9 #0: stream_len=140 uncomp_hdr=18 comp_hdr=10 cprhdr_off=0 lasttile=130
rlc_base=0xfcf00000 decout=0xffe00000 first16=...
```
- 140-byte first frame is plausible (black fade-in keyframe per sibling-campaign discovery).
- 18-byte uncompressed + 10-byte compressed header agree with VP9 spec.
- DMA addrs are IOMMU-virtual (rkvdec0_mmu enabled).
- `vb2_plane_vaddr` returned NULL — backend uses dmabuf import, not MMAP. CPU-side byte-dump can't validate stream contents but HW reads via DMA addr regardless.
## Most-likely cause (architect review territory)
The hypothesis space narrowed to three structural-level possibilities:
1. **Probability buffer format mismatch**: legacy `struct rkvdec_vp9_probs` (vdpu34x layout) ≠ what vdpu381 HW expects. BSP MPP has `hal_vp9d_prob_default()` writing a different format. Iter5 (NULL bases) didn't help — but HW may still try to read at address 0 and fault silently.
2. **Missing kernel-side init (cache / SRAM / AXI-QoS)**: BSP `mpp_rkvdec2.c::rkvdec2_run` writes RKVDEC_REG_CACHE0/1/2_SIZE_BASE and clears caches before each decode; also writes statistic-block setup (reg256/257/270 — QoS). Mainline HEVC doesn't do these and works — but maybe VP9 specifically requires them.
3. **vdpu381 register layout doesn't fully expose VP9 control surface**: Casanova's v7.0 mainline series added vdpu381 register definitions for HEVC + H.264 only. The struct definitions we extended in `rkvdec-vdpu381-regs.h` are based on BSP `vdpu382_vp9d.h` — but maybe the BSP MPP also relies on mpp_rkvdec2 DRIVER-side translations (`mpp_dev_set_reg_offset(cfg->dev, 167, hw_ctx->offset_count)` for example) that mainline doesn't replicate.
## Recommended next path forward
**Sonnet/Janet architect review** before further iteration. The 6 iterations of register-tuning have established:
- No bit-flip-level fix is sufficient
- HW is genuinely stuck on something structural we're missing
The review should consider:
- Pull mainline RKVDEC2 series WIP (Collabora) for VP9 hints
- Compare BSP `mpp_dev_set_reg_offset` usage to mainline `rkvdec_memcpy_toio` — does BSP do per-register address translation we're missing?
- Whether mainline-port-target should be the RKVDEC2 driver path (newer) instead of extending Casanova's vdpu381
- Add SRAM/cache init alongside the regs writes
## Tonight's persistent state
- **Branch**: `boltzmann:~/src/linux-rockchip:vp9-enablement-iter1` — 6 commits, 1517 LoC
- `47431635801d` regs header VP9 struct definitions
- `da8482271938` shared codec-spec helpers in rkvdec-vp9-common.h
- `71cc0d96d212` new vdpu381-vp9 backend
- `63c4db0095ca` Phase 2.1 register-packing
- `feaae8743450` iter1-6 first-light diagnostic state (this iteration)
- **Ampere module**: `/lib/modules/7.0.0-rc3-devices+/kernel/drivers/media/platform/rockchip/rkvdec/rockchip-vdec.ko` (iter6 = 6th first-light attempt). To revert to sibling-campaign close: `sudo cp ~/vp9-iter1-backup/rockchip-vdec.ko.sibling-campaign-close /lib/modules/$(uname -r)/kernel/drivers/media/platform/rockchip/rkvdec/rockchip-vdec.ko && sudo depmod -a && sudo modprobe -r rockchip-vdec && sudo modprobe rockchip-vdec` (HEVC bit-perfect restored).
- **Diagnostic instrumentation in kernel**: vdpu381_irq_handler logs first 10 IRQs; vdpu381_config_vp9_regs logs first 3 frames. Removable in iter7.
- **Off-device backup**: `fresnel:~/backups/ampere/vp9-iter1/rockchip-vdec.ko.sibling-campaign-close` md5 5f5cfba42750e8d70902eb1381017d92.
- **NOT touched**: legacy `rkvdec-vp9.c` (RK3399 path untouched), kernel patches from sibling campaign (still in place).
## Campaign summary so far
- Phase 0: closed (substrate, architectural correction)
- Phase 1: closed (3 plan revisions, Janet PROCEED)
- Phase 2: closed (compile-clean backend, 1390 LoC)
- Phase 2.1: closed (full register-packing translation)
- Phase 3 (this): **FAILED first-light**, 6 register-tuning iterations all stall HW
Total session work: ~10 hours (00:50 sibling-campaign close → 08:15 Phase 3 first-light failure). Substantial code artifact + comprehensive failure data; structural rethink required next.
## Honest assessment
We've translated BSP's MPP register layout to mainline vdpu381 register structs and applied every documented BSP setup step. The HW still refuses to decode. The remaining failure modes are structural: either the prob buffer needs a totally-different format we haven't reverse-engineered, OR the kernel-side init has gaps mainline never had to fill for HEVC. Both require architect-level rethinking rather than more iteration.