diff --git a/phase1_plan.md b/phase1_plan.md new file mode 100644 index 0000000..f28b024 --- /dev/null +++ b/phase1_plan.md @@ -0,0 +1,101 @@ +# Phase 1 plan — VP9 wiring patch for vdpu381 (RK3588) + +Date: 2026-05-17 ~01:15. Phase 0 closed `8dce724` with the architectural correction that VP9 is a wiring patch, not a backend port. + +## Phase 1 substrate resolutions (open questions from Phase 0) + +| Q | Question | Resolution | +|---|---|---| +| O1 | Translate-table fits 0x400 MMIO? | Yes. `rkvdec_v2_trans[RKVDEC_FMT_VP9D]` offsets 128..232 × 4 = 0x200..0x3A0 < 0x400. Confirmed in `mpp_rkvdec2.c` BSP. | +| O2 | Codec-aware IRQ split needed? | **NO**. BSP `rkvdec2_irq` (used for all v2-family devices including RK3588) reads ONE register — `RKVDEC_REG_INT_EN = 0x380`. Mainline `vdpu381_irq_handler` reads the **same offset** under the name `VDPU381_REG_STA_INT = 0x380`. Same hardware register; same handler works for HEVC, H.264, and VP9. | +| O3 | RK3588-specific clock/reset requirements for VP9 beyond HEVC? | None observed. BSP `rkvdec_rk3588_hw_ops` uses generic `rkvdec2_init`, `rkvdec2_clk_on/off`, `rkvdec2_sip_reset` — codec-agnostic. Mainline `vdpu381_variant` already exposes the full RK3588 clock set (`aclk_vcodec`, `hclk_vcodec`, `clk_core`, `clk_cabac`, `clk_hevc_cabac`). | +| O4 | Legacy `rkvdec_vp9_start`/`stop` against RK3588 IOMMU? | Should work transparently — vb2_dma_contig handles IOMMU. Verify at first decode attempt. | +| O5 | RCB / SRAM — needed for VP9 on RK3588? | **Yes** — BSP MPP `vp9d_vdpu34x.c` allocates `rcb_buf`/`g_buf[i].rcb_buf`. Mainline `rkvdec_start_streaming` already calls `rkvdec_allocate_rcb(ctx, variant->rcb_sizes, variant->num_rcb_sizes)` **per-variant**, so a VP9 fmt added to `vdpu381_coded_fmts[]` will automatically inherit `vdpu381_rcb_sizes[]`. RCB sizes are HEVC-flavoured but the buffer-purpose names (intra, transd, stream, inter, dblk, sao, fbc, filt-col) are codec-generic primitives; HEVC sizing should oversize-but-work for VP9 (4096-max width is smaller than HEVC 8K max). | +| O6 | Validate via Fluster `VP9-TEST-VECTORS` post-Phase-3 | Deferred; first pass uses ffmpeg-vaapi + byte-compare. | +| O7 | Watch for Collabora upstream VP9 series | Monitor lore + Collabora gitlab weekly. | +| R3 | RK3588 VP9 max-resolution | 4096×N. BSP `mpp/hal/rkdec/vp9d/hal_vp9d_vdpu34x.c:284` clamps at `width > 4096`. Mirror RK3399's `{4096, 2304, step 64}` frmsize in the new fmt entry. | + +## The wiring patch + +Target file: `drivers/media/platform/rockchip/rkvdec/rkvdec.c`. + +### Change 1: add `vdpu38x_vp9_ctrls` ctrl-set descriptor + +Insert near `vdpu38x_h264_ctrls` (~line 397): + +```c +static const struct rkvdec_ctrls vdpu38x_vp9_ctrls = { + .ctrls = rkvdec_vp9_ctrl_descs, + .num_ctrls = ARRAY_SIZE(rkvdec_vp9_ctrl_descs), +}; +``` + +Reuses the existing `rkvdec_vp9_ctrl_descs[]` (line 419). V4L2 controls are HW-agnostic so no duplication needed; only a fresh `rkvdec_ctrls` wrapper. + +### Change 2: add VP9 entry to `vdpu381_coded_fmts[]` + +Append after the H264 entry (~line 549): + +```c + { + .fourcc = V4L2_PIX_FMT_VP9_FRAME, + .frmsize = { + .min_width = 64, + .max_width = 4096, + .step_width = 64, + .min_height = 64, + .max_height = 2304, + .step_height = 64, + }, + .ctrls = &vdpu38x_vp9_ctrls, + .ops = &rkvdec_vp9_fmt_ops, + .num_decoded_fmts = ARRAY_SIZE(rkvdec_vp9_decoded_fmts), + .decoded_fmts = rkvdec_vp9_decoded_fmts, + }, +``` + +Reuses existing `rkvdec_vp9_fmt_ops` (legacy vdpu34x register layout, written for RK3399) and `rkvdec_vp9_decoded_fmts` (NV12_4L4 + NV12). No `subsystem_flags = VB2_V4L2_FL_SUPPORTS_M2M_HOLD_CAPTURE_BUF` because VP9 frames are self-contained (no slice fragmentation). + +### Change 3 (conditional): instrumentation only if first decode silent-fails + +Add a one-shot DIAG `pr_warn` at the end of `rkvdec_vp9_run` if Phase 6 verification shows the IRQ never fires or fires with unexpected status. Otherwise the patch is **two structural additions and zero new logic**. + +## Test plan + +| Step | Action | Pass criterion | +|---|---|---| +| 1 | Build `drivers/media/platform/rockchip/rkvdec/rockchip-vdec.ko` on boltzmann (already has tree + toolchain) | Builds clean | +| 2 | Backup current `7.0.0-rc3-devices+/kernel/drivers/media/platform/rockchip/rkvdec/rockchip-vdec.ko` on ampere ([feedback_backup_before_module_replace](../../.claude/projects/-home-mfritsche-src-fresnel-fourier/memory/feedback_backup_before_module_replace.md)) | Backup md5'd + off-device copy on fresnel | +| 3 | Install + `modprobe -r rockchip-vdec; modprobe rockchip-vdec` (no reboot needed) | `dmesg \| grep rkvdec` shows VP9 fmt enumerated. `v4l2-ctl --list-formats-out --device=/dev/video0` shows `VP9F` | +| 4 | Decode test bitstream: `LIBVA_DRIVER_NAME=v4l2_request ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i -vf hwdownload,format=nv12 -frames:v 100 -f rawvideo -pix_fmt nv12 /tmp/hw-vp9.nv12` | rc=0, output size = 100 × W × H × 1.5 bytes | +| 5 | SW reference: `ffmpeg -i -vf format=nv12,select='eq(n,100)' -frames:v 1 -f rawvideo -pix_fmt nv12 /tmp/sw-vp9-f100.nv12` | rc=0 | +| 6 | Byte-compare frame 100 (per [feedback_compare_hw_against_sw_reference](../../.claude/projects/-home-mfritsche-src-fresnel-fourier/memory/feedback_compare_hw_against_sw_reference.md)) | ≥ 99% exact match | +| 7 | Iteratively widen test vectors (different resolutions, frame-parallel decoding, super-frame, segmentation enabled) | Each VP9 feature within `vdpu381` capability passes | + +## Risk register update + +| # | Risk | Mitigation | +|---|---|---| +| R1 | Same physical IP, two register dialects — hidden mode-switch register? | BSP `rkvdec_rk3588_hw_ops.init` = `rkvdec2_init` (generic). No mode switch. **Resolved**. | +| R2 | `libva-v4l2-request-fourier` lacks rkvdec VP9 dispatch | Phase 7 task: add VP9 fmt to backend's `vendor_fourcc_to_driver_fourcc` switch. Sibling iter33 VP8 pattern. Independent of kernel patch. | +| R3 | RK3588 VP9 max-resolution differs from RK3399's 4096×2304 | 4096-max confirmed; height limit unchecked. Open empirically; cap conservatively at 2304 first. | +| R4 | `dirac` RK3399 cross-test fixture status | NOT regression-impacted — patch only ADDS to `vdpu381_coded_fmts[]`, doesn't touch `rk3399_coded_fmts[]` or `rkvdec-vp9.c`. No risk to RK3399 path. **Resolved**. | +| R5 | Casanova posts upstream VP9 series mid-effort → fork divergence | Monitor; coordinate if posted. | +| R6 (new) | RCB sizing oversizes for VP9 (HEVC sized for 8K, VP9 caps at 4K) — wastes SRAM but should still work | First attempt uses HEVC sizes. If sub-optimal, follow-up patch adds VP9-specific RCB descriptor table. Not a correctness issue. | + +## Sequence + +Phase 1: this plan, then Sonnet Phase 5 architect review. +Phase 2: implement Change 1 + 2 in `boltzmann:~/src/linux-rockchip` `linux-rk3588-marfrit` branch. +Phase 3: build module on boltzmann via distcc. +Phase 4: install on ampere + `modprobe -r/modprobe` cycle + sanity v4l2-ctl enumeration. +Phase 5: SW review of code + Sonnet checkpoint. +Phase 6: decode test + byte-compare verification. +Phase 7: persistence — commit to `linux-rk3588-marfrit`, prepare upstream patch via b4, route through kernel-agent. +Phase 8: monitor + extend test vectors; backend (libva) update tracked separately. + +## Open questions remaining for Phase 1 review + +- Q-P1.1: Does mainline `vdpu381_irq_handler` correctly handle the case where the IRQ status doesn't set `VDPU381_STA_INT_DEC_RDY_STA` (BIT 2) on a successful VP9 decode? i.e. does VP9 set a *different* status bit than HEVC? Mitigation: add diagnostic `pr_warn` for first 5 IRQs. +- Q-P1.2: Does `rkvdec_vp9_start` use `dma_alloc_*` that conflicts with rkvdec0's IOMMU mode? Likely no; vb2 handles this. +- Q-P1.3: VP9 needs a `done` callback (`rkvdec_vp9_done` populates probability update for adaptive entropy). Verify the existing `done` runs in the correct workqueue context after IRQ — should "just work" since mainline has the dispatch. diff --git a/phase1_plan_v2.md b/phase1_plan_v2.md new file mode 100644 index 0000000..c10a87d --- /dev/null +++ b/phase1_plan_v2.md @@ -0,0 +1,121 @@ +# Phase 1 plan v2 — VP9 backend port to vdpu381 (post-Janet-review) + +Date: 2026-05-17 ~01:35. Phase 1 v1 was AMENDED by Janet architect review (verdict: AMEND, two BLOCKERs + one AMENDMENT). + +## Why v2 — the Phase-0 wiring-patch thesis is dead + +Janet's review surfaced two source-code blockers that overturn the "VP9 is < 100 lines wiring" claim from Phase 0/Phase 1 v1: + +### Blocker A: `vdpu381_variant.num_regs` is unset + +`rkvdec.c:1707-1715` defines `vdpu381_variant` with no `.num_regs` field — defaults to 0. The legacy VP9 path at `rkvdec-vp9.c:660-661` writes the register file via: + +```c +rkvdec_memcpy_toio(rkvdec->regs, regs, + MIN(sizeof(*regs), sizeof(u32) * rkvdec->variant->num_regs)); +``` + +`MIN(sizeof(*regs), 0) = 0`. **Zero bytes hit MMIO.** If we naively added VP9 to `vdpu381_coded_fmts[]` pointing at `rkvdec_vp9_fmt_ops`, the hardware would never be configured. The watchdog would fire 2s later with no IRQ. + +### Blocker B: vdpu381 register layout is SEGMENTED, legacy VP9 is FLAT + +Legacy `rkvdec_vp9.c:660` writes the entire register file at flat `rkvdec->regs + 0`. vdpu381 paths (`rkvdec-vdpu381-hevc.c`, `rkvdec-vdpu381-h264.c`) use **five distinct register-block offsets** within the MMIO: + +```c +#define OFFSET_COMMON_REGS (8 * sizeof(u32)) // 0x020 +#define OFFSET_CODEC_PARAMS_REGS (64 * sizeof(u32)) // 0x100 +#define OFFSET_COMMON_ADDR_REGS (128 * sizeof(u32)) // 0x200 +#define OFFSET_CODEC_ADDR_REGS (160 * sizeof(u32)) // 0x280 +#define OFFSET_POC_HIGHBIT_REGS (200 * sizeof(u32)) // 0x320 +``` + +Plus the decode trigger at `VDPU381_REG_DEC_E = 0x028` (NOT `RKVDEC_REG_INTERRUPT = 0x004` like legacy). A flat memcpy at offset 0 lands in the wrong fields — would overwrite control/status registers and never hit the codec-params block where the bitstream config goes. + +**Confirmed via BSP `mpp_rkvdec2.c:467 rkvdec2_run`**: writes are dispatched via `mpp_write_req` iterating over `task->w_reqs[i].offset / sizeof(u32)` — segmented per request, matching mainline's vdpu381 model. **Confirmed via BSP `mpp_rkvdec2.c:107 trans_tbl_vp9d[]`**: VP9's DMA-translated register indices are 128..199 (offset 0x200..0x320 = `OFFSET_COMMON_ADDR_REGS` through `OFFSET_POC_HIGHBIT_REGS`) — IDENTICAL to HEVC's range (also 128..199). VP9 uses the same segmented register layout as HEVC on this hardware. + +### Amendment C: IRQ status bit semantics + +BSP `rkvdec2_irq` checks `RKVDEC_IRQ_RAW = BIT(1)`. Mainline `vdpu381_irq_handler` checks `VDPU381_STA_INT_DEC_RDY_STA = BIT(2)`. Same register (0x380), different bits. If VP9 sets BIT(1) instead of BIT(2) on success, every IRQ gets misclassified as ERROR and `rkvdec_iommu_restore` fires. Mitigation: instrument the IRQ handler with `pr_warn(status)` for the first 5 decodes pre-flash, compare to expected `BIT(2) | BIT(1)` mask. + +## Revised architecture + +VP9 on RK3588's vdpu381 IP needs a **new backend file** `rkvdec-vdpu381-vp9.c`, modelled structurally on `rkvdec-vdpu381-hevc.c` (639 lines) but borrowing the SPEC LAYER from legacy `rkvdec-vp9.c`. Estimated 500-900 lines new code. + +### File breakdown + +| File | Action | Lines | +|---|---|---| +| `rkvdec-vdpu381-vp9.c` | NEW: vdpu381 register-block packing + run/start/stop/done ops | ~600-800 | +| `rkvdec-vp9-common.{c,h}` | NEW: extract from legacy `rkvdec-vp9.c` the codec-spec layer (probability tables, frame_ctx state machine, segmap mgmt, V4L2 vp9 helpers, `rkvdec_vp9_priv_tbl`, `rkvdec_vp9_start/stop` body) | ~600 (mostly moved from legacy) | +| `rkvdec-vp9.c` | REFACTOR: replace flat register-dump body with thin shim that calls common helpers; keep the legacy `rkvdec_vp9_fmt_ops` ABI intact for RK3399 | ~400 (down from 1042) | +| `rkvdec-vdpu381-regs.h` | EXTEND: add VP9 register-struct definitions (`struct rkvdec_vp9_common_regs`, `_codec_params_regs`, `_addr_regs`) | +200 | +| `rkvdec.c` | wire: add `vdpu38x_vp9_ctrls`, `rkvdec_vdpu381_vp9_fmt_ops` extern, add to `vdpu381_coded_fmts[]`, set `vdpu381_variant.num_regs` if any legacy code still consults it | +30 | + +### Reuse boundaries + +REUSED FROM LEGACY (in common file): +- VP9 probability tables (`vp9_default_*_probs`) +- `struct rkvdec_vp9_run` augmenting `rkvdec_run` +- `rkvdec_vp9_priv_tbl` (probe + count buffers, segmap) +- `init_v4l2_vp9_count_tbl`, `rkvdec_vp9_done` (entropy update from HW count buffer) +- `rkvdec_vp9_adjust_fmt` (frame-size clamping) +- `rkvdec_vp9_start` / `rkvdec_vp9_stop` (DMA alloc for probe/count/segmap) +- V4L2 vp9 helpers from `` + +REWRITTEN FOR VDPU381 (in new backend file): +- `rkvdec_vdpu381_vp9_run` — register packing into the 5 segmented blocks, trigger at `VDPU381_REG_DEC_E = 0x028` +- Register struct definitions (bit-packed `swreg_*` fields per Rockchip BSP MPP `Vdpu382Vp9dCtx_t` pattern from `mpp/hal/rkdec/vp9d/hal_vp9d_vdpu382.c`) + +REUSED FROM EXISTING VDPU381 INFRASTRUCTURE: +- `vdpu381_irq_handler` (single handler, codec-agnostic — confirmed via BSP `rkvdec2_irq`) +- `vdpu381_rcb_sizes[]` (HEVC-tuned, oversized for 4K VP9 but harmless — sub-optimal SRAM use, not a correctness issue) +- IOMMU (`rkvdec0_mmu`) — `dma_alloc_coherent` works transparently through the IOMMU +- V4L2 control descriptors (`rkvdec_vp9_ctrl_descs[]` line 419) — HW-agnostic + +## Register-struct sourcing + +The bit-packed `struct rkvdec_vp9_common_regs` etc. will be derived from Rockchip BSP MPP `mpp/hal/rkdec/vp9d/hal_vp9d_vdpu382.c` (vdpu382 = sibling IP of vdpu381). Pattern: each `swreg_NNN` becomes a u32 field; bit-packed sub-fields use C bitfield syntax exactly like `struct rkvdec_hevc_sps` in `rkvdec-vdpu381-hevc.c`. + +Concrete starting point: `vp9_hw_regs->common.reg028.swreg_vp9_rd_prob_idx` (BSP line 595) shows the register at common-block offset 0x028 has a `vp9_rd_prob_idx` bit-field. Inventory all such `.common.regNNN` accesses in `hal_vp9d_vdpu382.c` → reproduce as a single `struct rkvdec_vp9_common_regs` in `rkvdec-vdpu381-regs.h`. + +## Test plan (unchanged from v1, plus a new pre-flight) + +Add a Pre-Phase-3 step: with the NEW backend compiled, instrument `vdpu381_irq_handler` to `pr_warn` the status register value for the first 10 IRQs after VP9 enable, to resolve Amendment C empirically (do VP9 IRQs set BIT(2)?). One module build cycle; takes ~5 min added to Phase 3. + +## Test fixture + +- **Test bitstream**: bbb_30s_720p.vp9.webm — need to generate on fresnel via `ffmpeg -i bbb-1080p.mkv -c:v libvpx-vp9 -b:v 1500k -t 30 -vf scale=1280:720 -y bbb_30s_720p.vp9.webm`. Use 720p so it's well under the 4096 max-width. +- **SW reference**: `ffmpeg -c:v vp9` on x86 fresnel — known good. +- **Verification**: SHA the raw NV12 byte stream, not PNG ([feedback_pixel_compare_in_yuv_not_png](../../.claude/projects/-home-mfritsche-src-fresnel-fourier/memory/feedback_pixel_compare_in_yuv_not_png.md)). + +## Updated phase sequence + +1. **Phase 1 v2** (this doc) → Sonnet/Janet re-review for sign-off. +2. **Phase 2** — implement the 3 new files + edits. Boltzmann tree, `linux-rk3588-marfrit` branch, sibling commits to make review easier. +3. **Phase 3** — build module (~5 min via distcc), install on ampere with backup ([feedback_backup_before_module_replace](../../.claude/projects/-home-mfritsche-src-fresnel-fourier/memory/feedback_backup_before_module_replace.md)), `modprobe -r/modprobe`. **Pre-flight**: enable IRQ diagnostic `pr_warn` for first 10 statuses. +4. **Phase 4** — decode test bbb_30s_720p.vp9.webm at frame 100, byte-compare against SW reference. +5. **Phase 5** — Sonnet code review of the actual implementation diff. +6. **Phase 6** — verification iteration; widen test vectors (resolutions, segmentation, super-frames). +7. **Phase 7** — strip diagnostic `pr_warn`, finalise commits, prepare upstream patch series via `b4`, route via kernel-agent (issues #16+). +8. **Phase 8** — backend (`libva-v4l2-request-fourier`) VP9 dispatch path added (sibling, iter33 VP8 pattern). Fluster `VP9-TEST-VECTORS` validation. + +## Risk register v2 + +| # | Risk | Status | +|---|---|---| +| R1 | Same physical IP, two register dialects — hidden mode-switch register? | RESOLVED: BSP `rkvdec_rk3588_hw_ops.init` = `rkvdec2_init` (codec-generic); no mode switch. | +| R2 | libva backend lacks rkvdec VP9 dispatch | Phase 8, independent of kernel patch | +| R3 | RK3588 VP9 max-resolution | 4096-max confirmed; cap entry at 4096×2304 conservatively | +| R4 | dirac RK3399 regression | Mitigated by keeping legacy `rkvdec_vp9_fmt_ops` ABI intact via the common-file split; legacy path unchanged at the ops level. Refactor risk = the common-file extraction itself. Cross-test required before commit | +| R5 | Casanova upstream VP9 series mid-effort | Monitor weekly | +| R6 | RCB sizing oversizes for VP9 (HEVC sized for 8K) | Resolved as PROCEED-with-note (sub-optimal SRAM, not a correctness issue) | +| R7 (new) | Bit-packed register struct may not match HW exactly — endianness, bit-ordering, packing | Use exact BSP layout from MPP `hal_vp9d_vdpu382.c`; instrument `pr_warn` writes for first decode to verify expected register values | +| R8 (new) | `rkvdec_vp9.c` refactor extracts code that legacy path still needs as inlined — breaks RK3399 build | Compile-test rk3399 defconfig after the refactor; before VP9-on-vdpu381 work begins | + +## Open questions for v2 review + +- Q-P1v2.1: Should we extract `rkvdec-vp9-common.c` upfront as a clean separate patch (easier review, but no immediate user benefit), or inline the reuse via `#include` directives inside `rkvdec-vdpu381-vp9.c` (less clean, but a single self-contained patch)? +- Q-P1v2.2: Can the new file use `static inline` helpers in a shared header to avoid the common-file refactor entirely? +- Q-P1v2.3: Is there value in cross-validating against the BSP MPP `Vdpu382Vp9dCtx_t` semi-public ABI before declaring our register struct definitive? + +Sending to Janet for the AMEND-cycle re-review.