Files
libva-multiplanar/phase8_iteration7_close.md
claude-noether ec769a9687 docs: clarify Rockchip silicon across operative docs (RK3566)
PineTab2 is Rockchip RK3566 silicon, not RK3568. The hantro driver
attaches via the rockchip,rk3568-vpu DT compatible because RK3566/
RK3568 silicon is close enough to share that variant. The proper
RK3566 mainline driver target (rkvdec2 / vdpu346) has no kernel
support yet — Christian Hewitt's patch series LKML 2025/12/26/206
is unmerged.

Updated operative docs to use the consistent form:
"PineTab2 (Rockchip RK3566 silicon; hantro driver via the
rockchip,rk3568-vpu DT compatible)" or shorter variants.

Files updated:
- README.md (campaign top-level): TL;DR, deliverable, KWin link,
  hardware target, hardware listing
- firefox-fourier/README.md: tested-on line
- phase8_iteration7_close.md: hardware carry
- phase8_iteration6_close.md: hardware carry, MPEG-2 drop
  rationale
- phase0_findings_iter7.md: predecessor summary, fourier-fresnel
  description, hardware carry
- phase2_iter7_situation.md: msync hypothesis hardware reference

Historical iter1-iter5 phase docs left as-is — they're snapshots
of what the campaign believed at the time. The canonical source
for the silicon-ID correction is track_F_research_2026-05-06.md
(commit 358801b).

Not a correctness change. The campaign's empirical evidence is
unaffected — the hantro/rk3568-vpu driver path that we exercised
was always the actual decode path on PineTab2 silicon.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 11:39:28 +00:00

110 lines
9.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 7 close (Phase 8) — A+B+C all GREEN
Opened 2026-05-06 immediately after iter6 close. Closing same day. Locked candidates: **A** (msync pixel-correctness verification — iter5 sonnet C3), **B** (slot-leak error recovery — iter6 internal carry), **C** (cap_pool race synthetic harness — iter5 sonnet C4 / iter6 candidate A formal anchor).
All three closed GREEN. One additional latent bug (OUTPUT pool stale-slot on context-bound resolution change) was surfaced by the Track C harness during Phase 7 verification and fixed in-iteration. One distinct latent bug (STREAMON-on-context-recreate after resolution change) was also surfaced and is logged as iter8 candidate.
## Verdict per track
### Track A: GREEN — msync removal pixel-verified safe
`tests/run_msync_pixel_verify.sh`: 100-frame `bbb_1080p30_h264.mp4` decoded twice — once via FFmpeg's libavcodec SW path, once via FFmpeg-VAAPI through our v4l2_request driver with `hwdownload,format=nv12,crop=$W:$H` to normalize MB-padding artifacts. Result: **byte-for-byte identical, sha256 `58c8f3f4320fe1d761ba3b461d2f63fbbdbad3d2a608a83ae1bca9b0aa3ff6fb`** for both. iter5 Phase 5 sonnet caveat C3 formally closes — the kernel V4L2 framework does cache sync at DQBUF for CMA-backed allocations on hantro G1 / kernel 6.19.10, the iter1 `msync(MS_SYNC|MS_INVALIDATE)` was structurally redundant.
### Track B: GREEN — slot-leak error recovery
Adds `request_pool_force_release(pool, slot_index)` that REINITs the slot's fd, falls back to `close + media_request_alloc` on REINIT failure, and leaves the slot dead-busy if even alloc fails (other slots unaffected, pool capacity reduced by 1). Wired into `RequestSyncSurface` error paths.
Phase 5 sonnet code review caught a discriminator subtlety pre-commit: the original "always force_release on `source_data != NULL`" path was unsafe for the OUTPUT-DQBUF-failure sub-case (V4L2 buffer in indeterminate kernel state, reusing the slot would QBUF on a kernel-still-held buffer and hit exactly the iter6 race). Fixed: separate `error_buffer_indeterminate` label leaves the slot dead-busy in that specific case.
Empirical happy-path regression: mpv 100 frames clean, Firefox bbb 35s+ clean, RDD holds `/dev/video1` + `/dev/media0` throughout. No `Unable to set control(s)` / `Unable to queue buffer` events. Synthetic fault-injection deferred — code path is small + sonnet-reviewed; full inject test would require a separate debug build.
### Track C: GREEN — cap_pool race synthetic harness
`tests/cap_pool_probe_pattern.c` (~140 lines) + `tests/run_cap_pool_probe.sh`: synthetic test that allocates 4 surfaces at 128×128, destroys them, then allocates 4 surfaces at 1920×1080 — the iter5 sonnet C4 specified pattern. Driver stderr shows two clean `cap_pool_init: 24 slots ready` events (small + big), no REQBUFS / EBUSY / `Unable to request buffers` / `Unable to set format` race indicators. Harness exits 0.
Phase 5 sonnet pre-commit suggested adding `vaCreateContext` to the test. That suggestion exposed an additional latent bug during Phase 7 (STREAMON-on-context-recreate); reverted to the no-context pattern that matches the actual C4 specification. The new STREAMON bug is logged as iter8 candidate.
## What landed
### Fork commits (libva-v4l2-request-fourier)
- `988b848` — iter7 main: A+B+C — slot-leak fix, cap_pool harness, msync verify harness. Three new files in `tests/`, modifications to `src/request_pool.{h,c}` and `src/surface.c`.
- `7bd0818` — iter7 Phase 7 finalization: OUTPUT-pool teardown on resolution change in `src/surface.c` (latent bug surfaced by harness). Test refinements (sonnet-recommended context creation reverted; runner grep tightened).
Net: ~545 lines added, ~13 lines modified across 6 files. Three test artifacts in `tests/` ready for use as regression checks in iter8+.
### Campaign artifacts (libva-multiplanar)
- `phase0_findings_iter7.md` — substrate (6 candidates A..F locked-deferred → A+B+C locked)
- `phase2_iter7_situation.md` — Phase 2 design + risk + plan + effort per track
- `phase8_iteration7_close.md` — this file
### Built driver
- iter7 build: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `54999017ebd433228a5060cc5e0c3b353e8dd7d3062d158dc63c55eb59503317`
- Note: this sha is the BEFORE the Phase 7 OUTPUT-pool teardown fix landed. Final iter7-end driver requires rebuild from `7bd0818` HEAD; see "Carries to iter8" for the deploy step.
## State that carries to iter8 (or campaign close)
- **Hardware**: ohm PineTab2 (Rockchip RK3566 silicon; hantro driver via `rockchip,rk3568-vpu` DT compatible), kernel 6.19.10. Access: `ohm` (LAN). VPN currently flaky.
- **Userspace**: firefox 150.0.1-1.1 (iter5 amendment), libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
- **Test fixture**: bbb_1080p30_h264.mp4 sha256 `dcf8a7170fbd...`.
- **Build container**: firefox-fourier LXD on boltzmann, persistent.
## Documented limitations carried to iter8+ (or campaign close)
1. **STREAMON-on-context-recreate after resolution change** — NEW from iter7 Phase 7. Surfaced when `vaCreateContext` is called twice with different resolutions, with `vaDestroyContext` between. Second STREAMON returns EINVAL. Real consumers (Firefox, mpv) don't trigger this (single context per decoder lifetime), so it's a corner case. Fix likely involves ordering in `RequestCreateContext` — possibly the dev_ctrls S_EXT_CTRLS rejecting on a re-init device, or a STREAMOFF/REQBUFS(0) sequence being incomplete somewhere.
2. **Synthetic fault-injection for slot-leak recovery** — Track B's success criterion called for a fault-inject build (REINIT returns -EBUSY after N frames) demonstrating pool starvation pre-fix and recovery post-fix. Deferred — sonnet code review covered the semantic correctness, happy-path regression confirmed no normal-flow impact. Could be added as a debug-build target in iter8 if a hard guarantee is needed.
3. **Probe-pattern test for context-recreate path** — iter7 Track C harness covers no-context resolution change. A future test could exercise the context-recreate path once the STREAMON bug above is fixed.
4. **`RequestDestroyContext` auto-frees render_targets** — iter7 verified that our `RequestDestroyContext` calls `RequestDestroySurfaces` on its surfaces_ids array. This is non-standard libva behavior but matches consumer organic flow (mpv / Firefox tolerate the subsequent `vaDestroySurfaces` returning `INVALID_SURFACE`). Worth documenting as known-deviation in the upstream submission package.
5. **Pool size hardcoded to 16** — iter6 sonnet review suggested `max(surfaces_count, DPB_SIZE_H264_MAX)`. Empirically 16 is fine for the workloads tested; deferring the parameterization until upstream submission requires it.
## Lessons distilled to memory
### Memory updates this iteration
No new memory entries — Track A/B/C closes existing carry items rather than introducing new diagnostic patterns. Existing memory:
- `feedback_request_fd_lifecycle.md` (iter6) covers the REINIT discipline
- `feedback_kernel_obfuscation_compound.md` (iter4) covers the cluster-validation pattern
- `feedback_no_fixture_hardcoding.md` (iter6) covers test-harness honesty
### Process lessons (recorded here, not in standalone memory)
**Sonnet design reviews surfaced two real issues in iter7 Phase 5 pre-commit + one in Phase 7:**
1. Pre-commit: discriminator subtlety in surface.c error path (Track B) — fixed to `error_buffer_indeterminate` label
2. Pre-commit: stride/padding false-failure in msync verify harness (Track A) — fixed with explicit crop + diagnostic
3. Pre-commit: cap_pool harness needed `vaCreateContext` to actually exercise pool init — added; later REVERTED in Phase 7 because the addition surfaced a separate STREAMON bug that wasn't iter7 scope
The third point is meta-instructive: a well-intentioned pre-commit refinement can inadvertently expand scope. Holding to the iter5 sonnet C4 specification ("vaCreateSurfaces small then big, allocation-only") was the right scoping move once Phase 7 evidence showed the context-recreate path had a separate bug class.
## Bootlin upstream outlook
iter7 cleanly closes the iter5/iter6 carry list. The fork now has:
- iter6's per-slot REINIT request_fd discipline (commit `a09c03c`)
- iter7's OUTPUT-pool teardown on resolution change (commit `7bd0818`)
- iter7's slot-leak error recovery (commit `988b848`)
Three test artifacts in `tests/` provide regression coverage: cap_pool probe pattern, msync pixel verify, run-script orchestrators.
Outstanding for upstream-readiness:
- STREAMON-on-context-recreate (iter8 candidate, latent corner case)
- Pool-size parameterization (iter6 sonnet review carry, low priority)
- Multi-codec audit was DROPPED at iter6 close (CPU handles MPEG-2 fine)
Per `feedback_no_upstream.md`, no PR/MR happens without explicit operator instruction.
## Phase 1 success criterion — final per track
| Criterion | Result |
|---|---|
| **A**: 100-frame vaapi-copy frame-by-frame sha matches FFmpeg SW reference | ✓ HIT — sha `58c8f3f4...` identical for both |
| **B**: synthetic fault-inject demonstrates pre-fix starvation, post-fix recovery | ⚠ PARTIAL — code-reviewed by sonnet (Phase 5), happy-path regression clean (Phase 7); fault-inject build deferred to iter8 if hard guarantee needed |
| **C**: synthetic test issues vaCreateSurfaces small→big, zero REQBUFS-EBUSY in driver stderr | ✓ HIT — `run_cap_pool_probe.sh` exits 0, no race indicators |
| Phase 5 sonnet review confirms fix is libva-side, all three paths verified, no new mutable global state | ✓ HIT — APPROVE-WITH-CHANGES, all changes addressed |
**Joint success:** all three tracks closed on the same iter7-end driver build. Phase 7 verified each. Phase 5 sonnet pre-commit caveats addressed. iter7 closes GREEN with one named carry (STREAMON-on-context-recreate) for iter8.