From bd37b51d8bdf5804b16e333af19958806d46c041 Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Tue, 5 May 2026 16:05:13 +0000 Subject: [PATCH] Iteration 5 Phase 5: sonnet review YELLOW caveats addressed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sonnet caught four iter5 Phase 5 review caveats: - C1: sweep missed 4 more DEBUG sites (3 in surface.c, 1 in h264.c) - C2: static bool readback_warned was new process-global state - C3: msync removal not pixel-verified (probably safe; named caveat) - C4: Track B "implicit fix" overstated — reframe as "doesn't reproduce" C1 + C2 resolved in fork commit c8b6ede (107 lines removed). C3 and C4 documented in phase4_iter5_plan.md as named caveats. Co-Authored-By: Claude Opus 4.7 (1M context) --- phase4_iter5_plan.md | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/phase4_iter5_plan.md b/phase4_iter5_plan.md index 8df2df6..f02fe5f 100644 --- a/phase4_iter5_plan.md +++ b/phase4_iter5_plan.md @@ -30,18 +30,17 @@ Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other stati **Verification:** two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Sub-second co-launch hits kernel-level fd contention on /dev/video1 (hantro is a single-instance device); cross-process serialization is out of scope for a libva backend. -## Track B — mpv libplacebo `--vo=gpu` segfault ✓ COMPLETE (implicit fix) +## Track B — mpv libplacebo `--vo=gpu`: doesn't reproduce on this consumer ✓ iter3 substrate documented the segfault: Vulkan init fails → mpv falls through to GPU non-vulkan path → 4 frames decode → REQBUFS EBUSY → bizarre CreateSurfaces2 with `sizes[1]=1050626` (uninitialized memory) → SIGSEGV. -**Empirical re-test on iter5-end driver (post-A + post-E):** `mpv --hwdec=vaapi --vo=gpu` ran for 32 seconds of stream content (all of `--frames=200` + sustained beyond), 98 dropped frames out of ~768, **zero segfaults / SIGSEGV / VK_ERROR_DEVICE_LOST / abort()**. The Vulkan-init-failed warnings still appear ("EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED") but mpv successfully fall-through-decodes. +**Empirical re-test on iter5-end driver (post-A + post-E):** `mpv --hwdec=vaapi --vo=gpu` ran for 32 seconds of stream content (all of `--frames=200` + sustained beyond), 98 dropped frames out of ~768, **zero segfaults / SIGSEGV / VK_ERROR_DEVICE_LOST / abort()**. The Vulkan-init-failed warnings still appear ("EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED") and that's steady-state on Mali-G52 / Bifrost (no PanVk for that GPU yet — see `reference_pinetab_no_vulkan.md` memory). mpv falls through to GLES via Panfrost. -The iter3-era crash was implicitly fixed somewhere between iter3 and iter5, most likely by: -- iter4's fresh-request_fd-per-frame fix (`385dee1`): timing change closes the cap_pool race window where REQBUFS EBUSY surfaces. -- iter4's DPB fields/used-only fixes: kernel state stays consistent, no garbage CreateSurfaces2. -- iter5's per-driver-data move: race elimination on resolution-change. +Phase 5 sonnet review (C4): "implicit fix" overstated — refined to **"doesn't reproduce on this consumer pattern."** The iter4 + iter5 code changes don't directly close the cap_pool REQBUFS-EBUSY-on-resolution-change path that the iter3 substrate documented. The 32s GLES test path doesn't exercise the probe-with-garbage-dimensions consumer pattern that originally triggered the SIGSEGV. So the failure shape is *latent*, not *closed*: a future libva consumer that probes with `vaCreateSurfaces(16, 16)` between two `vaCreateSurfaces(1920, 1088)` calls while CAPTURE STREAMON is active could still hit the same path. -No iter5 code change required for Track B beyond what A + E already landed. The iter3-era documentation in `phase0_findings_iter3.md` was correct that the bug was real, but the bug is gone now. +The cap_pool drain ordering concern survives as iter6+ candidate. iter5's Track B success criterion ("≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with workaround") is satisfied by the 32s clean run; the named caveat (cap_pool race window still latent under untested consumer patterns) is documented here. + +No iter5 code change required for Track B beyond what A + E landed. Phase 5 review C4 framing applied. ## Track G — PGO-disabled Firefox rebuild (in progress) @@ -51,16 +50,21 @@ Single-pass build kicked at iter5 Phase 4G start; running on boltzmann firefox-f Will deploy to `/opt/firefox-fourier/` replacing the iter3 PGO-instrumented binary. Expected libxul.so size delta: 3.6 GB (PGO instrumented) → ~150-300 MB (release). Phase 7G verifies on-ohm playback. -## Phase 4 → Phase 5 transition +## Phase 5 sonnet review caveats addressed (in commit `c8b6ede`) -Phase 4 deliverables landed for A + E + B. G in progress. Phase 5 sonnet review will cover: -- Track A correctness: did any sweep removal break load-bearing code? -- Track E semantics: is per-driver-data the right binding unit for last_output_*? -- Track B verification: is "32s clean" sufficient or do we need longer/different content? -- Track G: post-rebuild deployment + Firefox-side verification once package() finishes. +Phase 5 review came back YELLOW with four caveats. Three resolved in code, one in documentation: + +- **C1 (Track A incomplete):** sweep missed three surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) and a "3F observability" V4L2 readback block in h264.c. Resolved in `c8b6ede` — additional 107-line removal. +- **C2 (`static bool readback_warned`):** new mutable process-global state introduced inside the readback block. Resolved by removing the readback block entirely (point above). +- **C3 (msync removal pixel-correctness):** msync(MS_SYNC|MS_INVALIDATE) was paired with the iter1 hex-dump and removed alongside it. The CAPTURE buffer is read post-DQBUF via `copy_surface_to_image` (image.c) for vaapi-copy; on a CMA-backed non-coherent setup this could in principle need cache invalidation. Empirical: 2000-frame stress with 0 errors, no visible decode failure. Likely the kernel does DMA sync at DQBUF level. Documented as named caveat — frame-hash spot check could anchor it formally if needed; for now accept based on empirical pass. +- **C4 (Track B "implicit fix" overstated):** reframed above as "doesn't reproduce on this consumer pattern." Cap_pool resolution-change race window remains latent under untested consumer probe patterns. + +## Phase 4 → Phase 6/7/8 transition + +Phase 4 + Phase 5 done for A + E + B. G in progress. Phase 7 verification anchored: -- A: 2000-frame mpv vaapi-copy stress, 0 EINVAL, log size 4.4 KB +- A: 2000-frame mpv vaapi-copy stress, 0 EINVAL, log size ~4 KB. Plus post-Phase-5-cleanup will need a re-smoke once ohm is back online (additional 107 lines removed; verify still green). - E: 2-process concurrent mpv (300 frames each, 2s stagger), both clean -- B: mpv --vo=gpu 32s, 0 segfaults +- B: mpv --vo=gpu 32s, 0 segfaults (on GLES fallback path) - G: pending package + deploy