diff --git a/phase0_findings_iter9.md b/phase0_findings_iter9.md
new file mode 100644
index 0000000..66c7ac8
--- /dev/null
+++ b/phase0_findings_iter9.md
@@ -0,0 +1,152 @@
+# Iteration 9 — Phase 0 (substrate / motivation / inventory) → Phase 1 lock
+
+Opens 2026-05-13 immediately after iter8 PARTIAL close ([`phase8_iteration8_close.md`](phase8_iteration8_close.md), commit `3ed1e45`). User confirmed proceeding to iter9.
+
+## Empirical surprise from iter9 Phase 0 deep-strace
+
+I dumped the FULL 560-byte DECODE_PARAMS for the first P-frame of libva vs kdirect with `strace -s 8192`:
+
+| Region | libva | kdirect | Diff |
+|---|---|---|---|
+| DPB[0] bytes 0-7 (reference_ts, little-endian) | `30c2ea5cd622af18` (giant gettimeofday ns) | `f82a000000000000` (0x2af8 = 11000 ns) | **DIFF** |
+| DPB[0] bytes 8-31 (pic_num/frame_num/fields/tfoc/bfoc/flags) | `00 00 00 00 00 00 03 00 ... 00 00 01 00 00 00 01 00 03 00 00 00` | identical | match |
+| DPB[1..15] | all zero | all zero | match |
+| Post-DPB bytes 512-559 | `01 00 01 00 04 00 01 00 ...` | identical | **match** |
+
+**The ONLY wire-byte diff in DECODE_PARAMS is `reference_ts` magnitude.** Post-DPB fields, DPB entry contents (after α-2 fixed POC), pic_num, frame_num, fields, flags — all byte-identical between libva and kdirect.
+
+Combined with iter8's other findings:
+- SPS: identical except `constraint_set_flags` (rkvdec ignores per Phase 5b CRIT-1)
+- PPS: identical (verified in this Phase 0)
+- SCALING_MATRIX: identical (always-flat default)
+
+**This leaves `reference_ts` (and adjacent OUTPUT QBUF timestamp) as the single remaining wire-byte hypothesis.**
+
+## Locked research question (iter9, 2026-05-13)
+
+> *"Does replacing libva's gettimeofday-based timestamp scheme with a monotonic per-context counter (matching kdirect's small-value pattern) unblock libva's H.264 decode on rkvdec? After fix: `libva_h264.yuv == kdirect_h264.yuv` byte-identical."*
+
+### Pass/fail (boolean)
+
+1. **H.264 libva == kdirect**: `cmp -s libva_h264.yuv kdirect_h264.yuv` returns 0. Hash matches `1e7a0bc9…`.
+2. **VP9 unchanged**: `4f1565e8…`.
+3. **MPEG-2 unchanged**: `19eefbf4…`.
+4. **HEVC unchanged**: `06b2c5a0…` (Bug 5 still deferred).
+5. **VP8 unchanged**: `bcc57ed5…` (Bug 6 still deferred).
+6. **Control-payload anchors hold for 4 non-H.264 codecs**.
+
+Clean iter9 close = all 6 PASS. If criterion 1 still fails, iter9 PARTIAL close with timestamp eliminated — at which point the search space for Bug 4 is effectively exhausted on the libva wire-payload side, and iter10+ would shift to slice-data encoding or kernel-internal investigation.
+
+## Substrate state at iter9 open
+
+| Property | Value |
+|---|---|
+| Kernel | `linux-fresnel-fourier 7.0-1` (unchanged) |
+| Fork tip | `0226684` (iter8 close: γ + IMP-1 + α-2 POC strip removal) |
+| Backend installed | `b6a3958a5bca945164262339dea5cc28f17accce13d57bd9f0c5a5dabbdf1b53` |
+| Test fixtures | unchanged |
+| Bug 4 narrowing | 5 eliminations (libva-readback, slot-binding, stale-residue, constraint_set_flags, POC sentinel) |
+| Bug 4 remaining wire diff | reference_ts magnitude (giant vs small) |
+
+## Mechanism the question targets
+
+`picture.c::RequestEndPicture` line 440:
+```c
+gettimeofday(&surface_object->timestamp, NULL);
+```
+
+This produces a real-clock timeval (~1.78 × 10^9 sec = 1.78 × 10^18 ns) as the OUTPUT QBUF timestamp. The kernel stores this on the CAPTURE buffer via `V4L2_BUF_FLAG_TIMESTAMP_COPY` after decode.
+
+`h264.c::h264_fill_dpb` lines 268-269:
+```c
+timestamp = v4l2_timeval_to_ns(&surface->timestamp);
+dpb->reference_ts = timestamp;
+```
+
+This sends the same giant ns value as `reference_ts` for inter-frame references.
+
+In principle the kernel's `vb2_find_buffer_by_timestamp` does an exact 64-bit ns match and should not care about magnitude. But empirically, libva fails and kdirect (which uses ffmpeg-v4l2request's internal counter generating tiny ns values like `0x2af8 = 11000`) succeeds.
+
+Possible mechanisms:
+- **M-A**: Kernel `vb2_find_buffer_by_timestamp` has a comparison that fails for very large values (e.g., overflow on a u32 truncation, or signed comparison treating bit 63 as negative).
+- **M-B**: The OUTPUT QBUF's timestamp gets truncated/transformed by the V4L2 framework before being stored on the CAPTURE buffer, but the DPB.reference_ts is left at full resolution. The kernel then compares full-resolution reference_ts against truncated stored ts — never matches.
+- **M-C**: `gettimeofday` and `v4l2_timeval_to_ns` produce slightly different ns values (e.g., due to a re-read or rounding), making OUTPUT QBUF's ts and DPB.reference_ts not byte-equal.
+- **M-D**: Some other reason a small counter works but a giant one doesn't.
+
+## Fix shape
+
+### α-7: monotonic per-context counter
+
+Replace `gettimeofday(&surface_object->timestamp, NULL)` with a counter scheme that produces small ns values matching kdirect's pattern.
+
+Simplest implementation:
+- Add `u64 timestamp_counter` to `request_data` (init at 0 in CreateContext).
+- In `RequestEndPicture`, increment the counter and set `surface->timestamp` from it.
+
+Code shape:
+```c
+driver_data->timestamp_counter += 1000; /* counter is in µs: +1000 = 1 ms per frame, still tiny next to wallclock */
+surface_object->timestamp.tv_sec = driver_data->timestamp_counter / 1000000;
+surface_object->timestamp.tv_usec = driver_data->timestamp_counter % 1000000;
+```
+
+Or even simpler — just increment by 1 each frame, giving counter values 1, 2, 3, ... (1000, 2000, 3000 ns after `v4l2_timeval_to_ns`).
+
+LOC estimate: ~10 LOC.
+
+### Risk
+
+- **R-1**: Timestamp uniqueness — if the counter wraps in some long-lived process, ts uniqueness fails. For a campaign verifier (3 frames), counter wraps are impossible. For production playback, at 1 ms of timestamp per frame the ns value fits in a u64 for roughly 1.8 × 10^13 frames, so wrap is not a practical concern.
+- **R-2**: VP9 / VP8 / HEVC / MPEG-2 regression. Timestamp is shared infrastructure; all codecs use this same gettimeofday path. The change is uniform across codecs, so all 5 codecs get the new counter. VP9/MPEG-2 currently PASS direct via libva; switching from gettimeofday to the counter should be neutral at worst, and possibly beneficial.
+- **R-3**: Some consumer (ffmpeg outside the decoder path) reads the surface timestamp as a real wallclock. Probably not: VAAPI surfaces don't expose timestamps to consumers.
+
+## In scope
+
+- `src/request.h` — add `timestamp_counter` to driver_data.
+- `src/request.c` or `src/context.c` — init counter to 0 in CreateContext.
+- `src/picture.c::RequestEndPicture` — replace `gettimeofday(&surface->timestamp, NULL)` with counter-based assignment.
+
+## Out of scope
+
+- Per-codec changes (this is a shared-infrastructure change).
+- Kernel patches.
+
+## Phase 2 source-read targets
+
+Already done in Phase 0 strace dump. Phase 2 will be brief — just confirm the gettimeofday call site is unique and that no other code reads `surface->timestamp` as a real wallclock.
+
+## Phase 3 baseline
+
+Already captured in iter8 Phase 3 + Phase 7c regression sweep. iter9 Phase 3 may just re-anchor.
+
+## Phase 4 plan shape (predicted)
+
+Already implicit in α-7. Will be drafted explicitly in Phase 4.
+
+## Phase 5 review concerns to invite
+
+- Reviewer should challenge `M-B` plausibility: does the V4L2 framework really transform OUTPUT QBUF timestamp before storing it on CAPTURE? Check `drivers/media/v4l2-core/v4l2-dev.c` or `videobuf2-v4l2.c::vb2_buffer_done`.
+- Reviewer should verify the monotonic-counter approach doesn't break MPEG-2 (which works via libva on hantro — different driver) or VP9 (works via libva on rkvdec).
+- Per `feedback_wire_vs_behavior.md`: don't claim α-7 success on wire-byte change alone; criterion-1 hash test required.
+
+## Predicted iter9 cadence
+
+- Phase 0: this doc.
+- Phase 4: 15 min.
+- Phase 5: 30 min.
+- Phase 6: 15 min.
+- Phase 7: 15 min.
+- Phase 8: 10 min.
+
+Total: ~90 min. Quick iteration.
+
+## What "iter9 PASS" looks like
+
+If α-7 closes Bug 4:
+- iter9 PASS.
+- Bug 4 closed. H.264 row goes from PARTIAL to PASS direct.
+- Memory rule worth recording: V4L2 stateless decoders may require small/relative reference_ts values; gettimeofday is unsafe. See [[feedback_v4l2_timestamp_scheme]].
+
+If α-7 doesn't close:
+- iter9 PARTIAL. iter10 candidates: slice-data encoding (DEEP), DPB entry ordering (cosmetic, kernel sorts internally anyway), or pivot to a different bug (Bug 5 HEVC).
+- Realistically, iter10 may shift entirely to a kernel-side or fresh investigation since the wire-byte search space is now exhausted.
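
As a rough illustration of the α-7 counter scheme described under "Fix shape", the sketch below shows how a per-context counter could be turned into a timeval and then into the ns value that ends up in `reference_ts`. It is a minimal sketch, not the fork's code: `struct ts_context`, `next_timestamp`, and `next_reference_ts` are hypothetical names, and `timeval_to_ns` is a local stand-in that mirrors the arithmetic of the kernel UAPI helper `v4l2_timeval_to_ns` so the snippet builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for the per-context state; only the counter matters here. */
struct ts_context {
    uint64_t timestamp_counter; /* in microseconds, starts at 0 */
};

/* Local mirror of v4l2_timeval_to_ns(): seconds and microseconds to nanoseconds. */
uint64_t timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}

/* Advance the counter and fill tv using the same arithmetic as the "Code shape" snippet. */
void next_timestamp(struct ts_context *ctx, struct timeval *tv)
{
    ctx->timestamp_counter += 1000;
    tv->tv_sec = (time_t)(ctx->timestamp_counter / 1000000);
    tv->tv_usec = (suseconds_t)(ctx->timestamp_counter % 1000000);
    /* A valid timeval keeps tv_usec within one second. */
    assert(tv->tv_usec >= 0 && tv->tv_usec < 1000000);
}

/* The ns value that would be queued on OUTPUT and later written to reference_ts. */
uint64_t next_reference_ts(struct ts_context *ctx)
{
    struct timeval tv;

    next_timestamp(ctx, &tv);
    return timeval_to_ns(&tv);
}
```

Because the divisors treat the counter as microseconds, three consecutive calls to `next_reference_ts` on a zero-initialised context return 1000000, 2000000, and 3000000 ns. The values are small and unique, and both the OUTPUT buffer timestamp and `reference_ts` would derive from the same stored timeval, so they stay byte-equal.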