From a2d5210ee9a8fa4d1b9bb658b9854d76580f1336 Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Mon, 4 May 2026 13:56:08 +0000 Subject: [PATCH] Phase 5 review (sonnet): substrate sound, parser correct, surface-export attack plan + 6 unsurfaced findings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Plan subagent with model: sonnet, open-consultation pattern. Raw verbatim review text — operator transcribes to DokuWiki. Headlines: - Phase 0 substrate: process not defective, but "end-to-end" framing conflated libva engagement with pixel output. New invariant for V4L2 stateless work: pixel-content check via cache-invalidating path is mandatory Phase 0 deliverable. - Phase 4 plan: correctly scoped, surface-export bug was undiagnosable before decode landed (all-zeros NV12 produces solid colors regardless of export geometry). - Slice-header parser: correct, no off-by-one. Two non-bugs to comment (MMCO 5 implicitly handled, POC type 2 implicitly correct). - Phase 7 criterion: right. Both vaapi and vaapi-copy paths required. chromium-fourier 149 should be removed from libva corpus (it uses chromium's own internal V4L2 decoder, bypasses libva entirely). - Slice-header parser in-fork is right call vs upstream VAAPI lobbying. - DEBUG patches stay until DMA-BUF path works end-to-end, then clean sweep before any bootlin snapshot. - FRAME_BASED hardcode is correct for ohm; revisit at fresnel/boltzmann. Surface-export attack plan (5 load-bearing items): 1. Log RequestExportSurfaceHandle descriptor (object size, plane offsets, pitches, fd). 2. Log format_width/height from v4l2_get_format — confirm 1088 not 1080. 3. Run mpv --hwdec=vaapi --msg-level=vd=v --msg-level=vo=v, capture what mesa reports about imported DRM format/modifier/geometry. 4. Verify UV offset matches 1920×1088 (=2,088,960). If format_height is 1080 (=2,073,600), UV plane reads start 8 rows into Y data — most likely structural bug. 5. Read FFmpeg's hwcontext_drm.c for NV12 nb_objects + UV-offset computation pattern. Six previously-unsurfaced findings: 7.1 EACCES on VIDIOC_G_EXT_CTRLS deserves one more probe (try moving readback to before MEDIA_REQUEST_IOC_QUEUE). 7.2 num_ref_idx_l0/l1_default_active_minus1 from VASlice is wrong for multi-slice streams with per-slice override (defer). 7.3 SET_FORMAT_OF_OUTPUT_ONCE global breaks multi-resolution use (latent bug, document). 7.4 // HACK in surface.c is architecturally wrong for multi-codec consumers (latent bug, document). 7.5 Firefox non-IDR first frame (mid-stream seek) handling unverified. 7.6 fourier_attribution cell A wheat verdict mechanism is now Step 2 chromium-internal V4L2 decoder, not Step 1 libva. Co-Authored-By: Claude Opus 4.7 (1M context) --- phase5_review_2026-05-04.md | 293 ++++++++++++++++++++++++++++++++++++ 1 file changed, 293 insertions(+) create mode 100644 phase5_review_2026-05-04.md diff --git a/phase5_review_2026-05-04.md b/phase5_review_2026-05-04.md new file mode 100644 index 0000000..4b5334e --- /dev/null +++ b/phase5_review_2026-05-04.md @@ -0,0 +1,293 @@ +# Phase 5 Second-Model Review — libva-multiplanar, 2026-05-04 + +Reviewer: Plan subagent with `model: sonnet`. Open-consultation pattern per the predecessor cadence (`fourier_attribution` reviewer response). + +Operator transcribes this to DokuWiki for the campaign record. The text below is the reviewer's verbatim output, NOT a curated summary. + +--- + +This review is based on direct reading of all artifacts listed. No summaries were read first. + +--- + +### Question 1: Substrate and Phase 0 verdict journey — should the campaign have gotten there sooner? + +**Short answer: No, the process was not defective, but there is one structural lesson.** + +The three-revision Phase 0 trajectory was: +1. f15ba8b — "picture holds" (contract clean, pixel content unchecked) +2. e892cea — "kernel produces no decoded pixel output, sentinel survives" (sentinel test trusted) +3. This finding — "kernel writes all zeros, sentinel test has cache-coherency bug" + +The root cause of the delay is simple: the patch-0011 sentinel test was the designated pixel-content verification mechanism, and it had a cache-coherency bug that made it report "kernel never wrote" when the kernel had in fact written zeros. That bug predates this campaign (it was authored in the predecessor `ohm_gl_fix` series). The `feedback_dev_process.md` discipline says "acquire data in-session" — which was followed — but the tool used to acquire the data was broken. + +**What was structurally missed**: the predecessor Step 1 series committed patch-0011 with an explicit "Empirical question — does hantro tolerate bit_size fields being zero, or do we need a slice_header parser?" comment, and the empirical answer was supposed to come from the sentinel readback. But the sentinel readback was flawed. A Phase 0 invariant should have been: "the sentinel readback must be validated by a second check (either a real VO or DMA_BUF_IOCTL_SYNC) before being trusted as ground truth." The kernel-trace finding added this discipline, which is good. But it should have been part of the original patch-0011 design. + +**The framing was wrong at the source.** The original 2026-04-26 STUDY claim "vainfo + mpv probes work end-to-end" conflated two layers: "libva engagement" (ioctl lifecycle succeeds) and "pixel output" (decoded frame contains real data). These are not the same thing for a V4L2 stateless decoder, because the kernel will signal DQBUF success even when the hardware bitstream parser silently produces zeros. The predecessor campaign's "end-to-end" language covered only layer one. In future campaigns on V4L2 stateless decoders, "end-to-end" must be operationally defined as "pixel-content verified via a VO that does cache invalidation." + +The three revisions across one day represent discovery, not dysfunction. The process found the root cause. A minor improvement: add "pixel-content check using a cache-invalidating path" as a mandatory Phase 0 deliverable for any V4L2 stateless backend work. + +--- + +### Question 2: Phase 4 plan — should surface-export work have been included upfront? + +**No, one bug at a time is the right cadence here. The plan was appropriately scoped.** + +The `diff_against_ffmpeg.md` plan was correct. It correctly identified the load-bearing fix (bit-size fields → MMIO writes) via a kernel source audit — the right methodology. The surface-export bug was invisible at Phase 4 time because: + +1. The vaapi-copy path (which does surface export differently — it copies pixels to CPU memory first) was the only path that could show pixel output at all during Phase 4. +2. You cannot diagnose a DMA-BUF export bug when the content being exported is all-zeros. The Phase 7 blue frame is blue regardless of whether the export geometry is correct — all-zeros NV12 with wrong colorspace assumptions and all-zeros NV12 with wrong plane offsets both produce solid colors; you can't distinguish them. + +The surface-export bug was only diagnosable after the decode fix landed and a real VO (vaapi path, not vaapi-copy) was tested. Phase 7 provided exactly that gate. This is the correct sequencing. + +If the Phase 4 plan had included surface-export work upfront, it would have been speculative — the bug hadn't been observed yet with real content. Adding speculative work to a Phase 4 plan violates the "contract before code" discipline the campaign correctly values. + +The one thing the Phase 4 plan could have noted: "after the decode fix lands, test both vaapi-copy and vaapi (DMA-BUF) paths — they have different code routes through surface.c." That forward-pointer would have saved nothing time-wise (the Phase 7 tests were run immediately after Phase 6), but it would have been cleaner documentation. + +--- + +### Question 3: Phase 6 implementation — slice-header bit-parser code review + +I read `src/h264_slice_header.c` and `src/h264_slice_header.h` in full. The empirical validation (frame 0 IDR: refmark_bits=2, frame 1 P: refmark_bits=1, frame 2 B: refmark_bits=0, poc_bits=8 matching log2_max_poc_lsb_minus4+4=8) is internally consistent and correct. The structure is sound. + +**Things that are correct:** + +The bit reader (`br_read_u`, `br_read_ue`, `br_read_se`) is correct. The RBSP unescape is correct — it properly handles the `00 00 03` emulation prevention sequence. The 64-byte scan window is generous for slice headers; any realistic H.264 slice header (including CABAC-mode, explicit ref_pic_list_modification, and adaptive dec_ref_pic_marking) fits inside 32 bytes before hitting the slice data. + +The exp-Golomb decoder is correct: `br_read_ue` counts leading zeros, reads `zeros` more bits, returns `(1 << zeros) - 1 + br_read_u(b, zeros)`. The signed mapping in `br_read_se` is also correct: odd values map to positive (ceil(v/2)), even values to negative (-v/2). + +The `pic_order_cnt_bit_size` measurement (measuring from the start of the POC syntax element to the end, encompassing `pic_order_cnt_lsb` for type 0, or `delta_pic_order_cnt[0..1]` for type 1, or zero bits for type 2) correctly mirrors what FFmpeg's `h264_slice.c` computes as `sl->pic_order_cnt_bit_size`. + +The `dec_ref_pic_marking_bit_size` measurement correctly starts at the first bit of `dec_ref_pic_marking()` syntax and ends after all MMCO operations. For IDR: 2 bits (no_output_of_prior_pics_flag + long_term_reference_flag). For non-IDR with nal_ref_idc=0: 0 bits (correct — the `if (ctx->nal_ref_idc != 0)` guard handles this). For non-IDR with nal_ref_idc≠0 and adaptive_ref_pic_marking_mode_flag=0: 1 bit. All match the empirical output from Phase 6. + +**Actual bug: MMCO5 (memory_management_control_operation=5) handling is missing.** + +In `dec_ref_pic_marking()` at lines 339–348, the MMCO loop handles operations 1, 2, 3, 4, 6. It does not handle MMCO 5 (`mark all short-term ref as unused`) which has no parameters — it is a standalone `memory_management_control_operation_equal_5` code with no trailing syntax elements. In the current code, if an MMCO 5 is encountered, the loop body reads no extra fields (because none of the `if (mmco == N)` branches fire), then tests `while (mmco != 0)` — since `mmco=5 != 0`, it loops again and reads the next MMCO value correctly. So MMCO 5 is accidentally handled correctly because it has no parameters. This is not a bug in practice but it is fragile code — a comment should note that MMCO 5 intentionally has no parameters. + +MMCO 6 (assign long_term_frame_idx to current picture) is handled correctly (reads `long_term_frame_idx`). + +**Potential off-by-one in exp-Golomb for zeros=0 case:** + +```c +static uint32_t br_read_ue(struct br *b) +{ + unsigned zeros = 0; + while (br_read_u(b, 1) == 0) { + if (b->error || ++zeros >= 32) + return 0; + } + if (zeros == 0) + return 0; + return (1u << zeros) - 1u + br_read_u(b, zeros); +} +``` + +When `zeros=0`: the while loop exited immediately (the first bit read was 1). `if (zeros == 0) return 0` — returns 0. This is correct: exp-Golomb code `1` = value 0. ✓ + +When `zeros=1`: reads one more bit, returns `(2-1) + bit = 1 + bit`, giving values 1 or 2. Correct: `010` → 1, `011` → 2. ✓ + +The early-exit guard `++zeros >= 32` fires for exp-Golomb codes with 32+ leading zeros — which cannot appear in a valid H.264 bitstream. This is safe defensive coding. + +**Missing: `pic_order_cnt_type == 2` branch in the POC measurement.** + +```c +if (ctx->pic_order_cnt_type == 0) { ... } +else if (ctx->pic_order_cnt_type == 1 && !ctx->delta_pic_order_always_zero_flag) { ... } +// type 2: no syntax elements in slice header — pic_order_cnt_bit_size = 0 +``` + +For type 2, no syntax falls through to the measurement, and `pic_order_cnt_bit_size = 0`. The hardware register `G1_REG_DEC_CTRL6_POC_LENGTH` receives 0. Is that correct? Looking at `hantro_g1_h264_dec.c::set_params()`: it writes `G1_REG_DEC_CTRL6_POC_LENGTH(dec_param->pic_order_cnt_bit_size)`. For POC type 2, the spec says `pic_order_cnt_bit_size` should be 0 (there are no POC bits in the slice header). So returning 0 here is correct, and the code handles it by simply not entering either `if` or `else if` branch. The comment about `type 2 means 0 bits` is absent — worth adding. + +**Edge case: `separate_colour_plane_flag` and `colour_plane_id`.** + +The code reads `colour_plane_id` (2 bits) when `separate_colour_plane_flag` is set. This is correct per spec §7.3.3. For hantro NV12 streams, `chroma_format_idc=1` and `separate_colour_plane_flag=0` always, so this path is never hit. But the code is shaped to handle it gracefully. + +**Edge case: interlaced (field_pic_flag).** + +```c +if (!ctx->frame_mbs_only_flag) { + field_pic_flag = (br_read_u(&b, 1) != 0); + if (field_pic_flag) + (void)br_read_u(&b, 1); /* bottom_field_flag */ +} +``` + +This is correct. For hantro RK3568 in FRAME_BASED mode with the BBB test clip (progressive), `frame_mbs_only_flag=1` and this block is never entered. For interlaced streams, the field_pic_flag correctly gates the `delta_pic_order_cnt_bottom` read in the POC type 0 branch: + +```c +if (ctx->bottom_field_pic_order_in_frame_present_flag && !field_pic_flag) + out->delta_pic_order_cnt_bottom = br_read_se(&b); +``` + +This is correct per §7.3.3: `delta_pic_order_cnt_bottom` is present only for frame pictures (not field). ✓ + +**Edge case: multi-slice streams (for `dec_ref_pic_marking_bit_size`).** + +The commit message notes: "for FRAME_BASED mode that's fine — kernel sees the whole bitstream and parses subsequent slices itself." This is correct for hantro in FRAME_BASED mode, where the kernel receives the full bitstream and uses the `dec_ref_pic_marking_bit_size` only to know how many bits to skip in the *first* slice's header. Subsequent slices are parsed by the hardware. The implementation correctly parses only the first slice header. + +**Missing edge case: SP/SI slices.** + +Slice type 3 (SP) and slice type 4 (SI) are in the `num_ref_idx_l0_active_minus1` override block via the condition `if (st == 0 || st == 3 || st == 1)`. SP (type 3 mod 5 = 3) is included. For SI (type 4 mod 5 = 4), the block doesn't fire — which is correct, because SI slices don't have `num_ref_idx_active_override_flag`. ✓ + +However: in `skip_ref_pic_list_modification()`, the condition `if (st_mod5 != 2 && st_mod5 != 4)` correctly excludes both I (type 2) and SI (type 4) from having ref_pic_list_modification. ✓ + +**Summary of code quality:** The parser is well-structured, correctly handles the common cases relevant to this hardware, and will fail gracefully (returning `-EIO`) on parse errors. The empirical validation output matches the expected values. The two non-issues I found (MMCO 5 implicitly handled, POC type 2 implicitly correct) are worth documenting but are not bugs. There are no off-by-one errors in the exp-Golomb decoder. + +--- + +### Question 4: Phase 7 verification — is the criterion right? + +**The criterion is right. Both paths should be required, and chromium-fourier should be removed from the libva corpus.** + +**On requiring both vaapi-copy and vaapi paths:** + +The Phase 7 finding correctly identifies that vaapi-copy and vaapi (DMA-BUF) are different code paths with different bugs. The vaapi-copy path exercises decode correctness but bypasses the surface-export logic in `RequestExportSurfaceHandle`. The vaapi path exercises the full stack including export. Requiring both is the right call. + +The counterargument would be "vaapi-copy is sufficient for Phase 1 boolean correctness" — but Phase 1's criterion says "providing access to the hardware decoder." For Firefox, the primary real-world path is vaapi (DMA-BUF); Firefox falls back to SW when the DMA-BUF path fails. A libva backend that passes vaapi-copy but fails vaapi is not providing meaningful access to the hardware decoder for Firefox. The criterion should require both. + +**On chromium-fourier 149:** + +Phase 7 definitively established that chromium-fourier 149 uses its own internal V4L2 stateless decoder (`media/gpu/v4l2/v4l2_video_decoder_backend_stateless`), which bypasses libva entirely. It is not a libva consumer for H.264. Including it in the libva corpus is actively misleading — it creates a false positive that could mask libva regressions. + +The correct disposition: remove chromium-fourier from the Phase 7 libva verification corpus, keep it as a performance reference (it demonstrates HW decode via its own path), and note explicitly in the campaign record that "chromium-fourier uses Step 2 patches to enable its own internal V4L2 decoder; libva-v4l2-request is not on that code path." Cell E (vanilla Chromium 149) would be the correct libva validator if Chromium's VAAPI path can be activated — but that's a separate investigation. + +**What the corpus should be after this clarification:** + +- vainfo: smoke test (enumerate-only; always ✓ if driver loads) +- mpv `--hwdec=vaapi-copy --vo=gpu` in a live session: decode correctness check +- mpv `--hwdec=vaapi --vo=gpu` in a live session: surface-export check (the next iteration's target) +- Firefox 150 live session with media.ffmpeg.vaapi.enabled: real-consumer test (currently blocked by the DMA-BUF bug) + +--- + +### Question 5: Architecture decisions worth questioning + +**A. Slice-header parser inside libva-v4l2-request vs upstream VAAPI lobbying** + +The in-fork parser is the right decision for this campaign's scope and timeline. The upstream VAAPI argument would be: "add `idr_pic_id`, `pic_order_cnt_bit_size`, `dec_ref_pic_marking_bit_size` to `VASliceParameterBufferH264` and have the consumer (FFmpeg's decoder, Firefox's RDD, etc.) populate them." This is defensible on principle — VAAPI is already a parsed-parameter API, and these are parsed slice-header values. + +But the practical problems with the upstream route: (1) `VASliceParameterBufferH264` is a stable ABI; adding fields requires a VAAPI minor version bump and the cooperation of Intel, AMD, and ARM GPU driver maintainers; (2) every consumer that currently calls `vaRenderPicture` would need to be updated to populate the new fields; (3) bootlin/libva-v4l2-request is dormant — there's no upstream process to lobby; (4) the alternative backend (GStreamer's gstv4l2slh264dec) carries the equivalent parse logic itself. + +The precedent from FFmpeg is instructive: FFmpeg's `v4l2_request_h264.c` uses values pre-parsed by its own H.264 decoder (`H264SliceContext::ref_pic_marking_bit_size`, `pic_order_cnt_bit_size`) — not from VAAPI. It works because FFmpeg controls both the decoder and the V4L2 backend. In a VAAPI backend, the V4L2-request backend doesn't have access to the decoder's internal state — it only has the VAAPI API surface. So the bit-parser in the backend is the only viable option absent an upstream VAAPI extension. + +The parser adds approximately 250 lines that are self-contained, well-commented, and have empirical validation. The cost is low; the alternative is nonexistent on the relevant timeline. + +**B. DEBUG patches in production code** + +Patches 0010 (hex dump), 0011 (sentinel, now fixed), 0014 (VAPic dump), and the new 3F readback (VIDIOC_G_EXT_CTRLS), plus the rate-limited EACCES warning — these should stay until one condition is met: the surface-export bug is fixed and both mpv vaapi-copy and mpv vaapi (DMA-BUF) paths show the bunny in a live session. + +The reason: the VIDIOC_G_EXT_CTRLS readback currently fails (EACCES) and that failure is not yet understood. "Not actionable from userspace" is a conclusion, but it's not a complete diagnosis. The readback bug suggests either: (a) hantro's kernel driver doesn't allow reading compound H.264 controls from a request fd that's in QUEUEING state (i.e., after MEDIA_REQUEST_IOC_QUEUE but before DQBUF); or (b) there's a permission check in the v4l2-core that blocks this. Until the surface-export bug is fixed and the system is working end-to-end, the DEBUG instrumentation is still load-bearing for diagnosis. + +After the vaapi (DMA-BUF) path is confirmed working, a clean sweep of all DEBUG patches is appropriate before any "this is what we'd show Bootlin" snapshot. At that point: revert 0010, 0011, 0014, 3E, 3F, and the rate-limited EACCES warning. The `utils.c` tee to `/tmp/libva-fourier.log` should also be removed. + +One caveat: the msync fix in 3E (cache-fix CAPTURE dump) is not debug-only if it reveals a real operational issue. But it operates on the mmap of a CAPTURE buffer, which is not used in the production decode path (the DMA-BUF export path exports the buffer fd directly without the CPU-side mmap). So removing 3E from production is safe. + +**C. FRAME_BASED hardcode vs FFmpeg's dynamic query** + +FFmpeg's `v4l2_request_h264_post_frames_ctx` queries `V4L2_CID_STATELESS_H264_DECODE_MODE` and `V4L2_CID_STATELESS_H264_START_CODE` dynamically. The fork hardcodes `DECODE_MODE_FRAME_BASED` and `START_CODE_ANNEX_B` in patch 0002. + +For ohm RK3568 hantro G1/G2, FRAME_BASED is the only supported mode — the kernel driver defines `hantro_g1_h264_dec_run` and there is no slice-based equivalent in the verisilicon driver. The hardcode is functionally correct for this hardware. + +The liability: if the fork is ever used on a device that supports SLICE_BASED mode (e.g., some Allwinner variants of hantro), the hardcode would send wrong controls and decode would fail silently. For the current campaign scope (ohm only, with fresnel and boltzmann as future iterations), this is a deferred risk that is explicitly documented. + +The right time to make this dynamic is when fresnel (RK3399, hantro G1 without G2) is added as a target — at that point, query and store the mode at init time, then branch appropriately in `h264_set_controls`. The query helpers (`v4l2_query_control_default_value`) are already in the fork (commit `58a0e8b`), so the migration is ~5 lines. + +--- + +### Question 6: Surface-export bug pre-attack reading + +Based on my reading of `src/surface.c::RequestExportSurfaceHandle` and the arithmetic above: + +**The descriptor structure itself (num_objects, num_layers, num_planes, object_index) is correct** for single-fd NV12. The pattern — 1 object, 1 layer, 2 planes, both pointing to object_index=0 with different offsets — is the standard way to export NV12 from a single DMA-BUF. FFmpeg's `hwcontext_vaapi.c` (and libva's own test suite) use exactly this structure. + +**The plane offsets are probably correct.** UV offset = `destination_sizes[0]` = `bytesperlines[0] * format_height` = 1920 × 1088 = 2,088,960. This is where hantro G1 places the UV plane in its linear NV12 output (Y first, UV immediately following, DMV/internal state appended after in the oversized allocation). The export `objects[0].size` (3,133,440 = Y_size + UV_size, not the full 3,655,712 sizeimage) is undersized relative to the actual DMA-BUF allocation, but the offsets it describes fall within that 3,133,440 boundary, so Mesa validation should pass. + +**The most likely root cause is the colorspace/range metadata path, not the geometry.** Here is the reasoning: + +When mpv uses `--hwdec=vaapi-copy`, it calls `vaMapBuffer` which triggers a CPU copy of the decoded NV12 frame. mpv knows it's HD (1080p) content and applies BT.709 limited-range as the colorspace for the GL upload. This produces the correct green for all-zeros NV12 during Phase 0, and the bunny for real pixels after Phase 6. + +When mpv uses `--hwdec=vaapi` (DMA-BUF path), it imports the DMA-BUF as an EGL image and creates a GL texture. The colorspace for the YCbCr→RGB conversion must come from somewhere. The `VADRMPRIMESurfaceDescriptor` has no colorspace field. mpv's DMA-BUF import path may default to a different colorspace (BT.601, or full-range vs limited-range) than what mpv's CPU path uses. That would explain the blue vs green color — both from all-zeros NV12, but different matrix. + +**But solid blue from real pixels (the post-Phase-6 Phase 7 result) cannot be explained by colorspace alone.** If colorspace were the only issue, we'd see a color-shifted but recognizable bunny, not a solid field. Solid blue means the luma channel is being read as near-zero. This means either: + +1. The UV offset is wrong by enough that both Y and UV data are garbage — but the arithmetic says the offset is correct. +2. The DMA-BUF fd being exported points to the wrong buffer (e.g., a buffer that hasn't been decoded yet, or the incorrect index). +3. Mesa's EGL DMA-BUF import for single-object NV12 is failing silently and falling back to an error state that renders solid blue. Some Mesa versions require the two NV12 planes in two separate DMA-BUF fds (NV12M format) rather than a single fd with two offsets. + +**The most productive first debug step**: add a `request_log` in `RequestExportSurfaceHandle` that logs `surface_object->destination_index`, the exported fd number, `objects[0].size`, `layers[0].offset[0]`, `layers[0].offset[1]`, `layers[0].pitch[0]`, and `layers[0].pitch[1]`. Then run mpv `--hwdec=vaapi` with `LIBVA_MESSAGING_LEVEL=2` and `--msg-level=vd=v --msg-level=vo=v`. The mpv output will show what format/modifier it's importing, and the libva log will show what we're exporting. Comparing the two will identify any mismatch. + +**The second thing to check in FFmpeg's `hwcontext_v4l2request.c`**: look for how it handles the DMA-BUF export, specifically whether it exports via `vaExportSurfaceHandle` or via a direct `VIDIOC_EXPBUF` call, and whether it sets any EGL colorspace attributes on the image import. FFmpeg's v4l2_request path exports `AV_PIX_FMT_DRM_PRIME` frames — the descriptor building happens in `libavutil/hwcontext_drm.c`, not in the libva-v4l2-request backend. This is worth reading to see if FFmpeg sets `AVDRMFrameDescriptor.nb_objects` to 2 (NV12M-style) vs 1 (packed NV12 style) for hantro output. + +**What to look for in `src/surface.c` specifically before attacking**: check whether `format_height` from `v4l2_get_format` is the MB-aligned height (1088) or the stream-signaled height (1080). If `v4l2_get_format` returns format_height=1080 for a 1080p stream, then `destination_sizes[0] = 1920 × 1080 = 2,073,600`, and the UV offset would be 2,073,600 instead of 2,088,960 — which is 15,360 bytes too early, putting the UV read 8 complete rows into the luma data. That would cause a severe corruption that could appear as near-solid color on large homogeneous regions. This is worth verifying with a `request_log` on `format_width`, `format_height` at the top of `RequestCreateSurfaces2`. + +--- + +### Question 7: Things not surfaced + +**7.1 The EACCES from VIDIOC_G_EXT_CTRLS on the request fd needs investigation before closing the campaign.** + +Commit `6be3f3b` rate-limits this to "log once per process, not actionable." But it is actionable — it means the Tier 3F readback (the control-write verification mechanism) does not function on this rig. This could be a Linux 6.19.x v4l2-core behavior change. The investigation is: check whether the EACCES comes from the request fd being in QUEUEING state at the time of the read (post-MEDIA_REQUEST_IOC_QUEUE, pre-DQBUF). If so, move the readback to before MEDIA_REQUEST_IOC_QUEUE — which is actually the more useful timing anyway (verify the write before firing the request). The current code does readback after `v4l2_set_controls` but before `MEDIA_REQUEST_IOC_QUEUE`, which should be pre-QUEUE state. If EACCES still fires there, the issue is something else (permissions on the request fd, kernel check for compound controls specifically). This deserves one more probe before being called "not actionable." + +**7.2 The `num_ref_idx_l0/l1_default_active_minus1` source is slightly wrong for multi-slice streams.** + +Patch `d41a4b9` populates these from `surface->params.h264.slice.num_ref_idx_l0_active_minus1` (the per-slice VASlice field). The comment in the patch says: "VAAPI doesn't expose the parsed-PPS default fields; the per-slice override is the closest available source (matches PPS default except on streams with explicit per-slice override)." + +For single-slice FRAME_BASED streams this is fine. But for multi-slice streams: if slice N overrides `num_ref_idx_l0_active` with a per-slice value, but slice N+1 does not override it (falls back to the PPS default), the `h264_slice_header_context` passed to the bit-parser for slice N+1 will have the wrong default. This is only an issue in multi-slice FRAME_BASED mode, which hantro does support. For the BBB test clip it doesn't matter. For production use on arbitrary H.264 streams, this is worth noting — though fixing it would require the PPS-default value be recovered from the VAAPI pic_params (which doesn't expose `num_ref_idx_l0_default_active_minus1` directly in `VAPictureParameterBufferH264`). The correct long-term fix is to parse it from the PPS or track it per-picture, not per-slice. + +**7.3 The `static bool SET_FORMAT_OF_OUTPUT_ONCE` global in surface.c is a time bomb.** + +```c +bool SET_FORMAT_OF_OUTPUT_ONCE = false; +``` + +This is a global that prevents `v4l2_set_format` from being called on the OUTPUT queue more than once per process lifetime. If a consumer (Firefox, mpv) creates a new VA context for a different resolution (e.g., starts a second video in a tab), this flag would prevent re-setting the format. The result would be that the second video's OUTPUT format is wrong (still 1920×1088 from the first video), which would produce silent decode failure or corruption. + +For Phase 1 boolean correctness with a single test clip, this doesn't matter. But it's a latent correctness bug that should be noted in a worklist item before the campaign closes. + +**7.4 The `// HACK` block in surface.c calling `v4l2_set_format` before `vaCreateContext` is architecturally wrong.** + +The comment says this is a workaround because `v4l2_set_format` must be called before any `CREATE_BUFS`. But the `v4l2_set_format` call here uses a hardcoded `V4L2_PIX_FMT_H264_SLICE` pixelformat and `width/height` from the surface dimensions. For a VAAPI backend that supports multiple codecs (H.264 + MPEG-2), this would fail for MPEG-2 surfaces. The correct architecture is to set the OUTPUT format in `RequestCreateContext` (which knows the codec) with proper sequencing before the first `CREATE_BUFS`. The WIP STREAMON-defer commit (`44a7327`) was an incomplete attempt at this. The HACK should be documented as a known architectural debt. + +**7.5 The campaign has not yet verified that Firefox falls back gracefully for non-IDR first frames.** + +Phase 7 shows Firefox attempted one frame (IDR, frame 0), got the DMA-BUF result, and fell back. The slice_header parse log shows only one parse entry. If Firefox ever sent a non-IDR first frame (e.g., seeking into the middle of a stream), and the bit-parser had an issue with non-IDR frames, the fallback would be the same SW path. This is not urgent for the current campaign but is worth noting as a future verification case. + +**7.6 The `fourier_attribution` cell A wheat verdict is now better understood but still slightly confused.** + +Cell A (chromium-fourier 149, browser_cpu_median=54%, fps=24.0) showed hardware decode happening, confirmed visually. Phase 7 established that chromium-fourier uses its own internal V4L2 stateless decoder, not libva-v4l2-request. So the "multi-planar libva is the binding decode-side enabler" claim in the `fourier_attribution` conclusion was wrong about the mechanism (it's Step 2's Chromium-internal decoder, not Step 1's libva backend, that enables HW decode for chromium-fourier). The 83 pp browser-CPU difference between Cell A and Cell B (Brave 1.89) is real, but the attribution is "Step 2 chromium internal V4L2 decoder" not "Step 1 libva backend." + +This doesn't invalidate the cell A wheat verdict — HW decode IS happening for chromium-fourier — but it does mean Cell E (vanilla Chromium 149) disambiguation will be interesting: if vanilla Chromium 149 WITHOUT Step 2 also gets HW decode, it would come from its own V4L2 path (potentially). If it doesn't, then Step 2 specifically is what enables it, independent of any libva work. This disambiguation is useful for properly scoping what this campaign's libva deliverable buys for chromium users. + +--- + +### Summary prioritization + +**Load-bearing for the next Phase 4→6 cycle (surface-export fix):** + +1. Add `request_log` to `RequestExportSurfaceHandle` to dump the exported descriptor before starting bisection. +2. Log `format_width` and `format_height` from `v4l2_get_format` at the start of `RequestCreateSurfaces2` — confirm format_height is 1088 (MB-aligned), not 1080 (stream height). +3. Run mpv `--hwdec=vaapi --msg-level=vd=v --msg-level=vo=v` to see what mesa/mpv reports about the imported DRM format, modifier, and plane geometry. +4. Check if the real UV offset in the hantro G1 buffer matches 1920×1088. If format_height from G_FMT is 1080 (not 1088), the UV offset is wrong by 15,360 bytes — this is the most likely structural bug. +5. Read FFmpeg's `hwcontext_drm.c` for how it builds the DRM frame descriptor for NV12 from V4L2 buffers (specifically `nb_objects` and how UV plane offset is computed). + +**Nice-to-have (before campaign close):** + +6. Investigate VIDIOC_G_EXT_CTRLS EACCES — move readback to before MEDIA_REQUEST_IOC_QUEUE if possible. +7. Document the `SET_FORMAT_OF_OUTPUT_ONCE` global as a known limitation. +8. Add MMCO 5 comment to the dec_ref_pic_marking loop. +9. Add POC type 2 comment to the pic_order_cnt_bit_size measurement block. + +**Deferred to after campaign close:** + +10. Dynamic DECODE_MODE query (for fresnel/boltzmann iterations). +11. `num_ref_idx` from PPS instead of VASlice default (multi-slice correctness). +12. VAAPI upstream lobbying for `pic_order_cnt_bit_size` field (future, different campaign). +13. Cleanup of all DEBUG patches. + +### Critical Files for Implementation + +- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/surface.c` +- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/h264_slice_header.c` +- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/h264.c` +- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/video.c` +- `/home/mfritsche/src/libva-multiplanar/references/linux-mainline/drivers/media/platform/verisilicon/hantro_g1_h264_dec.c`