Files
libva-multiplanar/phase5_review_2026-05-04.md
marfrit a2d5210ee9 Phase 5 review (sonnet): substrate sound, parser correct, surface-export attack plan + 6 unsurfaced findings
Plan subagent with model: sonnet, open-consultation pattern. Raw
verbatim review text — operator transcribes to DokuWiki.

Headlines:
- Phase 0 substrate: process not defective, but "end-to-end" framing
  conflated libva engagement with pixel output. New invariant for V4L2
  stateless work: pixel-content check via cache-invalidating path is
  mandatory Phase 0 deliverable.
- Phase 4 plan: correctly scoped, surface-export bug was undiagnosable
  before decode landed (all-zeros NV12 produces solid colors regardless
  of export geometry).
- Slice-header parser: correct, no off-by-one. Two non-bugs to comment
  (MMCO 5 implicitly handled, POC type 2 implicitly correct).
- Phase 7 criterion: right. Both vaapi and vaapi-copy paths required.
  chromium-fourier 149 should be removed from libva corpus (it uses
  chromium's own internal V4L2 decoder, bypasses libva entirely).
- Slice-header parser in-fork is right call vs upstream VAAPI lobbying.
- DEBUG patches stay until DMA-BUF path works end-to-end, then clean
  sweep before any bootlin snapshot.
- FRAME_BASED hardcode is correct for ohm; revisit at fresnel/boltzmann.

Surface-export attack plan (5 load-bearing items):
1. Log RequestExportSurfaceHandle descriptor (object size, plane
   offsets, pitches, fd).
2. Log format_width/height from v4l2_get_format — confirm 1088 not 1080.
3. Run mpv --hwdec=vaapi --msg-level=vd=v --msg-level=vo=v, capture
   what mesa reports about imported DRM format/modifier/geometry.
4. Verify UV offset matches 1920×1088 (=2,088,960). If format_height
   is 1080 (=2,073,600), UV plane reads start 8 rows into Y data —
   most likely structural bug.
5. Read FFmpeg's hwcontext_drm.c for NV12 nb_objects + UV-offset
   computation pattern.

Six previously-unsurfaced findings:
7.1 EACCES on VIDIOC_G_EXT_CTRLS deserves one more probe (try moving
    readback to before MEDIA_REQUEST_IOC_QUEUE).
7.2 num_ref_idx_l0/l1_default_active_minus1 from VASlice is wrong for
    multi-slice streams with per-slice override (defer).
7.3 SET_FORMAT_OF_OUTPUT_ONCE global breaks multi-resolution use
    (latent bug, document).
7.4 // HACK in surface.c is architecturally wrong for multi-codec
    consumers (latent bug, document).
7.5 Firefox non-IDR first frame (mid-stream seek) handling unverified.
7.6 fourier_attribution cell A wheat verdict mechanism is now Step 2
    chromium-internal V4L2 decoder, not Step 1 libva.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:56:08 +00:00

30 KiB
Raw Permalink Blame History

Phase 5 Second-Model Review — libva-multiplanar, 2026-05-04

Reviewer: Plan subagent with model: sonnet. Open-consultation pattern per the predecessor cadence (fourier_attribution reviewer response).

Operator transcribes this to DokuWiki for the campaign record. The text below is the reviewer's verbatim output, NOT a curated summary.


This review is based on direct reading of all artifacts listed. No summaries were read first.


Question 1: Substrate and Phase 0 verdict journey — should the campaign have gotten there sooner?

Short answer: No, the process was not defective, but there is one structural lesson.

The three-revision Phase 0 trajectory was:

  1. f15ba8b — "picture holds" (contract clean, pixel content unchecked)
  2. e892cea — "kernel produces no decoded pixel output, sentinel survives" (sentinel test trusted)
  3. This finding — "kernel writes all zeros, sentinel test has cache-coherency bug"

The root cause of the delay is simple: the patch-0011 sentinel test was the designated pixel-content verification mechanism, and it had a cache-coherency bug that made it report "kernel never wrote" when the kernel had in fact written zeros. That bug predates this campaign (it was authored in the predecessor ohm_gl_fix series). The feedback_dev_process.md discipline says "acquire data in-session" — which was followed — but the tool used to acquire the data was broken.

What was structurally missed: the predecessor Step 1 series committed patch-0011 with an explicit "Empirical question — does hantro tolerate bit_size fields being zero, or do we need a slice_header parser?" comment, and the empirical answer was supposed to come from the sentinel readback. But the sentinel readback was flawed. A Phase 0 invariant should have been: "the sentinel readback must be validated by a second check (either a real VO or DMA_BUF_IOCTL_SYNC) before being trusted as ground truth." The kernel-trace finding added this discipline, which is good. But it should have been part of the original patch-0011 design.

The framing was wrong at the source. The original 2026-04-26 STUDY claim "vainfo + mpv probes work end-to-end" conflated two layers: "libva engagement" (ioctl lifecycle succeeds) and "pixel output" (decoded frame contains real data). These are not the same thing for a V4L2 stateless decoder, because the kernel will signal DQBUF success even when the hardware bitstream parser silently produces zeros. The predecessor campaign's "end-to-end" language covered only layer one. In future campaigns on V4L2 stateless decoders, "end-to-end" must be operationally defined as "pixel-content verified via a VO that does cache invalidation."

The three revisions across one day represent discovery, not dysfunction. The process found the root cause. A minor improvement: add "pixel-content check using a cache-invalidating path" as a mandatory Phase 0 deliverable for any V4L2 stateless backend work.


Question 2: Phase 4 plan — should surface-export work have been included upfront?

No, one bug at a time is the right cadence here. The plan was appropriately scoped.

The diff_against_ffmpeg.md plan was correct. It correctly identified the load-bearing fix (bit-size fields → MMIO writes) via a kernel source audit — the right methodology. The surface-export bug was invisible at Phase 4 time because:

  1. The vaapi-copy path (which does surface export differently — it copies pixels to CPU memory first) was the only path that could show pixel output at all during Phase 4.
  2. You cannot diagnose a DMA-BUF export bug when the content being exported is all-zeros. The Phase 7 blue frame is blue regardless of whether the export geometry is correct — all-zeros NV12 with wrong colorspace assumptions and all-zeros NV12 with wrong plane offsets both produce solid colors; you can't distinguish them.

The surface-export bug was only diagnosable after the decode fix landed and a real VO (vaapi path, not vaapi-copy) was tested. Phase 7 provided exactly that gate. This is the correct sequencing.

If the Phase 4 plan had included surface-export work upfront, it would have been speculative — the bug hadn't been observed yet with real content. Adding speculative work to a Phase 4 plan violates the "contract before code" discipline the campaign correctly values.

The one thing the Phase 4 plan could have noted: "after the decode fix lands, test both vaapi-copy and vaapi (DMA-BUF) paths — they have different code routes through surface.c." That forward-pointer would have saved nothing time-wise (the Phase 7 tests were run immediately after Phase 6), but it would have been cleaner documentation.


Question 3: Phase 6 implementation — slice-header bit-parser code review

I read src/h264_slice_header.c and src/h264_slice_header.h in full. The empirical validation (frame 0 IDR: refmark_bits=2, frame 1 P: refmark_bits=1, frame 2 B: refmark_bits=0, poc_bits=8 matching log2_max_poc_lsb_minus4+4=8) is internally consistent and correct. The structure is sound.

Things that are correct:

The bit reader (br_read_u, br_read_ue, br_read_se) is correct. The RBSP unescape is correct — it properly handles the 00 00 03 emulation prevention sequence. The 64-byte scan window is generous for slice headers; any realistic H.264 slice header (including CABAC-mode, explicit ref_pic_list_modification, and adaptive dec_ref_pic_marking) fits inside 32 bytes before hitting the slice data.

The exp-Golomb decoder is correct: br_read_ue counts leading zeros, reads zeros more bits, returns (1 << zeros) - 1 + br_read_u(b, zeros). The signed mapping in br_read_se is also correct: odd values map to positive (ceil(v/2)), even values to negative (-v/2).

The pic_order_cnt_bit_size measurement (measuring from the start of the POC syntax element to the end, encompassing pic_order_cnt_lsb for type 0, or delta_pic_order_cnt[0..1] for type 1, or zero bits for type 2) correctly mirrors what FFmpeg's h264_slice.c computes as sl->pic_order_cnt_bit_size.

The dec_ref_pic_marking_bit_size measurement correctly starts at the first bit of dec_ref_pic_marking() syntax and ends after all MMCO operations. For IDR: 2 bits (no_output_of_prior_pics_flag + long_term_reference_flag). For non-IDR with nal_ref_idc=0: 0 bits (correct — the if (ctx->nal_ref_idc != 0) guard handles this). For non-IDR with nal_ref_idc≠0 and adaptive_ref_pic_marking_mode_flag=0: 1 bit. All match the empirical output from Phase 6.

Actual bug: MMCO5 (memory_management_control_operation=5) handling is missing.

In dec_ref_pic_marking() at lines 339348, the MMCO loop handles operations 1, 2, 3, 4, 6. It does not handle MMCO 5 (mark all short-term ref as unused) which has no parameters — it is a standalone memory_management_control_operation_equal_5 code with no trailing syntax elements. In the current code, if an MMCO 5 is encountered, the loop body reads no extra fields (because none of the if (mmco == N) branches fire), then tests while (mmco != 0) — since mmco=5 != 0, it loops again and reads the next MMCO value correctly. So MMCO 5 is accidentally handled correctly because it has no parameters. This is not a bug in practice but it is fragile code — a comment should note that MMCO 5 intentionally has no parameters.

MMCO 6 (assign long_term_frame_idx to current picture) is handled correctly (reads long_term_frame_idx).

Potential off-by-one in exp-Golomb for zeros=0 case:

static uint32_t br_read_ue(struct br *b)
{
    unsigned zeros = 0;
    while (br_read_u(b, 1) == 0) {
        if (b->error || ++zeros >= 32)
            return 0;
    }
    if (zeros == 0)
        return 0;
    return (1u << zeros) - 1u + br_read_u(b, zeros);
}

When zeros=0: the while loop exited immediately (the first bit read was 1). if (zeros == 0) return 0 — returns 0. This is correct: exp-Golomb code 1 = value 0. ✓

When zeros=1: reads one more bit, returns (2-1) + bit = 1 + bit, giving values 1 or 2. Correct: 010 → 1, 011 → 2. ✓

The early-exit guard ++zeros >= 32 fires for exp-Golomb codes with 32+ leading zeros — which cannot appear in a valid H.264 bitstream. This is safe defensive coding.

Missing: pic_order_cnt_type == 2 branch in the POC measurement.

if (ctx->pic_order_cnt_type == 0) { ... }
else if (ctx->pic_order_cnt_type == 1 && !ctx->delta_pic_order_always_zero_flag) { ... }
// type 2: no syntax elements in slice header — pic_order_cnt_bit_size = 0

For type 2, no syntax falls through to the measurement, and pic_order_cnt_bit_size = 0. The hardware register G1_REG_DEC_CTRL6_POC_LENGTH receives 0. Is that correct? Looking at hantro_g1_h264_dec.c::set_params(): it writes G1_REG_DEC_CTRL6_POC_LENGTH(dec_param->pic_order_cnt_bit_size). For POC type 2, the spec says pic_order_cnt_bit_size should be 0 (there are no POC bits in the slice header). So returning 0 here is correct, and the code handles it by simply not entering either if or else if branch. The comment about type 2 means 0 bits is absent — worth adding.

Edge case: separate_colour_plane_flag and colour_plane_id.

The code reads colour_plane_id (2 bits) when separate_colour_plane_flag is set. This is correct per spec §7.3.3. For hantro NV12 streams, chroma_format_idc=1 and separate_colour_plane_flag=0 always, so this path is never hit. But the code is shaped to handle it gracefully.

Edge case: interlaced (field_pic_flag).

if (!ctx->frame_mbs_only_flag) {
    field_pic_flag = (br_read_u(&b, 1) != 0);
    if (field_pic_flag)
        (void)br_read_u(&b, 1); /* bottom_field_flag */
}

This is correct. For hantro RK3568 in FRAME_BASED mode with the BBB test clip (progressive), frame_mbs_only_flag=1 and this block is never entered. For interlaced streams, the field_pic_flag correctly gates the delta_pic_order_cnt_bottom read in the POC type 0 branch:

if (ctx->bottom_field_pic_order_in_frame_present_flag && !field_pic_flag)
    out->delta_pic_order_cnt_bottom = br_read_se(&b);

This is correct per §7.3.3: delta_pic_order_cnt_bottom is present only for frame pictures (not field). ✓

Edge case: multi-slice streams (for dec_ref_pic_marking_bit_size).

The commit message notes: "for FRAME_BASED mode that's fine — kernel sees the whole bitstream and parses subsequent slices itself." This is correct for hantro in FRAME_BASED mode, where the kernel receives the full bitstream and uses the dec_ref_pic_marking_bit_size only to know how many bits to skip in the first slice's header. Subsequent slices are parsed by the hardware. The implementation correctly parses only the first slice header.

Missing edge case: SP/SI slices.

Slice type 3 (SP) and slice type 4 (SI) are in the num_ref_idx_l0_active_minus1 override block via the condition if (st == 0 || st == 3 || st == 1). SP (type 3 mod 5 = 3) is included. For SI (type 4 mod 5 = 4), the block doesn't fire — which is correct, because SI slices don't have num_ref_idx_active_override_flag. ✓

However: in skip_ref_pic_list_modification(), the condition if (st_mod5 != 2 && st_mod5 != 4) correctly excludes both I (type 2) and SI (type 4) from having ref_pic_list_modification. ✓

Summary of code quality: The parser is well-structured, correctly handles the common cases relevant to this hardware, and will fail gracefully (returning -EIO) on parse errors. The empirical validation output matches the expected values. The two non-issues I found (MMCO 5 implicitly handled, POC type 2 implicitly correct) are worth documenting but are not bugs. There are no off-by-one errors in the exp-Golomb decoder.


Question 4: Phase 7 verification — is the criterion right?

The criterion is right. Both paths should be required, and chromium-fourier should be removed from the libva corpus.

On requiring both vaapi-copy and vaapi paths:

The Phase 7 finding correctly identifies that vaapi-copy and vaapi (DMA-BUF) are different code paths with different bugs. The vaapi-copy path exercises decode correctness but bypasses the surface-export logic in RequestExportSurfaceHandle. The vaapi path exercises the full stack including export. Requiring both is the right call.

The counterargument would be "vaapi-copy is sufficient for Phase 1 boolean correctness" — but Phase 1's criterion says "providing access to the hardware decoder." For Firefox, the primary real-world path is vaapi (DMA-BUF); Firefox falls back to SW when the DMA-BUF path fails. A libva backend that passes vaapi-copy but fails vaapi is not providing meaningful access to the hardware decoder for Firefox. The criterion should require both.

On chromium-fourier 149:

Phase 7 definitively established that chromium-fourier 149 uses its own internal V4L2 stateless decoder (media/gpu/v4l2/v4l2_video_decoder_backend_stateless), which bypasses libva entirely. It is not a libva consumer for H.264. Including it in the libva corpus is actively misleading — it creates a false positive that could mask libva regressions.

The correct disposition: remove chromium-fourier from the Phase 7 libva verification corpus, keep it as a performance reference (it demonstrates HW decode via its own path), and note explicitly in the campaign record that "chromium-fourier uses Step 2 patches to enable its own internal V4L2 decoder; libva-v4l2-request is not on that code path." Cell E (vanilla Chromium 149) would be the correct libva validator if Chromium's VAAPI path can be activated — but that's a separate investigation.

What the corpus should be after this clarification:

  • vainfo: smoke test (enumerate-only; always ✓ if driver loads)
  • mpv --hwdec=vaapi-copy --vo=gpu in a live session: decode correctness check
  • mpv --hwdec=vaapi --vo=gpu in a live session: surface-export check (the next iteration's target)
  • Firefox 150 live session with media.ffmpeg.vaapi.enabled: real-consumer test (currently blocked by the DMA-BUF bug)

Question 5: Architecture decisions worth questioning

A. Slice-header parser inside libva-v4l2-request vs upstream VAAPI lobbying

The in-fork parser is the right decision for this campaign's scope and timeline. The upstream VAAPI argument would be: "add idr_pic_id, pic_order_cnt_bit_size, dec_ref_pic_marking_bit_size to VASliceParameterBufferH264 and have the consumer (FFmpeg's decoder, Firefox's RDD, etc.) populate them." This is defensible on principle — VAAPI is already a parsed-parameter API, and these are parsed slice-header values.

But the practical problems with the upstream route: (1) VASliceParameterBufferH264 is a stable ABI; adding fields requires a VAAPI minor version bump and the cooperation of Intel, AMD, and ARM GPU driver maintainers; (2) every consumer that currently calls vaRenderPicture would need to be updated to populate the new fields; (3) bootlin/libva-v4l2-request is dormant — there's no upstream process to lobby; (4) the alternative backend (GStreamer's gstv4l2slh264dec) carries the equivalent parse logic itself.

The precedent from FFmpeg is instructive: FFmpeg's v4l2_request_h264.c uses values pre-parsed by its own H.264 decoder (H264SliceContext::ref_pic_marking_bit_size, pic_order_cnt_bit_size) — not from VAAPI. It works because FFmpeg controls both the decoder and the V4L2 backend. In a VAAPI backend, the V4L2-request backend doesn't have access to the decoder's internal state — it only has the VAAPI API surface. So the bit-parser in the backend is the only viable option absent an upstream VAAPI extension.

The parser adds approximately 250 lines that are self-contained, well-commented, and have empirical validation. The cost is low; the alternative is nonexistent on the relevant timeline.

B. DEBUG patches in production code

Patches 0010 (hex dump), 0011 (sentinel, now fixed), 0014 (VAPic dump), and the new 3F readback (VIDIOC_G_EXT_CTRLS), plus the rate-limited EACCES warning — these should stay until one condition is met: the surface-export bug is fixed and both mpv vaapi-copy and mpv vaapi (DMA-BUF) paths show the bunny in a live session.

The reason: the VIDIOC_G_EXT_CTRLS readback currently fails (EACCES) and that failure is not yet understood. "Not actionable from userspace" is a conclusion, but it's not a complete diagnosis. The readback bug suggests either: (a) hantro's kernel driver doesn't allow reading compound H.264 controls from a request fd that's in QUEUEING state (i.e., after MEDIA_REQUEST_IOC_QUEUE but before DQBUF); or (b) there's a permission check in the v4l2-core that blocks this. Until the surface-export bug is fixed and the system is working end-to-end, the DEBUG instrumentation is still load-bearing for diagnosis.

After the vaapi (DMA-BUF) path is confirmed working, a clean sweep of all DEBUG patches is appropriate before any "this is what we'd show Bootlin" snapshot. At that point: revert 0010, 0011, 0014, 3E, 3F, and the rate-limited EACCES warning. The utils.c tee to /tmp/libva-fourier.log should also be removed.

One caveat: the msync fix in 3E (cache-fix CAPTURE dump) is not debug-only if it reveals a real operational issue. But it operates on the mmap of a CAPTURE buffer, which is not used in the production decode path (the DMA-BUF export path exports the buffer fd directly without the CPU-side mmap). So removing 3E from production is safe.

C. FRAME_BASED hardcode vs FFmpeg's dynamic query

FFmpeg's v4l2_request_h264_post_frames_ctx queries V4L2_CID_STATELESS_H264_DECODE_MODE and V4L2_CID_STATELESS_H264_START_CODE dynamically. The fork hardcodes DECODE_MODE_FRAME_BASED and START_CODE_ANNEX_B in patch 0002.

For ohm RK3568 hantro G1/G2, FRAME_BASED is the only supported mode — the kernel driver defines hantro_g1_h264_dec_run and there is no slice-based equivalent in the verisilicon driver. The hardcode is functionally correct for this hardware.

The liability: if the fork is ever used on a device that supports SLICE_BASED mode (e.g., some Allwinner variants of hantro), the hardcode would send wrong controls and decode would fail silently. For the current campaign scope (ohm only, with fresnel and boltzmann as future iterations), this is a deferred risk that is explicitly documented.

The right time to make this dynamic is when fresnel (RK3399, hantro G1 without G2) is added as a target — at that point, query and store the mode at init time, then branch appropriately in h264_set_controls. The query helpers (v4l2_query_control_default_value) are already in the fork (commit 58a0e8b), so the migration is ~5 lines.


Question 6: Surface-export bug pre-attack reading

Based on my reading of src/surface.c::RequestExportSurfaceHandle and the arithmetic above:

The descriptor structure itself (num_objects, num_layers, num_planes, object_index) is correct for single-fd NV12. The pattern — 1 object, 1 layer, 2 planes, both pointing to object_index=0 with different offsets — is the standard way to export NV12 from a single DMA-BUF. FFmpeg's hwcontext_vaapi.c (and libva's own test suite) use exactly this structure.

The plane offsets are probably correct. UV offset = destination_sizes[0] = bytesperlines[0] * format_height = 1920 × 1088 = 2,088,960. This is where hantro G1 places the UV plane in its linear NV12 output (Y first, UV immediately following, DMV/internal state appended after in the oversized allocation). The export objects[0].size (3,133,440 = Y_size + UV_size, not the full 3,655,712 sizeimage) is undersized relative to the actual DMA-BUF allocation, but the offsets it describes fall within that 3,133,440 boundary, so Mesa validation should pass.

The most likely root cause is the colorspace/range metadata path, not the geometry. Here is the reasoning:

When mpv uses --hwdec=vaapi-copy, it calls vaMapBuffer which triggers a CPU copy of the decoded NV12 frame. mpv knows it's HD (1080p) content and applies BT.709 limited-range as the colorspace for the GL upload. This produces the correct green for all-zeros NV12 during Phase 0, and the bunny for real pixels after Phase 6.

When mpv uses --hwdec=vaapi (DMA-BUF path), it imports the DMA-BUF as an EGL image and creates a GL texture. The colorspace for the YCbCr→RGB conversion must come from somewhere. The VADRMPRIMESurfaceDescriptor has no colorspace field. mpv's DMA-BUF import path may default to a different colorspace (BT.601, or full-range vs limited-range) than what mpv's CPU path uses. That would explain the blue vs green color — both from all-zeros NV12, but different matrix.

But solid blue from real pixels (the post-Phase-6 Phase 7 result) cannot be explained by colorspace alone. If colorspace were the only issue, we'd see a color-shifted but recognizable bunny, not a solid field. Solid blue means the luma channel is being read as near-zero. This means either:

  1. The UV offset is wrong by enough that both Y and UV data are garbage — but the arithmetic says the offset is correct.
  2. The DMA-BUF fd being exported points to the wrong buffer (e.g., a buffer that hasn't been decoded yet, or the incorrect index).
  3. Mesa's EGL DMA-BUF import for single-object NV12 is failing silently and falling back to an error state that renders solid blue. Some Mesa versions require the two NV12 planes in two separate DMA-BUF fds (NV12M format) rather than a single fd with two offsets.

The most productive first debug step: add a request_log in RequestExportSurfaceHandle that logs surface_object->destination_index, the exported fd number, objects[0].size, layers[0].offset[0], layers[0].offset[1], layers[0].pitch[0], and layers[0].pitch[1]. Then run mpv --hwdec=vaapi with LIBVA_MESSAGING_LEVEL=2 and --msg-level=vd=v --msg-level=vo=v. The mpv output will show what format/modifier it's importing, and the libva log will show what we're exporting. Comparing the two will identify any mismatch.

The second thing to check in FFmpeg's hwcontext_v4l2request.c: look for how it handles the DMA-BUF export, specifically whether it exports via vaExportSurfaceHandle or via a direct VIDIOC_EXPBUF call, and whether it sets any EGL colorspace attributes on the image import. FFmpeg's v4l2_request path exports AV_PIX_FMT_DRM_PRIME frames — the descriptor building happens in libavutil/hwcontext_drm.c, not in the libva-v4l2-request backend. This is worth reading to see if FFmpeg sets AVDRMFrameDescriptor.nb_objects to 2 (NV12M-style) vs 1 (packed NV12 style) for hantro output.

What to look for in src/surface.c specifically before attacking: check whether format_height from v4l2_get_format is the MB-aligned height (1088) or the stream-signaled height (1080). If v4l2_get_format returns format_height=1080 for a 1080p stream, then destination_sizes[0] = 1920 × 1080 = 2,073,600, and the UV offset would be 2,073,600 instead of 2,088,960 — which is 15,360 bytes too early, putting the UV read 8 complete rows into the luma data. That would cause a severe corruption that could appear as near-solid color on large homogeneous regions. This is worth verifying with a request_log on format_width, format_height at the top of RequestCreateSurfaces2.


Question 7: Things not surfaced

7.1 The EACCES from VIDIOC_G_EXT_CTRLS on the request fd needs investigation before closing the campaign.

Commit 6be3f3b rate-limits this to "log once per process, not actionable." But it is actionable — it means the Tier 3F readback (the control-write verification mechanism) does not function on this rig. This could be a Linux 6.19.x v4l2-core behavior change. The investigation is: check whether the EACCES comes from the request fd being in QUEUEING state at the time of the read (post-MEDIA_REQUEST_IOC_QUEUE, pre-DQBUF). If so, move the readback to before MEDIA_REQUEST_IOC_QUEUE — which is actually the more useful timing anyway (verify the write before firing the request). The current code does readback after v4l2_set_controls but before MEDIA_REQUEST_IOC_QUEUE, which should be pre-QUEUE state. If EACCES still fires there, the issue is something else (permissions on the request fd, kernel check for compound controls specifically). This deserves one more probe before being called "not actionable."

7.2 The num_ref_idx_l0/l1_default_active_minus1 source is slightly wrong for multi-slice streams.

Patch d41a4b9 populates these from surface->params.h264.slice.num_ref_idx_l0_active_minus1 (the per-slice VASlice field). The comment in the patch says: "VAAPI doesn't expose the parsed-PPS default fields; the per-slice override is the closest available source (matches PPS default except on streams with explicit per-slice override)."

For single-slice FRAME_BASED streams this is fine. But for multi-slice streams: if slice N overrides num_ref_idx_l0_active with a per-slice value, but slice N+1 does not override it (falls back to the PPS default), the h264_slice_header_context passed to the bit-parser for slice N+1 will have the wrong default. This is only an issue in multi-slice FRAME_BASED mode, which hantro does support. For the BBB test clip it doesn't matter. For production use on arbitrary H.264 streams, this is worth noting — though fixing it would require the PPS-default value be recovered from the VAAPI pic_params (which doesn't expose num_ref_idx_l0_default_active_minus1 directly in VAPictureParameterBufferH264). The correct long-term fix is to parse it from the PPS or track it per-picture, not per-slice.

7.3 The static bool SET_FORMAT_OF_OUTPUT_ONCE global in surface.c is a time bomb.

bool SET_FORMAT_OF_OUTPUT_ONCE = false;

This is a global that prevents v4l2_set_format from being called on the OUTPUT queue more than once per process lifetime. If a consumer (Firefox, mpv) creates a new VA context for a different resolution (e.g., starts a second video in a tab), this flag would prevent re-setting the format. The result would be that the second video's OUTPUT format is wrong (still 1920×1088 from the first video), which would produce silent decode failure or corruption.

For Phase 1 boolean correctness with a single test clip, this doesn't matter. But it's a latent correctness bug that should be noted in a worklist item before the campaign closes.

7.4 The // HACK block in surface.c calling v4l2_set_format before vaCreateContext is architecturally wrong.

The comment says this is a workaround because v4l2_set_format must be called before any CREATE_BUFS. But the v4l2_set_format call here uses a hardcoded V4L2_PIX_FMT_H264_SLICE pixelformat and width/height from the surface dimensions. For a VAAPI backend that supports multiple codecs (H.264 + MPEG-2), this would fail for MPEG-2 surfaces. The correct architecture is to set the OUTPUT format in RequestCreateContext (which knows the codec) with proper sequencing before the first CREATE_BUFS. The WIP STREAMON-defer commit (44a7327) was an incomplete attempt at this. The HACK should be documented as a known architectural debt.

7.5 The campaign has not yet verified that Firefox falls back gracefully for non-IDR first frames.

Phase 7 shows Firefox attempted one frame (IDR, frame 0), got the DMA-BUF result, and fell back. The slice_header parse log shows only one parse entry. If Firefox ever sent a non-IDR first frame (e.g., seeking into the middle of a stream), and the bit-parser had an issue with non-IDR frames, the fallback would be the same SW path. This is not urgent for the current campaign but is worth noting as a future verification case.

7.6 The fourier_attribution cell A wheat verdict is now better understood but still slightly confused.

Cell A (chromium-fourier 149, browser_cpu_median=54%, fps=24.0) showed hardware decode happening, confirmed visually. Phase 7 established that chromium-fourier uses its own internal V4L2 stateless decoder, not libva-v4l2-request. So the "multi-planar libva is the binding decode-side enabler" claim in the fourier_attribution conclusion was wrong about the mechanism (it's Step 2's Chromium-internal decoder, not Step 1's libva backend, that enables HW decode for chromium-fourier). The 83 pp browser-CPU difference between Cell A and Cell B (Brave 1.89) is real, but the attribution is "Step 2 chromium internal V4L2 decoder" not "Step 1 libva backend."

This doesn't invalidate the cell A wheat verdict — HW decode IS happening for chromium-fourier — but it does mean Cell E (vanilla Chromium 149) disambiguation will be interesting: if vanilla Chromium 149 WITHOUT Step 2 also gets HW decode, it would come from its own V4L2 path (potentially). If it doesn't, then Step 2 specifically is what enables it, independent of any libva work. This disambiguation is useful for properly scoping what this campaign's libva deliverable buys for chromium users.


Summary prioritization

Load-bearing for the next Phase 4→6 cycle (surface-export fix):

  1. Add request_log to RequestExportSurfaceHandle to dump the exported descriptor before starting bisection.
  2. Log format_width and format_height from v4l2_get_format at the start of RequestCreateSurfaces2 — confirm format_height is 1088 (MB-aligned), not 1080 (stream height).
  3. Run mpv --hwdec=vaapi --msg-level=vd=v --msg-level=vo=v to see what mesa/mpv reports about the imported DRM format, modifier, and plane geometry.
  4. Check if the real UV offset in the hantro G1 buffer matches 1920×1088. If format_height from G_FMT is 1080 (not 1088), the UV offset is wrong by 15,360 bytes — this is the most likely structural bug.
  5. Read FFmpeg's hwcontext_drm.c for how it builds the DRM frame descriptor for NV12 from V4L2 buffers (specifically nb_objects and how UV plane offset is computed).

Nice-to-have (before campaign close):

  1. Investigate VIDIOC_G_EXT_CTRLS EACCES — move readback to before MEDIA_REQUEST_IOC_QUEUE if possible.
  2. Document the SET_FORMAT_OF_OUTPUT_ONCE global as a known limitation.
  3. Add MMCO 5 comment to the dec_ref_pic_marking loop.
  4. Add POC type 2 comment to the pic_order_cnt_bit_size measurement block.

Deferred to after campaign close:

  1. Dynamic DECODE_MODE query (for fresnel/boltzmann iterations).
  2. num_ref_idx from PPS instead of VASlice default (multi-slice correctness).
  3. VAAPI upstream lobbying for pic_order_cnt_bit_size field (future, different campaign).
  4. Cleanup of all DEBUG patches.

Critical Files for Implementation

  • /home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/surface.c
  • /home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/h264_slice_header.c
  • /home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/h264.c
  • /home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/video.c
  • /home/mfritsche/src/libva-multiplanar/references/linux-mainline/drivers/media/platform/verisilicon/hantro_g1_h264_dec.c