libva-multiplanar/phase2_iter4_situation.md
marfrit ec277eab57 Iteration 4 Phase 2: kernel control validation analysis
Reading v4l2-core/v4l2-ctrls-api.c and v4l2-ctrls-core.c on the cloned
linux-pinetab2 v6.19.10-danctnix1 source: error_idx == count for
S_EXT_CTRLS is intentional kernel obfuscation, not under-reporting.
Line 629 deliberately overwrites error_idx with cs->count after
validate_ctrls failures in set mode, forcing the caller to bail rather
than partial-set.

The escape hatch is VIDIOC_TRY_EXT_CTRLS, which "never modifies controls
[so] error_idx is just set to whatever control has an invalid value"
(quoting v4l2-ctrls-api.c:222-224).

Path forward into Phase 4: amend Y2 instrumentation to retry with
TRY_EXT_CTRLS on S_EXT_CTRLS EINVAL, extract the actual failing control
index. From there, narrow the failing field by comparing frame-11
values against frames 1-10.

Phase 3 baseline anchored from iter3 Phase 7 — same rig, same EINVAL,
deterministic. No re-acquire needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:21:40 +00:00


Iteration 4 — Phase 2 (situation analysis: kernel V4L2 control validation)

Goal: identify which kernel layer rejects our 11th-frame VIDIOC_S_EXT_CTRLS and how to learn which control is bad. iter3 surfaced the EINVAL with error_idx == num_controls, and iter4's Phase 1 lock points at this defect as the binding question.

Source: linux-pinetab2 v6.19.10-danctnix1 cloned to /build/linux-pinetab2/ on the boltzmann firefox-fourier container. Hantro driver lives at drivers/media/platform/verisilicon/ (moved out of staging by 6.19). Generic V4L2 H.264 helpers at drivers/media/v4l2-core/v4l2-h264.c.

Finding 1 — error_idx == count is intentional kernel obfuscation, not under-reporting

v4l2-ctrls-api.c:629 (within try_set_ext_ctrls_common):

ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, false);
if (!ret)
    ret = validate_ctrls(cs, helpers, vdev, set);
if (ret && set)
    cs->error_idx = cs->count;

If validation fails AND we're calling S_EXT_CTRLS (not TRY_EXT_CTRLS), the kernel deliberately overwrites error_idx with count, forcing the caller to bail cleanly rather than try to partially fix the bad control. The actual failing control is known to validate_ctrls (it set error_idx = i in its loop) but is hidden from the S_EXT caller.

Author comment at lines 209-225 (verbatim):

"It is all fairly theoretical, though. In practice all you can do is to bail out. If error_idx == count, then it is an application bug. ... Note that these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."

So our iter3 Y2 diagnostic was logging exactly what the kernel intended — useless. The escape hatch is VIDIOC_TRY_EXT_CTRLS, which never modifies controls and DOES report the specific failing control.
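The masking rule in Finding 1 can be modeled in userspace to make the difference between the two ioctls concrete. This is only a sketch of the behavior described above, not kernel code: `struct model_ctrls`, `model_validate` and `model_try_set` are hypothetical names, and the `valid[]` array stands in for whatever the real per-control validators decide.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a control set: valid[i] is false when
 * control i would be rejected by its validator. */
struct model_ctrls {
    unsigned count;
    unsigned error_idx;
    const bool *valid;
};

/* Models the validation loop: error_idx tracks the control being
 * checked, so on failure it names the first bad control. */
static int model_validate(struct model_ctrls *cs)
{
    for (unsigned i = 0; i < cs->count; i++) {
        cs->error_idx = i;
        if (!cs->valid[i])
            return -EINVAL;
    }
    return 0;
}

/* Models the set/try split: in set mode a validation failure is
 * reported as error_idx == count, hiding the real index. */
static int model_try_set(struct model_ctrls *cs, bool set)
{
    int ret = model_validate(cs);

    if (ret && set)
        cs->error_idx = cs->count;
    return ret;
}
```

With four controls where index 2 is invalid, `model_try_set(&cs, true)` leaves `error_idx` at 4 (the S_EXT_CTRLS view), while `model_try_set(&cs, false)` leaves it at 2 (the TRY_EXT_CTRLS view).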

Finding 2 — H.264 stateless control validators (per-control)

v4l2-ctrls-core.c::std_validate_compound() validates each compound control individually. For our four controls, the per-control validators check (from v4l2-ctrls-core.c:1031-1180):

SPS rejects on:

  • profile_idc < 122 && chroma_format_idc > 1
  • profile_idc < 244 && chroma_format_idc > 2
  • chroma_format_idc > 3
  • bit_depth_luma_minus8 > 6 or bit_depth_chroma_minus8 > 6
  • log2_max_frame_num_minus4 > 12
  • pic_order_cnt_type > 2
  • log2_max_pic_order_cnt_lsb_minus4 > 12
  • max_num_ref_frames > V4L2_H264_REF_LIST_LEN

PPS rejects on:

  • num_slice_groups_minus1 > 7
  • num_ref_idx_l0_default_active_minus1 > (V4L2_H264_REF_LIST_LEN - 1)

DECODE_PARAMS rejects on:

  • nal_ref_idc > 3

SCALING_MATRIX: no reject path (just zero-pad).

Frames 1-10 pass these validators (they're the same SPS/PPS/decode-mode constants). Frame 11 must violate one of them OR fail somewhere else (e.g. a request_fd state precondition, cluster-internal coupling).
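The reject conditions listed above could also be checked in userspace before submission, which would let the driver log a suspect field without a kernel round trip. The following is a minimal sketch under stated assumptions: `struct sps_subset`, `sps_reject_reason` and `pps_reject_reason` are hypothetical names, the struct carries only the fields named in the list, and `REF_LIST_LEN` is assumed to equal the kernel's `V4L2_H264_REF_LIST_LEN`. It covers only the conditions listed in this finding and does not model anything else the kernel may do to the payload first.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as V4L2_H264_REF_LIST_LEN in the UAPI headers. */
#define REF_LIST_LEN 32

/* Hypothetical subset of the SPS control fields named in the list above. */
struct sps_subset {
    uint8_t profile_idc;
    uint8_t chroma_format_idc;
    uint8_t bit_depth_luma_minus8;
    uint8_t bit_depth_chroma_minus8;
    uint8_t log2_max_frame_num_minus4;
    uint8_t pic_order_cnt_type;
    uint8_t log2_max_pic_order_cnt_lsb_minus4;
    uint8_t max_num_ref_frames;
};

/* Returns the first violated SPS rule from the list, or NULL if none. */
static const char *sps_reject_reason(const struct sps_subset *s)
{
    if (s->profile_idc < 122 && s->chroma_format_idc > 1)
        return "profile_idc < 122 && chroma_format_idc > 1";
    if (s->profile_idc < 244 && s->chroma_format_idc > 2)
        return "profile_idc < 244 && chroma_format_idc > 2";
    if (s->chroma_format_idc > 3)
        return "chroma_format_idc > 3";
    if (s->bit_depth_luma_minus8 > 6 || s->bit_depth_chroma_minus8 > 6)
        return "bit_depth_*_minus8 > 6";
    if (s->log2_max_frame_num_minus4 > 12)
        return "log2_max_frame_num_minus4 > 12";
    if (s->pic_order_cnt_type > 2)
        return "pic_order_cnt_type > 2";
    if (s->log2_max_pic_order_cnt_lsb_minus4 > 12)
        return "log2_max_pic_order_cnt_lsb_minus4 > 12";
    if (s->max_num_ref_frames > REF_LIST_LEN)
        return "max_num_ref_frames > REF_LIST_LEN";
    return NULL;
}

/* Returns the first violated PPS rule from the list, or NULL if none. */
static const char *pps_reject_reason(uint8_t num_slice_groups_minus1,
                                     uint8_t num_ref_idx_l0_default_active_minus1)
{
    if (num_slice_groups_minus1 > 7)
        return "num_slice_groups_minus1 > 7";
    if (num_ref_idx_l0_default_active_minus1 > REF_LIST_LEN - 1)
        return "num_ref_idx_l0_default_active_minus1 > REF_LIST_LEN - 1";
    return NULL;
}
```

Running both helpers on the frame-10 and frame-11 payloads would show quickly whether any of the listed rules distinguishes the two frames. If neither reports a violation, the rejection likely comes from a condition outside this list.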

Finding 3 — Hantro driver-level validation runs after S_EXT_CTRLS

hantro_h264.c::hantro_h264_dec_prepare_run() is called from the V4L2 m2m worker, post-MEDIA_REQUEST_IOC_QUEUE. Its EINVAL paths (lines 449/454/459/464) are NULL-checks for missing controls — they fire if the request didn't carry one of the four required H.264 controls. NOT the iter1+2+3 carryover path.

Conclusion: the failure is in the V4L2 control-handler validate_ctrls for one of our four compound controls. Per-control validators (Finding 2) don't seem to fit, since frames 1-10 use the same constants, but they cannot be ruled out yet because error_idx was hidden by the obfuscation rule.

Finding 4 — VIDIOC_TRY_EXT_CTRLS is the diagnostic escape hatch

Per the kernel comment, TRY_EXT_CTRLS ALWAYS reports the specific failing control's index in error_idx, even when set=true would have hidden it. Path forward:

  1. Amend our driver's Y2 instrumentation: on VIDIOC_S_EXT_CTRLS returning -EINVAL with error_idx == num_controls, retry the same controls via VIDIOC_TRY_EXT_CTRLS to extract the real failing index.
  2. Log the precise failing control + its raw fields.
  3. From that, identify the specific field-value combination on frame 11 that the per-control validator rejects.

This is a one-liner in v4l2.c: add another ioctl(video_fd, VIDIOC_TRY_EXT_CTRLS, &controls) call inside the existing EINVAL diagnostic block, then log controls.error_idx from the TRY result.

Implication for Phase 4

Phase 4's plan needs amendment: before any driver-side fix can be authored, we need the precise failing control + field. Steps:

  1. Y2 v3: add TRY_EXT_CTRLS retry on S_EXT_CTRLS EINVAL.
  2. Rebuild driver on ohm (~30 sec via meson+ninja).
  3. Re-run autonomous Phase 7 (/tmp/run_phase7_v2.sh).
  4. Read the new diagnostic — now naming the specific control.
  5. Compare frame-11 field values vs frames 1-10 to localize the bad field.
  6. Fix in driver. Most likely surfaces are: SPS field that wasn't properly latched after IDR, PPS field that flipped at frame 5, DECODE_PARAMS DPB entry with malformed frame_num/pic_num/flags.
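For step 5, one low-effort way to compare a frame-11 payload against the last good one is a byte-level diff of the raw control structs. The sketch below assumes the driver keeps a copy of the previous frame's payload for each control. `diff_ctrl_payload` is a hypothetical helper name, not something that exists in v4l2.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Logs every byte offset at which two equally sized control payloads
 * differ and returns the number of differing bytes. */
static size_t diff_ctrl_payload(const char *name, const void *prev,
                                const void *cur, size_t size)
{
    const unsigned char *a = prev;
    const unsigned char *b = cur;
    size_t ndiff = 0;

    for (size_t off = 0; off < size; off++) {
        if (a[off] != b[off]) {
            fprintf(stderr, "%s: offset %zu: 0x%02x -> 0x%02x\n",
                    name, off, a[off], b[off]);
            ndiff++;
        }
    }
    return ndiff;
}
```

The reported offsets can be mapped back to field names with `offsetof()` on the corresponding control struct. For DECODE_PARAMS, some per-frame differences are expected, so the diff narrows the candidates and does not identify the culprit by itself.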

Phase 3 (baseline anchor) is satisfied by iter3's Phase 7 run — same rig, same EINVAL, deterministic.

Stop point

Phase 2 closed. Next: Phase 4 step 1 (Y2 v3 with TRY_EXT_CTRLS retry diagnostic). Phase 3 anchored from iter3. Per the "Stop only if user is needed" rule, no user input required to proceed — but this is a cycle of read→speculate→test that may take several iterations.