From ec277eab5715367aa8da270bfac6dfa35f3e856d Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Tue, 5 May 2026 13:21:40 +0000 Subject: [PATCH] Iteration 4 Phase 2: kernel control validation analysis MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reading v4l2-core/v4l2-ctrls-api.c and v4l2-ctrls-core.c on the cloned linux-pinetab2 v6.19.10-danctnix1 source: error_idx == count for S_EXT_CTRLS is intentional kernel obfuscation, not under-reporting. Line 629 deliberately overwrites error_idx with cs->count after validate_ctrls failures in set mode, forcing the caller to bail rather than partial-set. The escape hatch is VIDIOC_TRY_EXT_CTRLS, which "never modifies controls [so] error_idx is just set to whatever control has an invalid value" (quoting v4l2-ctrls-api.c:222-224). Path forward into Phase 4: amend Y2 instrumentation to retry with TRY_EXT_CTRLS on S_EXT_CTRLS EINVAL, extract the actual failing control index. From there, narrow the failing field by comparing frame-11 values against frames 1-10. Phase 3 baseline anchored from iter3 Phase 7 — same rig, same EINVAL, deterministic. No re-acquire needed. Co-Authored-By: Claude Opus 4.7 (1M context) --- phase2_iter4_situation.md | 83 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 83 insertions(+) create mode 100644 phase2_iter4_situation.md diff --git a/phase2_iter4_situation.md b/phase2_iter4_situation.md new file mode 100644 index 0000000..7e12b9d --- /dev/null +++ b/phase2_iter4_situation.md @@ -0,0 +1,83 @@ +# Iteration 4 — Phase 2 (situation analysis: kernel V4L2 control validation) + +Goal: identify what kernel layer rejects our 11th-frame `VIDIOC_S_EXT_CTRLS` and how to learn WHICH control is bad. iter3 surfaced the EINVAL with `error_idx == num_controls`, and iter4's Phase 1 lock points at this defect as the binding question. + +Source: `linux-pinetab2` v6.19.10-danctnix1 cloned to `/build/linux-pinetab2/` on the boltzmann firefox-fourier container. Hantro driver lives at `drivers/media/platform/verisilicon/` (moved out of staging by 6.19). Generic V4L2 H.264 helpers at `drivers/media/v4l2-core/v4l2-h264.c`. + +## Finding 1 — `error_idx == count` is intentional kernel obfuscation, not under-reporting + +`v4l2-ctrls-api.c:629` (within `try_set_ext_ctrls_common`): + +```c +ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, false); +if (!ret) + ret = validate_ctrls(cs, helpers, vdev, set); +if (ret && set) + cs->error_idx = cs->count; +``` + +If validation fails AND we're calling `S_EXT_CTRLS` (not `TRY_EXT_CTRLS`), the kernel **deliberately overwrites** `error_idx` with `count`, forcing the caller to bail cleanly rather than try to partially fix the bad control. The actual failing control is known to `validate_ctrls` (it set `error_idx = i` in its loop) but is hidden from the S_EXT caller. + +Author comment at lines 209-225 (verbatim): + +> "It is all fairly theoretical, though. In practice all you can do is to bail out. **If error_idx == count, then it is an application bug.** ... Note that these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value." + +So our iter3 Y2 diagnostic was logging exactly what the kernel intended — useless. The escape hatch is `VIDIOC_TRY_EXT_CTRLS`, which never modifies controls and DOES report the specific failing control. + +## Finding 2 — H.264 stateless control validators (per-control) + +`v4l2-ctrls-core.c::std_validate_compound()` validates each compound control individually. For our four controls, the per-control validators check (from `v4l2-ctrls-core.c:1031-1180`): + +**SPS** rejects on: +- `profile_idc < 122 && chroma_format_idc > 1` +- `profile_idc < 244 && chroma_format_idc > 2` +- `chroma_format_idc > 3` +- `bit_depth_luma_minus8 > 6` or `bit_depth_chroma_minus8 > 6` +- `log2_max_frame_num_minus4 > 12` +- `pic_order_cnt_type > 2` +- `log2_max_pic_order_cnt_lsb_minus4 > 12` +- `max_num_ref_frames > V4L2_H264_REF_LIST_LEN` + +**PPS** rejects on: +- `num_slice_groups_minus1 > 7` +- `num_ref_idx_l0_default_active_minus1 > (V4L2_H264_REF_LIST_LEN - 1)` + +**DECODE_PARAMS** rejects on: +- `nal_ref_idc > 3` + +**SCALING_MATRIX**: no reject path (just zero-pad). + +Frames 1–10 pass these validators (they're the same SPS/PPS/decode-mode constants). Frame 11 must violate one of them OR fail somewhere else (e.g. a request_fd state precondition, cluster-internal coupling). + +## Finding 3 — Hantro driver-level validation runs *after* `S_EXT_CTRLS` + +`hantro_h264.c::hantro_h264_dec_prepare_run()` is called from the V4L2 m2m worker, post-`MEDIA_REQUEST_IOC_QUEUE`. Its EINVAL paths (lines 449/454/459/464) are NULL-checks for missing controls — they fire if the request didn't carry one of the four required H.264 controls. NOT the iter1+2+3 carryover path. + +Conclusion: the failure is in the V4L2 control-handler `validate_ctrls` for one of our four compound controls. Per-control validators (Finding 2) don't seem to fit since fields 1–10 use the same constants, but `error_idx` was hidden by the obfuscation rule. + +## Finding 4 — `VIDIOC_TRY_EXT_CTRLS` is the diagnostic escape hatch + +Per the kernel comment, `TRY_EXT_CTRLS` ALWAYS reports the specific failing control's index in `error_idx`, even when set=true would have hidden it. Path forward: + +1. Amend our driver's Y2 instrumentation: on `VIDIOC_S_EXT_CTRLS` returning -EINVAL with `error_idx == num_controls`, **retry the same controls via `VIDIOC_TRY_EXT_CTRLS`** to extract the real failing index. +2. Log the precise failing control + its raw fields. +3. From that, identify the specific field-value combination on frame 11 that the per-control validator rejects. + +This is a one-liner in v4l2.c: add another `ioctl(video_fd, VIDIOC_TRY_EXT_CTRLS, &controls)` call inside the existing EINVAL diagnostic block, then log `controls.error_idx` from the TRY result. + +## Implication for Phase 4 + +Phase 4's plan needs amendment: before any driver-side fix can be authored, we need the precise failing control + field. Steps: + +1. Y2 v3: add TRY_EXT_CTRLS retry on S_EXT_CTRLS EINVAL. +2. Rebuild driver on ohm (~30 sec via meson+ninja). +3. Re-run autonomous Phase 7 (`/tmp/run_phase7_v2.sh`). +4. Read the new diagnostic — now naming the specific control. +5. Compare frame-11 field values vs frames 1–10 to localize the bad field. +6. Fix in driver. Most likely surfaces are: SPS field that wasn't properly latched after IDR, PPS field that flipped at frame 5, DECODE_PARAMS DPB entry with malformed `frame_num`/`pic_num`/`flags`. + +Phase 3 (baseline anchor) is satisfied by iter3's Phase 7 run — same rig, same EINVAL, deterministic. + +## Stop point + +Phase 2 closed. Next: Phase 4 step 1 (Y2 v3 with TRY_EXT_CTRLS retry diagnostic). Phase 3 anchored from iter3. Per the "Stop only if user is needed" rule, no user input required to proceed — but this is a cycle of read→speculate→test that may take several iterations.