Reading v4l2-core/v4l2-ctrls-api.c and v4l2-ctrls-core.c on the cloned linux-pinetab2 v6.19.10-danctnix1 source: error_idx == count for S_EXT_CTRLS is intentional kernel obfuscation, not under-reporting. Line 629 deliberately overwrites error_idx with cs->count after validate_ctrls failures in set mode, forcing the caller to bail rather than partial-set. The escape hatch is VIDIOC_TRY_EXT_CTRLS, which "never modifies controls [so] error_idx is just set to whatever control has an invalid value" (quoting v4l2-ctrls-api.c:222-224). Path forward into Phase 4: amend Y2 instrumentation to retry with TRY_EXT_CTRLS on S_EXT_CTRLS EINVAL, extract the actual failing control index. From there, narrow the failing field by comparing frame-11 values against frames 1-10. Phase 3 baseline anchored from iter3 Phase 7 — same rig, same EINVAL, deterministic. No re-acquire needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Iteration 4 — Phase 2 (situation analysis: kernel V4L2 control validation)
Goal: identify which kernel layer rejects our 11th-frame `VIDIOC_S_EXT_CTRLS` and how to learn WHICH control is bad. iter3 surfaced the EINVAL with `error_idx == num_controls`, and iter4's Phase 1 lock points at this defect as the binding question.

Source: `linux-pinetab2` v6.19.10-danctnix1 cloned to `/build/linux-pinetab2/` on the boltzmann firefox-fourier container. Hantro driver lives at `drivers/media/platform/verisilicon/` (moved out of staging by 6.19). Generic V4L2 H.264 helpers at `drivers/media/v4l2-core/v4l2-h264.c`.
## Finding 1 — `error_idx == count` is intentional kernel obfuscation, not under-reporting
`v4l2-ctrls-api.c:629` (within `try_set_ext_ctrls_common`):

```c
ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, false);
if (!ret)
	ret = validate_ctrls(cs, helpers, vdev, set);
if (ret && set)
	cs->error_idx = cs->count;
```
If validation fails AND we're calling `S_EXT_CTRLS` (not `TRY_EXT_CTRLS`), the kernel **deliberately overwrites** `error_idx` with `count`, forcing the caller to bail cleanly rather than try to partially fix the bad control. The actual failing control is known to `validate_ctrls` (it set `error_idx = i` in its loop) but is hidden from the `S_EXT_CTRLS` caller.

Author comment at lines 209-225 (verbatim excerpt, emphasis added):

> "It is all fairly theoretical, though. In practice all you can do is to bail out. **If error_idx == count, then it is an application bug.** ... Note that these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."

So our iter3 Y2 diagnostic was logging exactly what the kernel intended us to see, which is useless for localizing the fault. The escape hatch is `VIDIOC_TRY_EXT_CTRLS`, which never modifies controls and DOES report the specific failing control.
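To make the masking rule concrete, here is a minimal user-space model of the control flow quoted above. It is a sketch, not kernel code: `model_ctrls`, `model_validate`, and `model_try_set` are hypothetical names, and each control is reduced to a single `valid` flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct v4l2_ext_controls: only the fields
 * needed to illustrate the error_idx rule. */
struct model_ctrls {
	unsigned int count;     /* number of controls in the batch */
	unsigned int error_idx; /* index reported back to the caller */
	const bool *valid;      /* valid[i] == false means control i is bad */
};

/* Models the behaviour described for validate_ctrls(): records the index
 * of the control being checked and fails with -EINVAL on the first bad one. */
int model_validate(struct model_ctrls *cs)
{
	assert(cs != NULL && cs->valid != NULL);
	for (unsigned int i = 0; i < cs->count; i++) {
		cs->error_idx = i;
		if (!cs->valid[i])
			return -EINVAL;
	}
	return 0;
}

/* Models the quoted snippet: in set mode a validation failure is
 * reported as error_idx == count, hiding the real index. */
int model_try_set(struct model_ctrls *cs, bool set)
{
	int ret = model_validate(cs);

	if (ret && set)
		cs->error_idx = cs->count;
	return ret;
}
```

For a batch of four controls where only index 2 is invalid, `model_try_set(&cs, true)` leaves `error_idx` at 4 (the count), while `model_try_set(&cs, false)` leaves it at 2. That is the asymmetry between the set path and the try path this finding relies on.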
## Finding 2 — H.264 stateless control validators (per-control)
`v4l2-ctrls-core.c::std_validate_compound()` validates each compound control individually. For our four controls, the per-control validators check the following (from `v4l2-ctrls-core.c:1031-1180`):

**SPS** rejects on:

- `profile_idc < 122 && chroma_format_idc > 1`
- `profile_idc < 244 && chroma_format_idc > 2`
- `chroma_format_idc > 3`
- `bit_depth_luma_minus8 > 6` or `bit_depth_chroma_minus8 > 6`
- `log2_max_frame_num_minus4 > 12`
- `pic_order_cnt_type > 2`
- `log2_max_pic_order_cnt_lsb_minus4 > 12`
- `max_num_ref_frames > V4L2_H264_REF_LIST_LEN`

**PPS** rejects on:

- `num_slice_groups_minus1 > 7`
- `num_ref_idx_l0_default_active_minus1 > (V4L2_H264_REF_LIST_LEN - 1)`

**DECODE_PARAMS** rejects on:

- `nal_ref_idc > 3`

**SCALING_MATRIX**: no reject path (just zero-pad).

Frames 1–10 pass these validators (they carry the same SPS/PPS/decode-mode constants). Frame 11 must violate one of them OR fail somewhere else (e.g. a request_fd state precondition or cluster-internal coupling).
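The SPS checks listed above can be mirrored in user space so a suspect frame can be screened before it reaches the kernel. This is a rough sketch under stated assumptions: `model_sps` and `model_sps_first_violation` are hypothetical names, the struct carries only the fields the listed checks read, and `MODEL_H264_REF_LIST_LEN` is assumed to equal the UAPI `V4L2_H264_REF_LIST_LEN` (verify against the target tree). It encodes only the conditions listed in this finding; any normalisation the kernel applies before checking is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: equals V4L2_H264_REF_LIST_LEN in the UAPI headers. */
#define MODEL_H264_REF_LIST_LEN 32

/* Hypothetical mirror of the SPS fields the listed checks look at. */
struct model_sps {
	uint8_t profile_idc;
	uint8_t chroma_format_idc;
	uint8_t bit_depth_luma_minus8;
	uint8_t bit_depth_chroma_minus8;
	uint8_t log2_max_frame_num_minus4;
	uint8_t pic_order_cnt_type;
	uint8_t log2_max_pic_order_cnt_lsb_minus4;
	uint8_t max_num_ref_frames;
};

/* Returns NULL if the SPS passes every listed check, otherwise a string
 * naming the first rule it violates (same order as the list above). */
const char *model_sps_first_violation(const struct model_sps *s)
{
	assert(s != NULL);
	if (s->profile_idc < 122 && s->chroma_format_idc > 1)
		return "profile_idc < 122 && chroma_format_idc > 1";
	if (s->profile_idc < 244 && s->chroma_format_idc > 2)
		return "profile_idc < 244 && chroma_format_idc > 2";
	if (s->chroma_format_idc > 3)
		return "chroma_format_idc > 3";
	if (s->bit_depth_luma_minus8 > 6)
		return "bit_depth_luma_minus8 > 6";
	if (s->bit_depth_chroma_minus8 > 6)
		return "bit_depth_chroma_minus8 > 6";
	if (s->log2_max_frame_num_minus4 > 12)
		return "log2_max_frame_num_minus4 > 12";
	if (s->pic_order_cnt_type > 2)
		return "pic_order_cnt_type > 2";
	if (s->log2_max_pic_order_cnt_lsb_minus4 > 12)
		return "log2_max_pic_order_cnt_lsb_minus4 > 12";
	if (s->max_num_ref_frames > MODEL_H264_REF_LIST_LEN)
		return "max_num_ref_frames > V4L2_H264_REF_LIST_LEN";
	return NULL;
}
```

Running this on the frame-10 and frame-11 SPS payloads would show quickly whether the SPS is a plausible culprit. A NULL result for both pushes suspicion toward the other controls or toward a non-validator failure path.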
## Finding 3 — Hantro driver-level validation runs *after* `S_EXT_CTRLS`
`hantro_h264.c::hantro_h264_dec_prepare_run()` is called from the V4L2 m2m worker, post-`MEDIA_REQUEST_IOC_QUEUE`. Its EINVAL paths (lines 449/454/459/464) are NULL-checks for missing controls: they fire if the request didn't carry one of the four required H.264 controls. Our EINVAL comes back from the `VIDIOC_S_EXT_CTRLS` ioctl itself, before the request is queued, so this is NOT where the iter1+2+3 carryover failure originates.

Conclusion: the failure is in the V4L2 control-handler `validate_ctrls` for one of our four compound controls. The per-control validators (Finding 2) don't seem to fit, since frames 1–10 use the same constants, but we cannot rule them out while `error_idx` is hidden by the obfuscation rule.
## Finding 4 — `VIDIOC_TRY_EXT_CTRLS` is the diagnostic escape hatch
Per the kernel comment, `TRY_EXT_CTRLS` ALWAYS reports the specific failing control's index in `error_idx`, even where `S_EXT_CTRLS` (set=true) would have hidden it. Path forward:

1. Amend our driver's Y2 instrumentation: on `VIDIOC_S_EXT_CTRLS` returning -EINVAL with `error_idx == num_controls`, **retry the same controls via `VIDIOC_TRY_EXT_CTRLS`** to extract the real failing index.
2. Log the precise failing control + its raw fields.
3. From that, identify the specific field-value combination on frame 11 that the per-control validator rejects.

This is a small change in v4l2.c: add another `ioctl(video_fd, VIDIOC_TRY_EXT_CTRLS, &controls)` call inside the existing EINVAL diagnostic block, then log `controls.error_idx` from the TRY result.
## Implication for Phase 4
Phase 4's plan needs amendment: before any driver-side fix can be authored, we need the precise failing control + field. Steps:

1. Y2 v3: add TRY_EXT_CTRLS retry on S_EXT_CTRLS EINVAL.
2. Rebuild driver on ohm (~30 sec via meson+ninja).
3. Re-run autonomous Phase 7 (`/tmp/run_phase7_v2.sh`).
4. Read the new diagnostic, which should now name the specific control.
5. Compare frame-11 field values vs frames 1–10 to localize the bad field.
6. Fix in driver. Most likely surfaces are: an SPS field that wasn't properly latched after IDR, a PPS field that flipped at frame 5, or a DECODE_PARAMS DPB entry with malformed `frame_num`/`pic_num`/`flags`.

Phase 3 (baseline anchor) is satisfied by iter3's Phase 7 run: same rig, same EINVAL, deterministic.
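For step 5, one low-effort way to localize the changed field is a byte-wise diff of the failing control's payload between a passing frame and frame 11. The helper below is a hypothetical sketch (`diff_payload` is not an existing function in the driver). It assumes both payloads are raw copies of the same control struct and therefore the same size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Compares two raw control payloads of equal size and logs every byte
 * offset where they differ. Returns the number of differing bytes.
 * `name` is only used to label the log lines.
 */
size_t diff_payload(const char *name, const void *ref, const void *cur,
		    size_t size)
{
	const unsigned char *a = ref;
	const unsigned char *b = cur;
	size_t ndiff = 0;

	assert(name != NULL && ref != NULL && cur != NULL);
	for (size_t off = 0; off < size; off++) {
		if (a[off] != b[off]) {
			fprintf(stderr, "%s: offset %zu: 0x%02x -> 0x%02x\n",
				name, off, a[off], b[off]);
			ndiff++;
		}
	}
	return ndiff;
}
```

The reported offsets can be mapped back to field names with `offsetof()` on the corresponding UAPI struct. A DPB-heavy control such as DECODE_PARAMS is expected to differ on every frame, so the diff is most informative for SPS and PPS, where frames 1–10 are expected to be identical.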
## Stop point
Phase 2 closed. Next: Phase 4 step 1 (Y2 v3 with TRY_EXT_CTRLS retry diagnostic). Phase 3 anchored from iter3. Per the "Stop only if user is needed" rule, no user input is required to proceed, but this is a read→speculate→test cycle that may take several iterations.