libva-multiplanar/phase2_iter4_situation.md
marfrit ec277eab57 Iteration 4 Phase 2: kernel control validation analysis
Reading v4l2-core/v4l2-ctrls-api.c and v4l2-ctrls-core.c on the cloned
linux-pinetab2 v6.19.10-danctnix1 source: error_idx == count for
S_EXT_CTRLS is intentional kernel obfuscation, not under-reporting.
Line 629 deliberately overwrites error_idx with cs->count after
validate_ctrls failures in set mode, forcing the caller to bail rather
than partial-set.

The escape hatch is VIDIOC_TRY_EXT_CTRLS, which "never modifies controls
[so] error_idx is just set to whatever control has an invalid value"
(quoting v4l2-ctrls-api.c:222-224).

Path forward into Phase 4: amend Y2 instrumentation to retry with
TRY_EXT_CTRLS on S_EXT_CTRLS EINVAL, extract the actual failing control
index. From there, narrow the failing field by comparing frame-11
values against frames 1-10.

Phase 3 baseline anchored from iter3 Phase 7 — same rig, same EINVAL,
deterministic. No re-acquire needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:21:40 +00:00

# Iteration 4 — Phase 2 (situation analysis: kernel V4L2 control validation)
Goal: identify what kernel layer rejects our 11th-frame `VIDIOC_S_EXT_CTRLS` and how to learn WHICH control is bad. iter3 surfaced the EINVAL with `error_idx == num_controls`, and iter4's Phase 1 lock points at this defect as the binding question.
Source: `linux-pinetab2` v6.19.10-danctnix1 cloned to `/build/linux-pinetab2/` on the boltzmann firefox-fourier container. Hantro driver lives at `drivers/media/platform/verisilicon/` (moved out of staging by 6.19). Generic V4L2 H.264 helpers at `drivers/media/v4l2-core/v4l2-h264.c`.
## Finding 1 — `error_idx == count` is intentional kernel obfuscation, not under-reporting
`v4l2-ctrls-api.c:629` (within `try_set_ext_ctrls_common`):
```c
ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, false);
if (!ret)
    ret = validate_ctrls(cs, helpers, vdev, set);
if (ret && set)
    cs->error_idx = cs->count;
```
If validation fails and the call is `S_EXT_CTRLS` (not `TRY_EXT_CTRLS`), the kernel **deliberately overwrites** `error_idx` with `count`, forcing the caller to bail out cleanly rather than try to partially fix the bad control. `validate_ctrls` knows the actual failing control (it sets `error_idx = i` in its loop), but that index is hidden from the `S_EXT_CTRLS` caller.
Author comment at lines 209-225 (verbatim):
> "It is all fairly theoretical, though. In practice all you can do is to bail out. **If error_idx == count, then it is an application bug.** ... Note that these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."
So our iter3 Y2 diagnostic was logging exactly what the kernel intended, which says nothing about the failing control. The escape hatch is `VIDIOC_TRY_EXT_CTRLS`, which never modifies controls and DOES report the specific failing control.
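To make the masking rule concrete, here is a toy userspace model of the two steps in the excerpt above. It is not kernel code: `model_ctrls`, `model_validate` and `model_try_set` are hypothetical names, and the per-control validity is passed in as a plain boolean array rather than computed from real control payloads.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a control set: only what the model needs. */
struct model_ctrls {
    size_t count;     /* number of controls in the call */
    size_t error_idx; /* index reported back to the caller */
};

/* Models the validation loop: error_idx tracks the control being checked. */
static int model_validate(struct model_ctrls *cs, const bool *is_valid)
{
    for (size_t i = 0; i < cs->count; i++) {
        cs->error_idx = i;
        if (!is_valid[i])
            return -EINVAL;
    }
    return 0;
}

/* Models the masking step: in set mode the real index is overwritten. */
static int model_try_set(struct model_ctrls *cs, const bool *is_valid, bool set)
{
    int ret = model_validate(cs, is_valid);

    if (ret && set)
        cs->error_idx = cs->count;
    return ret;
}
```

With four controls where the third is invalid, the model returns `-EINVAL` in both modes, but reports `error_idx == 4` in set mode and `error_idx == 2` in try mode. That matches the pattern the iter3 logs showed for the set path.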
## Finding 2 — H.264 stateless control validators (per-control)
`v4l2-ctrls-core.c::std_validate_compound()` validates each compound control individually. For our four controls, the per-control validators check (from `v4l2-ctrls-core.c:1031-1180`):
**SPS** rejects on:
- `profile_idc < 122 && chroma_format_idc > 1`
- `profile_idc < 244 && chroma_format_idc > 2`
- `chroma_format_idc > 3`
- `bit_depth_luma_minus8 > 6` or `bit_depth_chroma_minus8 > 6`
- `log2_max_frame_num_minus4 > 12`
- `pic_order_cnt_type > 2`
- `log2_max_pic_order_cnt_lsb_minus4 > 12`
- `max_num_ref_frames > V4L2_H264_REF_LIST_LEN`
**PPS** rejects on:
- `num_slice_groups_minus1 > 7`
- `num_ref_idx_l0_default_active_minus1 > (V4L2_H264_REF_LIST_LEN - 1)`
**DECODE_PARAMS** rejects on:
- `nal_ref_idc > 3`
**SCALING_MATRIX**: no reject path (just zero-pad).
Frames 1-10 pass these validators (they carry the same SPS/PPS/decode-mode constants). Frame 11 must either violate one of them or fail somewhere else (e.g. a `request_fd` state precondition or cluster-internal coupling).
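The SPS and PPS rules listed above can be mirrored in userspace so the driver can check its own payload before submitting it. The following is a minimal sketch under stated assumptions: `sketch_sps` and `sketch_pps` are hypothetical structs holding only the fields named in the lists (a real driver would read them from the uAPI control structs), and `SKETCH_REF_LIST_LEN` is assumed to equal `V4L2_H264_REF_LIST_LEN`. It covers only the rules quoted in this note, not everything the kernel may check.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to match V4L2_H264_REF_LIST_LEN in the uAPI header (2 * 16 DPB entries). */
#define SKETCH_REF_LIST_LEN 32

/* Hypothetical mirror of the SPS fields named in the list above. */
struct sketch_sps {
    uint8_t profile_idc;
    uint8_t chroma_format_idc;
    uint8_t bit_depth_luma_minus8;
    uint8_t bit_depth_chroma_minus8;
    uint8_t log2_max_frame_num_minus4;
    uint8_t pic_order_cnt_type;
    uint8_t log2_max_pic_order_cnt_lsb_minus4;
    uint8_t max_num_ref_frames;
};

/* Hypothetical mirror of the PPS fields named in the list above. */
struct sketch_pps {
    uint8_t num_slice_groups_minus1;
    uint8_t num_ref_idx_l0_default_active_minus1;
};

/* Returns NULL when every listed SPS rule passes, else the first violated rule. */
static const char *sketch_sps_violation(const struct sketch_sps *s)
{
    if (s->profile_idc < 122 && s->chroma_format_idc > 1)
        return "profile_idc < 122 && chroma_format_idc > 1";
    if (s->profile_idc < 244 && s->chroma_format_idc > 2)
        return "profile_idc < 244 && chroma_format_idc > 2";
    if (s->chroma_format_idc > 3)
        return "chroma_format_idc > 3";
    if (s->bit_depth_luma_minus8 > 6)
        return "bit_depth_luma_minus8 > 6";
    if (s->bit_depth_chroma_minus8 > 6)
        return "bit_depth_chroma_minus8 > 6";
    if (s->log2_max_frame_num_minus4 > 12)
        return "log2_max_frame_num_minus4 > 12";
    if (s->pic_order_cnt_type > 2)
        return "pic_order_cnt_type > 2";
    if (s->log2_max_pic_order_cnt_lsb_minus4 > 12)
        return "log2_max_pic_order_cnt_lsb_minus4 > 12";
    if (s->max_num_ref_frames > SKETCH_REF_LIST_LEN)
        return "max_num_ref_frames > V4L2_H264_REF_LIST_LEN";
    return NULL;
}

/* Returns NULL when every listed PPS rule passes, else the first violated rule. */
static const char *sketch_pps_violation(const struct sketch_pps *p)
{
    if (p->num_slice_groups_minus1 > 7)
        return "num_slice_groups_minus1 > 7";
    if (p->num_ref_idx_l0_default_active_minus1 > SKETCH_REF_LIST_LEN - 1)
        return "num_ref_idx_l0_default_active_minus1 > V4L2_H264_REF_LIST_LEN - 1";
    return NULL;
}
```

Running such a check on the frame-11 payload just before the ioctl would show quickly whether any of the listed rules is the culprit, independent of what the kernel reports.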
## Finding 3 — Hantro driver-level validation runs *after* `S_EXT_CTRLS`
`hantro_h264.c::hantro_h264_dec_prepare_run()` is called from the V4L2 m2m worker, post-`MEDIA_REQUEST_IOC_QUEUE`. Its EINVAL paths (lines 449/454/459/464) are NULL-checks for missing controls — they fire if the request didn't carry one of the four required H.264 controls. NOT the iter1+2+3 carryover path.
Conclusion: the failure is in the V4L2 control-handler `validate_ctrls` for one of our four compound controls. Per-control validators (Finding 2) don't seem to fit since frames 1-10 use the same constants, but the real `error_idx` was hidden by the obfuscation rule.
## Finding 4 — `VIDIOC_TRY_EXT_CTRLS` is the diagnostic escape hatch
Per the kernel comment, `TRY_EXT_CTRLS` ALWAYS reports the specific failing control's index in `error_idx`, even when set=true would have hidden it. Path forward:
1. Amend our driver's Y2 instrumentation: on `VIDIOC_S_EXT_CTRLS` returning -EINVAL with `error_idx == num_controls`, **retry the same controls via `VIDIOC_TRY_EXT_CTRLS`** to extract the real failing index.
2. Log the precise failing control + its raw fields.
3. From that, identify the specific field-value combination on frame 11 that the per-control validator rejects.
This is a one-liner in v4l2.c: add another `ioctl(video_fd, VIDIOC_TRY_EXT_CTRLS, &controls)` call inside the existing EINVAL diagnostic block, then log `controls.error_idx` from the TRY result.
## Implication for Phase 4
Phase 4's plan needs amendment: before any driver-side fix can be authored, we need the precise failing control + field. Steps:
1. Y2 v3: add TRY_EXT_CTRLS retry on S_EXT_CTRLS EINVAL.
2. Rebuild driver on ohm (~30 sec via meson+ninja).
3. Re-run autonomous Phase 7 (`/tmp/run_phase7_v2.sh`).
4. Read the new diagnostic — now naming the specific control.
5. Compare frame-11 field values vs frames 1-10 to localize the bad field.
6. Fix in driver. Most likely surfaces are: SPS field that wasn't properly latched after IDR, PPS field that flipped at frame 5, DECODE_PARAMS DPB entry with malformed `frame_num`/`pic_num`/`flags`.
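Step 5 can be approximated with a byte-level diff of each control payload against the copy kept from the last accepted frame. The helper below is a sketch only: `diff_ctrl_payload` is a hypothetical name, and it assumes the driver keeps a snapshot of the previous frame's payload for each control.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Logs every byte offset where the current payload differs from the
 * baseline and returns the number of differing bytes.
 */
static size_t diff_ctrl_payload(const char *name, const void *base,
                                const void *cur, size_t size)
{
    const uint8_t *b = base;
    const uint8_t *c = cur;
    size_t ndiff = 0;

    for (size_t off = 0; off < size; off++) {
        if (b[off] != c[off]) {
            fprintf(stderr, "%s: offset %zu: 0x%02x -> 0x%02x\n",
                    name, off, b[off], c[off]);
            ndiff++;
        }
    }
    return ndiff;
}
```

Offsets can be mapped back to field names with `offsetof` on the corresponding control struct. The diff is likely to be most useful for SPS and PPS, which are expected to stay constant across frames 1-11. DECODE_PARAMS changes on every frame by design, so its output needs reading by hand.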
Phase 3 (baseline anchor) is satisfied by iter3's Phase 7 run — same rig, same EINVAL, deterministic.
## Stop point
Phase 2 closed. Next: Phase 4 step 1 (Y2 v3 with TRY_EXT_CTRLS retry diagnostic). Phase 3 anchored from iter3. Per the "Stop only if user is needed" rule, no user input required to proceed — but this is a cycle of read→speculate→test that may take several iterations.