# QA checks

Quality checks for generated media. These are our own implementations: each runs locally on ffmpeg, and some also make a vision or speech-to-text (STT) call.

> Generations are charged in credits (see [Credits & plans](/guide/billing)). Every generation model also accepts `mock: true` for a free placeholder result.

### Full QA Pipeline `qa_full`

- **Provider:** local (ffmpeg + vision: google/gemini-2.5-flash, optional speech-to-text)
- **Endpoint:** none (in-process pipeline, `execQAFull` → `QAPipeline` in `qa_pipeline.go`)
- **MCP action:** `qa` tool, `action: "full"` → routes to `qa_full` (`QA_MODELS.full`)
- **Cost:** 1 credit per run (one vision call + optional transcription). This is an upper bound: the run is free if no vision client is configured.
- **Timeout:** `5m`

Runs all QA checks on a finished video in one pass:

- Probes the video, extracts 5 frames (10/30/50/70/90%) once, and extracts the audio once.
- Runs the ffmpeg checks: overexposure, motion artifacts, and audio structural/loudness/tail.
- Makes a single multi-frame Gemini call covering person consistency, visual quality, and (when a plan is given) scene-matches-plan.
- If STT and vision are both configured and `plan.vo_text` is present, also runs an in-pipeline transcript word-overlap check.

Returns a per-check status (`PASS`, `FAIL`, `SKIP`, or `ERROR`) and an overall verdict (`FAIL` if any check fails).

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `video` | string | yes | — | Video file path. Handler reads `inputs["video"]`. |
| `plan` | object | no | — | Shooting plan: `SET, LIGHT, SHOT_TYPE, ACTORS_ACTION, vo_text`. Presence of `SET` enables the scene-matches-plan sub-check; `vo_text` enables the transcript sub-check. |
| `expected_characters` | integer | no | 1 | **Declared in YAML but NOT read by the handler** — person-consistency always runs across all frames regardless. Inert. |

**Mismatch notes:** vision model is hard-coded to `google/gemini-2.5-flash` (no override field). The transcript sub-check uses `simpleTranscriptCompare` (word overlap, no second LLM call), unlike standalone `check_transcript`. Audio checks emit `SKIP` if the video has no audio track.
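
The frame sampling and the overall verdict can be illustrated with a minimal Python sketch. It is not the Go pipeline: `frame_timestamps` and `overall_verdict` are hypothetical names. It assumes the percentages refer to the video duration and that `SKIP` and `ERROR` results do not fail the run on their own.

```python
from typing import Dict, List

# Sampling positions named in the description (10/30/50/70/90%).
FRAME_POSITIONS = (0.10, 0.30, 0.50, 0.70, 0.90)


def frame_timestamps(duration_sec: float) -> List[float]:
    """Return the five sampling timestamps, assuming percent-of-duration."""
    return [round(duration_sec * p, 3) for p in FRAME_POSITIONS]


def overall_verdict(checks: Dict[str, str]) -> str:
    """FAIL if any check failed, otherwise PASS.

    Assumption: SKIP and ERROR results do not turn the run into a FAIL.
    """
    return "FAIL" if "FAIL" in checks.values() else "PASS"


if __name__ == "__main__":
    print(frame_timestamps(10.0))
    print(overall_verdict({"overexposure": "PASS", "audio_tail": "SKIP"}))
```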

---

### Same Person Check `check_same_person`

- **Provider:** local (vision: google/gemini-2.5-flash)
- **Endpoint:** none (`execCheckSamePerson` → `CheckSamePerson` in `check_vision.go`)
- **MCP action:** `qa` tool, `action: "person"` → routes to `check_same_person`. The MCP layer maps `image1`→`ref` and `image2`→`test`.
- **Cost:** 1 credit (one vision call)
- **Timeout:** `30s`

Compares facial features between a reference image and a test image (or video — mid-frame auto-extracted via `extractMidFrame`). Sends both to Gemini with `VisionCheckMulti`. Returns `same_person` (bool), `confidence` (0–100), `differences` (list), and `verdict`. PASS requires `same_person == true` AND `confidence >= min_confidence`.

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `ref` | string | yes | — | Reference image URL (persona_ref). Passed to the API as-is (no base64 conversion). |
| `test` | string | yes | — | Test image path/URL, or a video (mid-frame extracted, ext in `.mp4/.mov/.avi/.mkv/.webm`). |
| `min_confidence` | integer | no | 85 | Min confidence (0–100) for PASS. Handler re-clamps to 85 if `<= 0`. |
| `model` | string | no | `google/gemini-2.5-flash` | Vision model override. |

**Mismatch notes:** YAML/handler fields match exactly. Errors if the vision client is not configured on the server, or if `ref`/`test` is empty.
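
The PASS rule and the `min_confidence` fallback can be expressed in a few lines. This is a sketch in Python for illustration only; the function name is hypothetical and the real logic lives in the Go handler.

```python
DEFAULT_MIN_CONFIDENCE = 85


def same_person_verdict(same_person: bool, confidence: int,
                        min_confidence: int = 0) -> str:
    """Apply the documented rule: same_person AND confidence >= threshold."""
    # A threshold of 0 or below falls back to the default of 85.
    if min_confidence <= 0:
        min_confidence = DEFAULT_MIN_CONFIDENCE
    passed = same_person and confidence >= min_confidence
    return "PASS" if passed else "FAIL"


if __name__ == "__main__":
    print(same_person_verdict(True, 90))      # default threshold applies
    print(same_person_verdict(True, 70, 60))  # caller-supplied threshold
```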

---

### Scene Matches Plan Check `check_scene_matches_plan`

- **Provider:** local (vision: google/gemini-2.5-flash)
- **Endpoint:** none (`execCheckSceneMatchesPlan` → `CheckSceneMatchesPlan` in `check_vision.go`)
- **MCP action:** `qa` tool, `action: "scene"` → routes to `check_scene_matches_plan`. MCP maps `video`→`in` and passes `plan` through. Both `video` and `plan` are required at the MCP layer.
- **Cost:** 1 credit
- **Timeout:** `30s`

Checks each shooting-plan field (`SET / LIGHT / SHOT_TYPE / ACTORS_ACTION`) against the image. For video input, extracts the mid-frame. Sends the plan as JSON + the image to Gemini (`VisionCheck`). Returns per-field `{verdict, reason}` under `fields`, plus overall `verdict` (`FAIL` if any field fails; the model is instructed to only judge fields present in the plan).

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Image or video path to check. Handler reads `inputs["in"]`. |
| `plan` | object | yes | — | Plan object with `SET, LIGHT, SHOT_TYPE, ACTORS_ACTION`. Handler errors if nil. |
| `model` | string | no | `google/gemini-2.5-flash` | Vision model override. |

**Mismatch notes:** YAML/handler fields match. Note the field name is `in` (not `video`/`image`); the MCP `scene` action takes `video` and remaps it.
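
The `video` to `in` remapping and the per-field verdict roll-up are easy to get wrong when calling the check directly, so here is a small Python sketch of both. The function names are hypothetical, and the error message is an assumption rather than the server's actual text.

```python
from typing import Any, Dict


def scene_args_to_inputs(args: Dict[str, Any]) -> Dict[str, Any]:
    """Map MCP `scene` arguments onto the handler's input names."""
    video = args.get("video")
    plan = args.get("plan")
    # Both arguments are required at the MCP layer.
    if not video or plan is None:
        raise ValueError("both 'video' and 'plan' are required")
    return {"in": video, "plan": plan}


def scene_verdict(fields: Dict[str, Dict[str, str]]) -> str:
    """Overall verdict: FAIL if any judged field failed, otherwise PASS."""
    failed = any(f.get("verdict") == "FAIL" for f in fields.values())
    return "FAIL" if failed else "PASS"


if __name__ == "__main__":
    inputs = scene_args_to_inputs({"video": "clip.mp4", "plan": {"SET": "kitchen"}})
    print(inputs)
    print(scene_verdict({"SET": {"verdict": "PASS", "reason": "kitchen visible"}}))
```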

---

### Image Description Check `check_image_description`

- **Provider:** local (vision: google/gemini-2.5-flash)
- **Endpoint:** none (`execCheckImageDescription` → `CheckImageDescription` in `check_vision.go`)
- **MCP action:** `qa` tool, `action: "image"` → routes to `check_image_description`. MCP maps `image_url`→`in` and passes `description` through.
- **Cost:** 1 credit
- **Timeout:** `30s`

Sends an image + expected description to Gemini; the model judges whether the image matches. Local files are read and base64-encoded as a `data:image/png` URI; `http`-prefixed inputs are passed as-is. Uses structured output (`VisionCheckStructured` with a `verdict/match/reason/details` schema) and falls back to unstructured `VisionCheck` on error. Returns `verdict (PASS/FAIL)`, `match` (bool), `reason`, and `details` (found/missing elements).

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Image path (local) or URL. |
| `description` | string | yes | — | Expected description text. |
| `model` | string | no | `google/gemini-2.5-flash` | Vision model override. |

**Mismatch notes:** YAML/handler fields match. Caveat: non-http paths are always encoded as `image/png` regardless of real extension — a `.jpg` is still sent with a PNG MIME label (works with Gemini, but technically mislabeled).
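
The input handling described above can be sketched as follows. This Python version is illustrative, not the handler itself: `to_image_input` is a hypothetical name, and the `;base64,` form of the data URI is assumed from the standard data-URL syntax.

```python
import base64
from pathlib import Path


def to_image_input(path_or_url: str) -> str:
    """Return the value sent to the vision model for a given input."""
    # Inputs starting with "http" are passed through untouched.
    if path_or_url.startswith("http"):
        return path_or_url
    # Local files are always labeled image/png, whatever their extension.
    encoded = base64.b64encode(Path(path_or_url).read_bytes()).decode("ascii")
    return f"data:image/png;base64,{encoded}"


if __name__ == "__main__":
    print(to_image_input("https://example.com/frame.jpg"))
```

A caller who needs a correct MIME label can host the file and pass its URL instead of a local path.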

---

### Voice Consistency Check `check_voice_consistency`

- **Provider:** local (vision/audio: google/gemini-2.5-flash)
- **Endpoint:** none (`execCheckVoiceConsistency` → `CheckVoiceConsistency` in `check_audio.go`)
- **MCP action:** `qa` tool, `action: "voice"` → routes to `check_voice_consistency`. MCP maps `audio`→`in`.
- **Cost:** 1 credit
- **Timeout:** `30s`

Extracts N short (~3s) audio segments evenly across the file with ffmpeg, base64-encodes them as `data:audio/mpeg` URIs, and sends all segments to Gemini in one structured call to judge whether the same speaker (pitch, timbre, accent, style, gender, age impression) is present throughout. Returns `verdict (PASS/FAIL)`, `same_speaker` (bool), `issues` (list).

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Audio file (mp3/wav/aac). |
| `segments` | integer | no | 3 | Number of segments to compare. The handler applies the caller's value only when it is `> 0`, and the check then resets any value `<= 1` to 3, so requesting 1 segment still compares 3. |
| `model` | string | no | `google/gemini-2.5-flash` | Model override. |

**Mismatch notes:** Undocumented short-circuit: audio shorter than 2.0s returns PASS immediately with `note: "audio too short to compare segments"` and makes no API call. For longer audio, at least 2 segments must be extractable, otherwise the check errors.
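
How the `segments` value is resolved, and when the short-audio shortcut fires, can be summarized in a short Python sketch. The function names are hypothetical and the code only mirrors the rules stated in this section.

```python
from typing import Dict, Optional

DEFAULT_SEGMENTS = 3
MIN_DURATION_SEC = 2.0


def resolve_segments(requested: int = 0) -> int:
    """Resolve the effective segment count from the caller's value."""
    segments = DEFAULT_SEGMENTS
    if requested > 0:       # handler: override only for positive values
        segments = requested
    if segments <= 1:       # check: a single segment cannot be compared
        segments = DEFAULT_SEGMENTS
    return segments


def short_circuit(duration_sec: float) -> Optional[Dict[str, str]]:
    """Return the immediate PASS result for very short audio, else None."""
    if duration_sec < MIN_DURATION_SEC:
        return {"verdict": "PASS", "note": "audio too short to compare segments"}
    return None


if __name__ == "__main__":
    print(resolve_segments(1), resolve_segments(5))
    print(short_circuit(1.5))
```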

---

### Transcript Check `check_transcript`

- **Provider:** local (vision: google/gemini-2.5-flash + speech-to-text)
- **Endpoint:** none (`execCheckTranscript` → `CheckTranscriptMatchesPlan` in `check_vision.go`)
- **MCP action:** `qa` tool, `action: "transcript"` → accepts a video OR a pure audio URL plus an optional ISO-639-1 `language` hint and an optional `expected_text`. Omit `expected_text` for transcription-only mode (no compare, verdict `PASS`).
- **Cost:** 1 credit (transcription + comparison)
- **Timeout:** `2m`

Pipeline: extract audio (ffmpeg → mp3; skipped when the input is already audio) → transcribe via our STT step → compare to the expected text via an LLM call. If `expected_text` is omitted, the check runs in transcription-only mode: no comparison, verdict `PASS`. Returns `actual_transcript`, `duration_sec`, `segments` (`[{start_s, end_s, text}]`), and `segment_count`; when comparing, it also returns `similarity_pct`, `missing_words`, `extra_words`, and `verdict`. The verdict is PASS when `similarity_pct >= 80`. On any LLM or parse error the comparison falls back to `simpleTranscriptCompare` (word overlap).

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `video` | string | yes | — | Video **or pure audio** file path/URL; audio extracted automatically when a video is given. |
| `expected_text` | string | no | — | Expected voiceover text. **Optional** — omit → transcription-only mode (no compare, verdict `PASS`). |
| `language` | string | no | — | ISO-639-1 code passed to the STT step (improves accuracy). |
| `vision_model` | string | no | `google/gemini-2.5-flash` | LLM for semantic comparison. |

**Mismatch notes:** The standalone check does a real LLM comparison (`vision.VisionCheck`), whereas the same check inside `qa_full` uses word-overlap only — the two paths differ.
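
To make the word-overlap path concrete, here is one possible Python sketch of such a comparison. It is an assumption, not a port of `simpleTranscriptCompare`: the tokenization (lowercased ASCII words), the similarity definition (share of expected words found in the transcript), and the function name are all hypothetical. Only the 80% PASS threshold comes from this document.

```python
import re
from typing import Dict, List

PASS_THRESHOLD_PCT = 80.0


def _words(text: str) -> List[str]:
    # Assumed tokenization: lowercase ASCII letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_overlap_compare(expected: str, actual: str) -> Dict[str, object]:
    """Compare two texts by word overlap and apply the 80% threshold."""
    expected_words = _words(expected)
    actual_words = _words(actual)
    if not expected_words:
        # Nothing to compare against: mirrors transcription-only mode.
        return {"verdict": "PASS"}
    actual_set, expected_set = set(actual_words), set(expected_words)
    missing = [w for w in expected_words if w not in actual_set]
    extra = [w for w in actual_words if w not in expected_set]
    similarity = 100.0 * (len(expected_words) - len(missing)) / len(expected_words)
    return {
        "similarity_pct": similarity,
        "missing_words": missing,
        "extra_words": extra,
        "verdict": "PASS" if similarity >= PASS_THRESHOLD_PCT else "FAIL",
    }


if __name__ == "__main__":
    print(word_overlap_compare("the quick brown fox jumps",
                               "The quick brown fox leaps"))
```

A pure overlap measure ignores word order and paraphrase, which is one reason the standalone check and the `qa_full` path can disagree on the same clip.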

---

### Video Description `describe_video`

- **Provider:** local (multimodal analysis model)
- **Endpoint:** none (`execDescribeVideo` in `describe_video.go`)
- **MCP action:** `qa` tool, `action: "describe"` → routes to `describe_video`. MCP maps `video`→`in` and passes `fps`/`focus` through.
- **Cost:** ≈1 credit per 25 s of video at `fps: 1`, scales with `fps`; minimum 1 credit.
- **Timeout:** `5m`

Watches the whole video and returns a timecoded, scene-by-scene breakdown. The segments partition the video at scene changes (cuts, location changes, clear changes of action); each segment reports `start_s`/`end_s`, `scene` (what visually happens), `speech` (transcribed words, `""` if none), `sounds` (notable SFX/ambient), and `music` (`""` if none). Async: the call returns a `job_id` — poll `get_status` (~every 15 s; a typical run takes 1–3 minutes), then read `segments` and `segment_count` from the result.

**Parameters**

| Name | Type | Req | Default | Notes |
|---|---|---|---|---|
| `video` | string | yes | — | Video URL. Sent to the model as `in`. Max duration 1 hour, and duration × fps must not exceed 3600 (fps 1 → up to 60 min, fps 5 → up to 12 min); larger inputs are rejected. |
| `fps` | integer | no | 1 | Frames sampled per second (1–5). Raise for fast-cut footage; cost scales with `fps`. |
| `focus` | string | no | — | Extra instruction (≤2000 chars), e.g. "focus on product shots" or an expected-shot list to check against. |

**Notes:** the analysis model is pinned server-side (no caller override). Segment text fields are length-capped and the segment list is bounded, so very long or unusual videos return a trimmed but well-formed result.
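
The input limits and the cost rule can be checked client-side before submitting a job. The Python sketch below is illustrative: the function names and error strings are hypothetical, and the cost figure is an unrounded estimate because this document does not state how credits are rounded.

```python
from typing import List

MAX_DURATION_SEC = 3600      # 1 hour
MAX_SAMPLED_FRAMES = 3600    # duration x fps ceiling
MAX_FOCUS_CHARS = 2000
SECONDS_PER_CREDIT = 25      # at fps 1


def validate_describe_request(duration_sec: float, fps: int = 1,
                              focus: str = "") -> List[str]:
    """Return a list of limit violations (empty when the request is valid)."""
    errors = []
    if not 1 <= fps <= 5:
        errors.append("fps must be between 1 and 5")
    if duration_sec > MAX_DURATION_SEC:
        errors.append("video longer than 1 hour")
    if duration_sec * fps > MAX_SAMPLED_FRAMES:
        errors.append("duration x fps exceeds 3600")
    if len(focus) > MAX_FOCUS_CHARS:
        errors.append("focus longer than 2000 characters")
    return errors


def estimate_credits(duration_sec: float, fps: int = 1) -> float:
    """Approximate cost: about 1 credit per 25 s at fps 1, minimum 1."""
    return max(1.0, duration_sec * fps / SECONDS_PER_CREDIT)


if __name__ == "__main__":
    print(validate_describe_request(721, fps=5))
    print(estimate_credits(100, fps=2))
```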

---

### `check_audio_loudness`

- **Provider:** local (ffmpeg `loudnorm`)
- **Display name:** Audio Loudness Check
- **Category / mode:** qa_check / sync
- **Cost:** free (`cost_per_unit: 0`)
- **Timeout:** 30s
- **MCP action:** none (internal-only; REST `POST /v1/jobs/check_audio_loudness` or via `qa_full`)
- **Handler:** `execCheckAudioLoudness` → `CheckAudioLoudness` (`check_audio.go`)

Measures integrated loudness and true peak with a single ffmpeg `loudnorm=print_format=json` analysis pass, parses the JSON from ffmpeg stderr (`input_i`, `input_tp`, `input_lra`).

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Audio file path/URL to check (materialized & SSRF-checked by the executor). |
| `target_lufs` | number | no | -14 | Target integrated LUFS. **Cannot be set to literal 0** — handler treats 0 as "unset" and substitutes -14. |
| `tolerance` | number | no | 3 | Allowed deviation in LU. 0 → coerced to 3. |
| `max_true_peak` | number | no | -1 | Max true peak in dBTP. 0 → coerced to -1. |

**Verdict:** PASS if `|integrated - target| <= tolerance` AND `true_peak <= max_true_peak`, else FAIL.
**Metrics:** `lufs_integrated`, `true_peak_db`, `lra`, plus echoed `target_lufs`/`tolerance`/`max_true_peak`.

> Note: handler coerces any 0-valued numeric param to its default (see code-vs-YAML mismatches). If the loudnorm JSON block is missing from stderr the call errors instead of returning a verdict.
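
The parse-then-judge flow can be sketched in Python as below. It is an illustration under assumptions, not the Go handler: the function names are hypothetical, and locating the JSON block by its last `{` and `}` is one simple way to pull it out of ffmpeg's stderr. The `input_i`, `input_tp`, and `input_lra` keys are the ones `loudnorm` prints as strings.

```python
import json
from typing import Dict


def parse_loudnorm(stderr: str) -> Dict[str, float]:
    """Extract the loudnorm measurements from ffmpeg stderr text."""
    start, end = stderr.rfind("{"), stderr.rfind("}")
    if start == -1 or end < start:
        # Mirrors the documented behavior: no JSON block means an error.
        raise ValueError("loudnorm JSON block not found in ffmpeg stderr")
    data = json.loads(stderr[start:end + 1])
    return {
        "lufs_integrated": float(data["input_i"]),
        "true_peak_db": float(data["input_tp"]),
        "lra": float(data["input_lra"]),
    }


def loudness_verdict(metrics: Dict[str, float], target_lufs: float = 0.0,
                     tolerance: float = 0.0, max_true_peak: float = 0.0) -> str:
    """Apply the verdict rule, coercing 0-valued params to their defaults."""
    target_lufs = target_lufs or -14.0
    tolerance = tolerance or 3.0
    max_true_peak = max_true_peak or -1.0
    within = abs(metrics["lufs_integrated"] - target_lufs) <= tolerance
    peak_ok = metrics["true_peak_db"] <= max_true_peak
    return "PASS" if within and peak_ok else "FAIL"


if __name__ == "__main__":
    sample = '{\n"input_i" : "-15.20",\n"input_tp" : "-1.50",\n"input_lra" : "4.00"\n}'
    print(loudness_verdict(parse_loudnorm(sample)))
```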

---

### `check_audio_structural`

- **Provider:** local (ffprobe)
- **Display name:** Audio Structural Check
- **Category / mode:** qa_check / sync
- **Cost:** free (`cost_per_unit: 0`)
- **Timeout:** 30s
- **MCP action:** none (internal-only; REST `POST /v1/jobs/check_audio_structural` or via `qa_full`)
- **Handler:** `execCheckAudioStructural` → `CheckAudioStructural` (`check_audio.go`), via `Probe` (ffprobe)

Probes the file, finds the first audio stream, and checks duration and codec.

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Audio file path/URL to check. |

**Verdict:** FAIL if no streams / no audio stream, OR duration `< 1.0s`, OR codec not in `{mp3, aac, pcm_s16le, flac, vorbis, opus}`; else PASS.
**Metrics:** `duration_sec`, `sample_rate`, `channels`, `codec`, `bitrate_kbps`. Failing reasons listed in `issues`.

> Note: YAML/prompt_guide name the metrics `duration` and `bitrate`; handler emits `duration_sec` and `bitrate_kbps` (= ffprobe `bit_rate` / 1000). Sample-rate and channel values are reported but never cause a FAIL.
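
The verdict rule can be exercised against ffprobe-style stream dictionaries with a sketch like the one below. It is written in Python for illustration and is not the Go `Probe` code: the function name and issue strings are hypothetical, and reading `duration` from the audio stream (rather than the container) is an assumption.

```python
from typing import Any, Dict, List

ALLOWED_CODECS = {"mp3", "aac", "pcm_s16le", "flac", "vorbis", "opus"}
MIN_DURATION_SEC = 1.0


def structural_check(streams: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Judge the first audio stream of an ffprobe-style stream list."""
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    if audio is None:
        return {"verdict": "FAIL", "issues": ["no audio stream"], "metrics": {}}
    duration = float(audio.get("duration", 0))
    codec = audio.get("codec_name", "")
    metrics = {
        "duration_sec": duration,
        "sample_rate": int(audio.get("sample_rate", 0)),  # reported only
        "channels": int(audio.get("channels", 0)),        # reported only
        "codec": codec,
        "bitrate_kbps": float(audio.get("bit_rate", 0)) / 1000,
    }
    issues = []
    if duration < MIN_DURATION_SEC:
        issues.append("duration below 1.0s")
    if codec not in ALLOWED_CODECS:
        issues.append(f"unsupported codec: {codec}")
    return {"verdict": "FAIL" if issues else "PASS",
            "issues": issues, "metrics": metrics}


if __name__ == "__main__":
    stream = {"codec_type": "audio", "codec_name": "aac", "duration": "12.5",
              "sample_rate": "44100", "channels": 2, "bit_rate": "128000"}
    print(structural_check([stream]))
```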

---

### `check_audio_tail`

- **Provider:** local (ffmpeg `volumedetect`)
- **Display name:** Audio Tail Check
- **Category / mode:** qa_check / sync
- **Cost:** free (`cost_per_unit: 0`)
- **Timeout:** 30s
- **MCP action:** none (internal-only; REST `POST /v1/jobs/check_audio_tail` or via `qa_full`)
- **Handler:** `execCheckAudioTail` → `CheckAudioTail` (`check_audio.go`)

Detects an abrupt cut-off at the end of audio (the "v1 VO bug"). Splits the trailing `tail_sec` window in two and compares per-half RMS measured with ffmpeg `volumedetect` (`mean_volume`).

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Audio file path/URL to check. |
| `tail_sec` | number | no | 1.0 | Seconds of tail to analyze. `<= 0` → coerced to 1.0; clamped down to total duration if shorter. |
| `silence_db` | number | no | -40 | RMS dB threshold below which the tail counts as silent. **Cannot be set to literal 0** — 0 → coerced to -40. |

**Verdict:** PASS if `rms_second_half <= silence_db` (silent) OR `rms_second_half < rms_first_half * 0.7` (fading); else FAIL ("tail not fading").
**Metrics:** `tail_sec`, `silence_db`, `rms_first_half`, `rms_second_half`, `is_silent`, `is_fading`, `total_duration`.

> Note: YAML prose says PASS when the second half is merely "quieter"; the handler is stricter and requires a ≥30% RMS drop (`* 0.7`). An unmeasurable half returns -100 dB (treated as silent → PASS).
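
The verdict expression can be tried out with a small Python sketch that reproduces it literally. The function name is hypothetical, and both RMS inputs are assumed to be dB values, as the -100 dB sentinel suggests. Under that assumption, `rms_first_half * 0.7` lies closer to 0 dB than `rms_first_half` itself (for example, -20 becomes -14), so the sketch makes no claim about how the comparison maps to a linear amplitude drop.

```python
from typing import Dict, Union


def tail_verdict(rms_first_half: float, rms_second_half: float,
                 silence_db: float = 0.0) -> Dict[str, Union[str, bool]]:
    """Evaluate the documented tail rule on two per-half RMS values (dB)."""
    # A literal 0 is treated as "unset" and replaced with -40.
    silence_db = silence_db or -40.0
    is_silent = rms_second_half <= silence_db
    is_fading = rms_second_half < rms_first_half * 0.7
    verdict = "PASS" if (is_silent or is_fading) else "FAIL"
    return {"verdict": verdict, "is_silent": is_silent, "is_fading": is_fading}


if __name__ == "__main__":
    print(tail_verdict(-20.0, -45.0))   # second half below the silence floor
    print(tail_verdict(-20.0, -10.0))   # tail gets louder
    print(tail_verdict(-20.0, -100.0))  # unmeasurable half reported as -100 dB
```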

---

### `check_motion_artifacts`

- **Provider:** local (ffmpeg `signalstats` YDIF)
- **Display name:** Motion Artifacts Check
- **Category / mode:** qa_check / sync
- **Cost:** free (`cost_per_unit: 0`)
- **Timeout:** 2m
- **MCP action:** none (internal-only; REST `POST /v1/jobs/check_motion_artifacts` or via `qa_full`)
- **Handler:** `execCheckMotionArtifacts` → `CheckMotionArtifacts` (`check_video.go`)

Scans for frame-to-frame luminance-difference spikes that indicate glitches or unintended jump cuts. Parses YDIF values from an ffmpeg `signalstats=stat=tout` pass, computes their mean, and flags every frame where `diff > mean * spike_factor`.

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Video file path/URL to check. |
| `spike_factor` | number | no | 4 | A frame whose diff exceeds `mean * spike_factor` is a spike. `<= 0` → coerced to 4. Lower = stricter. |

**Verdict:** PASS if `spikes_count <= 1` (a single spike can be a legitimate transition); FAIL if `> 1`.
**Metrics:** `frames_checked`, `mean_diff`, `max_diff`, `stddev`, `spike_factor`, `spikes_count`, `spike_frames`.

> Note: the handler runs an extra `mestimate`+`metadata=print` pass whose output is discarded — only the `signalstats` YDIF pass is used. If no YDIF lines parse, it returns PASS with a `could not extract frame differences` warning. `spike_frames` are indices into the parsed YDIF list, not absolute video frame numbers.
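
Spike detection over an already-parsed YDIF list can be sketched as follows. This Python version is for illustration only: the function name is hypothetical, and using the population standard deviation for `stddev` is an assumption.

```python
from statistics import mean, pstdev
from typing import Any, Dict, List


def motion_check(ydif: List[float], spike_factor: float = 0.0) -> Dict[str, Any]:
    """Flag frames whose luminance difference exceeds mean * spike_factor."""
    if spike_factor <= 0:
        spike_factor = 4.0
    if not ydif:
        # No parsed values: the check passes with a warning.
        return {"verdict": "PASS", "frames_checked": 0,
                "warning": "could not extract frame differences"}
    avg = mean(ydif)
    # Indices refer to positions in the parsed list, not video frame numbers.
    spike_frames = [i for i, d in enumerate(ydif) if d > avg * spike_factor]
    return {
        "verdict": "PASS" if len(spike_frames) <= 1 else "FAIL",
        "frames_checked": len(ydif),
        "mean_diff": avg,
        "max_diff": max(ydif),
        "stddev": pstdev(ydif),
        "spike_factor": spike_factor,
        "spikes_count": len(spike_frames),
        "spike_frames": spike_frames,
    }


if __name__ == "__main__":
    print(motion_check([1.0] * 20 + [50.0, 50.0])["verdict"])
```

Because each spike also raises the mean, a clip with many large jumps can end up with fewer flagged frames than expected; lowering `spike_factor` makes the check stricter.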

---

### `overexposure_check`

- **Provider:** local (ffmpeg `signalstats` BRNG)
- **Display name:** Overexposure Check
- **Category / mode:** qa_check / sync
- **Cost:** free (`cost_per_unit: 0`)
- **Timeout:** 2m
- **MCP action:** none (internal-only; REST `POST /v1/jobs/overexposure_check` or via `qa_full`)
- **Handler:** `execOverexposureCheck` → `CheckOverexposure` (`overexposure.go`)

Detects blown-out highlights in an image or video. Samples frames at `sample_fps` and reads `signalstats` BRNG (percent of pixels outside broadcast range) as the clipped-pixel proxy, taking the worst sampled frame.

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Image or video path/URL to check. |
| `max_clipped_pct` | number | no | 3.0 | Max % of clipped pixels before FAIL. `<= 0` → coerced to 3.0. |
| `sample_fps` | number | no | 2 | Frames per second to sample (video). `<= 0` → coerced to 2. Read as an int. |

**Verdict:** PASS if `worst_frame_pct <= max_clipped_pct`; else FAIL (suggested fix: apply `highlight_rolloff`, then re-check).
**Metrics:** `worst_frame_pct`, `max_clipped_pct`, `frames_checked`, `max_brng`.

> Note: the YAML describes "clipped pixels at max luminance", but the handler measures BRNG (broadcast-range %), not a true white-clip count — `worst_frame_pct` is a proxy. A discarded `histogram` pass runs first. If `signalstats` returns no BRNG frames, the handler returns PASS with a `signalstats not available` warning (`frames_checked: 0`), which can mask genuine overexposure.
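
The verdict rule, including the no-data case, can be sketched in Python as below. The function name is hypothetical, and the input is assumed to be a list of per-frame BRNG values already expressed as percentages.

```python
from typing import Any, Dict, List


def overexposure_verdict(brng_pct: List[float],
                         max_clipped_pct: float = 0.0) -> Dict[str, Any]:
    """Judge the worst sampled frame against the clipped-pixel limit."""
    if max_clipped_pct <= 0:
        max_clipped_pct = 3.0
    if not brng_pct:
        # No BRNG data: passes with a warning, which can hide real problems.
        return {"verdict": "PASS", "frames_checked": 0,
                "warning": "signalstats not available"}
    worst = max(brng_pct)
    return {
        "verdict": "PASS" if worst <= max_clipped_pct else "FAIL",
        "worst_frame_pct": worst,
        "max_clipped_pct": max_clipped_pct,
        "frames_checked": len(brng_pct),
    }


if __name__ == "__main__":
    print(overexposure_verdict([0.4, 1.2, 5.8]))
    print(overexposure_verdict([]))
```

Callers that treat the result as a gate may want to treat `frames_checked: 0` as inconclusive rather than as a pass.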