QA checks

Quality checks for generated media. These are our own implementations: local ffmpeg/ffprobe analysis, plus a vision or speech-to-text (STT) call for some checks.

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

Full QA Pipeline `qa_full`

Provider: local (ffmpeg + vision: google/gemini-2.5-flash, optional speech-to-text)
Endpoint: none (in-process pipeline, execQAFull → QAPipeline in qa_pipeline.go)
MCP action: qa tool, action: "full" → routes to qa_full (QA_MODELS.full)
Cost: 1 credit per run (one vision call + optional transcription). This is an upper bound: the run is free if no vision client is configured.
Timeout: 5m

Runs all QA checks on a finished video in one pass. Probes the video, extracts 5 frames (10/30/50/70/90%) once, extracts audio once, then runs ffmpeg checks (overexposure, motion artifacts, audio structural/loudness/tail) plus a single multi-frame Gemini call (person consistency, visual quality, and — when a plan is given — scene-matches-plan). If STT+vision are configured and plan.vo_text is present, also runs an in-pipeline transcript word-overlap check. Returns per-check PASS/FAIL/SKIP/ERROR and an overall verdict (FAIL if any check fails).
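The sampling positions and the overall verdict rule described above can be sketched as follows. This is a minimal illustration, not the code in qa_pipeline.go: the function names are hypothetical, and the treatment of SKIP and ERROR (they do not change the overall verdict) is an assumption, since the text only says the run fails if any check fails.

```python
from typing import Dict, List

# Sampling positions named in the description (10/30/50/70/90% of the video).
FRAME_FRACTIONS = (0.10, 0.30, 0.50, 0.70, 0.90)


def frame_timestamps(duration_sec: float) -> List[float]:
    """Return the five sample times, in seconds, for a video of this length."""
    return [round(duration_sec * f, 3) for f in FRAME_FRACTIONS]


def overall_verdict(checks: Dict[str, str]) -> str:
    """FAIL if any per-check result is FAIL, otherwise PASS.

    Assumption: SKIP and ERROR results do not change the overall verdict.
    """
    return "FAIL" if "FAIL" in checks.values() else "PASS"
```

For a 20-second video, `frame_timestamps(20.0)` yields samples at 2, 6, 10, 14 and 18 seconds.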

Parameters

Name	Type	Req	Default	Notes
`video`	string	yes	—	Video file path. Handler reads `inputs["video"]`.
`plan`	object	no	—	Shooting plan: `SET, LIGHT, SHOT_TYPE, ACTORS_ACTION, vo_text`. Presence of `SET` enables the scene-matches-plan sub-check; `vo_text` enables the transcript sub-check.
`expected_characters`	integer	no	1	Declared in YAML but NOT read by the handler — person-consistency always runs across all frames regardless. Inert.

Mismatch notes: vision model is hard-coded to google/gemini-2.5-flash (no override field). The transcript sub-check uses simpleTranscriptCompare (word overlap, no second LLM call), unlike standalone check_transcript. Audio checks emit SKIP if the video has no audio track.

Same Person Check `check_same_person`

Provider: local (vision: google/gemini-2.5-flash)
Endpoint: none (execCheckSamePerson → CheckSamePerson in check_vision.go)
MCP action: qa tool, action: "person" → routes to check_same_person. The MCP layer maps image1→ref and image2→test.
Cost: 1 credit (one vision call)
Timeout: 30s

Compares facial features between a reference image and a test image (or video — mid-frame auto-extracted via extractMidFrame). Sends both to Gemini with VisionCheckMulti. Returns same_person (bool), confidence (0–100), differences (list), and verdict. PASS requires same_person == true AND confidence >= min_confidence.

Parameters

Name	Type	Req	Default	Notes
`ref`	string	yes	—	Reference image URL (persona_ref). Passed to the API as-is (no base64 conversion).
`test`	string	yes	—	Test image path/URL, or a video (mid-frame extracted, ext in `.mp4/.mov/.avi/.mkv/.webm`).
`min_confidence`	integer	no	85	Min confidence (0–100) for PASS. The handler resets any value `<= 0` to 85.
`model`	string	no	`google/gemini-2.5-flash`	Vision model override.

Mismatch notes: YAML/handler fields match exactly. Errors if the vision client is not configured on the server, or if ref/test is empty.
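The PASS rule for this check is small enough to state as code. The sketch below restates the documented rule, including the reset of non-positive thresholds to 85; the function name is hypothetical and this is not the handler itself.

```python
def same_person_verdict(same_person: bool, confidence: int, min_confidence: int = 85) -> str:
    """PASS only when the model says same person AND confidence meets the threshold."""
    if min_confidence <= 0:
        # Documented behaviour: values <= 0 fall back to the default of 85.
        min_confidence = 85
    return "PASS" if same_person and confidence >= min_confidence else "FAIL"
```

Note that a high confidence alone is not enough: `same_person_verdict(False, 99)` is still FAIL.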

Scene Matches Plan Check `check_scene_matches_plan`

Provider: local (vision: google/gemini-2.5-flash)
Endpoint: none (execCheckSceneMatchesPlan → CheckSceneMatchesPlan in check_vision.go)
MCP action: qa tool, action: "scene" → routes to check_scene_matches_plan. MCP maps video→in and passes plan through. Both video and plan are required at the MCP layer.
Cost: 1 credit
Timeout: 30s

Checks each shooting-plan field (SET / LIGHT / SHOT_TYPE / ACTORS_ACTION) against the image. For video input, extracts the mid-frame. Sends the plan as JSON + the image to Gemini (VisionCheck). Returns per-field {verdict, reason} under fields, plus overall verdict (FAIL if any field fails; the model is instructed to only judge fields present in the plan).

Parameters

Name	Type	Req	Default	Notes
`in`	string	yes	—	Image or video path to check. Handler reads `inputs["in"]`.
`plan`	object	yes	—	Plan object with `SET, LIGHT, SHOT_TYPE, ACTORS_ACTION`. Handler errors if nil.
`model`	string	no	`google/gemini-2.5-flash`	Vision model override.

Mismatch notes: YAML/handler fields match. Note the field name is in (not video/image); the MCP scene action takes video and remaps it.
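Because the MCP argument names differ from the handler input names, a lookup table can help when moving between the two layers. The sketch below collects only the renames stated on this page; the table and function are hypothetical helpers, not part of the server, and any argument without a documented rename is passed through unchanged.

```python
from typing import Any, Dict, Tuple

# qa tool action -> (target model, {MCP argument name: handler input name}).
# Only renames stated on this page are listed.
MCP_REMAP: Dict[str, Tuple[str, Dict[str, str]]] = {
    "person": ("check_same_person", {"image1": "ref", "image2": "test"}),
    "scene": ("check_scene_matches_plan", {"video": "in"}),
    "image": ("check_image_description", {"image_url": "in"}),
    "voice": ("check_voice_consistency", {"audio": "in"}),
    "describe": ("describe_video", {"video": "in"}),
}


def to_handler_inputs(action: str, args: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Translate qa-tool arguments into the inputs the target model reads."""
    model, renames = MCP_REMAP[action]
    if action == "scene" and not (args.get("video") and args.get("plan")):
        # The MCP layer requires both fields for the scene action.
        raise ValueError("scene requires both video and plan")
    return model, {renames.get(key, key): value for key, value in args.items()}
```

For example, `to_handler_inputs("scene", {"video": "clip.mp4", "plan": {"SET": "kitchen"}})` produces the model name `check_scene_matches_plan` with the video under `in`.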

Image Description Check `check_image_description`

Provider: local (vision: google/gemini-2.5-flash)
Endpoint: none (execCheckImageDescription → CheckImageDescription in check_vision.go)
MCP action: qa tool, action: "image" → routes to check_image_description. MCP maps image_url→in and passes description through.
Cost: 1 credit
Timeout: 30s

Sends an image + expected description to Gemini; the model judges whether the image matches. Local files are read and base64-encoded as a data:image/png URI; http-prefixed inputs are passed as-is. Uses structured output (VisionCheckStructured with a verdict/match/reason/details schema) and falls back to unstructured VisionCheck on error. Returns verdict (PASS/FAIL), match (bool), reason, and details (found/missing elements).

Parameters

Name	Type	Req	Default	Notes
`in`	string	yes	—	Image path (local) or URL.
`description`	string	yes	—	Expected description text.
`model`	string	no	`google/gemini-2.5-flash`	Vision model override.

Mismatch notes: YAML/handler fields match. Caveat: non-http paths are always encoded as image/png regardless of real extension — a .jpg is still sent with a PNG MIME label (works with Gemini, but technically mislabeled).
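The input handling and the MIME caveat can be illustrated with a short sketch. It mirrors the behaviour described above (pass http-prefixed inputs through, base64-encode everything else with a fixed PNG label); the function name is hypothetical and the real handler may differ in details such as error handling.

```python
import base64
from pathlib import Path


def to_image_ref(src: str) -> str:
    """Return the value sent to the vision model for an image input."""
    if src.startswith("http"):
        return src  # http-prefixed inputs are passed through unchanged
    encoded = base64.b64encode(Path(src).read_bytes()).decode("ascii")
    # The label is fixed to image/png whatever the real extension is,
    # which is the mislabeling described in the caveat.
    return f"data:image/png;base64,{encoded}"
```

A caller who needs a correct MIME label can host the file and pass a URL instead of a local path.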

Voice Consistency Check `check_voice_consistency`

Provider: local (vision/audio: google/gemini-2.5-flash)
Endpoint: none (execCheckVoiceConsistency → CheckVoiceConsistency in check_audio.go)
MCP action: qa tool, action: "voice" → routes to check_voice_consistency. MCP maps audio→in.
Cost: 1 credit
Timeout: 30s

Extracts N short (~3s) audio segments evenly across the file with ffmpeg, base64-encodes them as data:audio/mpeg URIs, and sends all segments to Gemini in one structured call to judge whether the same speaker (pitch, timbre, accent, style, gender, age impression) is present throughout. Returns verdict (PASS/FAIL), same_speaker (bool), issues (list).

Parameters

Name	Type	Req	Default	Notes
`in`	string	yes	—	Audio file (mp3/wav/aac).
`segments`	integer	no	3	Number of segments to compare. The handler applies the supplied value only when it is `> 0`, and any effective value `<= 1` is reset to 3 (so `1` also yields 3).
`model`	string	no	`google/gemini-2.5-flash`	Model override.

Mismatch notes: Undocumented short-circuit: audio under 2.0s returns PASS immediately with note: "audio too short to compare segments" (no API call). Otherwise at least 2 segments must be extractable, or the call errors.
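The segment-count rules and the short-audio shortcut can be sketched as below. The count logic follows the parameter notes; the placement of segments (first at 0 s, last ending at the end of the file, the rest evenly between) is an assumption, since this page only says the segments are spread evenly. All names are hypothetical.

```python
from typing import List, Optional

SEGMENT_SEC = 3.0       # "~3s" segments, per the description
MIN_DURATION_SEC = 2.0  # shorter audio short-circuits to PASS with no API call


def effective_segments(requested: int = 0) -> int:
    """Apply the documented rules: use the request only if > 0, then reset <= 1 to 3."""
    n = requested if requested > 0 else 3
    return 3 if n <= 1 else n


def segment_starts(duration_sec: float, requested: int = 0) -> Optional[List[float]]:
    """Return segment start times, or None when the short-audio shortcut applies.

    Assumption: segments are placed evenly from 0 to (duration - segment length).
    """
    if duration_sec < MIN_DURATION_SEC:
        return None
    n = effective_segments(requested)
    span = duration_sec - min(SEGMENT_SEC, duration_sec)
    return [round(span * i / (n - 1), 3) for i in range(n)]
```

Under this placement a 33-second file with the default count is sampled at 0, 15 and 30 seconds.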

Transcript Check `check_transcript`

Provider: local (vision: google/gemini-2.5-flash + speech-to-text)
Endpoint: none (execCheckTranscript → CheckTranscriptMatchesPlan in check_vision.go)
MCP action: qa tool, action: "transcript" → accepts a video OR a pure audio URL plus an optional ISO-639-1 language hint and an optional expected_text. Omit expected_text for transcription-only mode (no compare, verdict PASS).
Cost: 1 credit (transcription + comparison)
Timeout: 2m

Pipeline: extract audio (ffmpeg → mp3; skipped when the input is already audio) → transcribe via our STT step → compare to expected text via an LLM call. If expected_text is omitted, it runs in transcription-only mode: no comparison, verdict PASS. Returns actual_transcript, duration_sec, segments ([{start_s, end_s, text}]), segment_count, and — when comparing — similarity_pct, missing_words, extra_words, verdict. PASS at similarity >= 80%. On any LLM/parse error it falls back to simpleTranscriptCompare (word overlap).

Parameters

Name	Type	Req	Default	Notes
`video`	string	yes	—	Video or pure audio file path/URL; audio extracted automatically when a video is given.
`expected_text`	string	no	—	Expected voiceover text. Optional — omit → transcription-only mode (no compare, verdict `PASS`).
`language`	string	no	—	ISO-639-1 code passed to the STT step (improves accuracy).
`vision_model`	string	no	`google/gemini-2.5-flash`	LLM for semantic comparison.

Mismatch notes: The standalone check does a real LLM comparison (vision.VisionCheck), whereas the same check inside qa_full uses word-overlap only — the two paths differ.
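To make the difference between the two paths concrete, here is a rough sketch of what a word-overlap comparison looks like. It is not simpleTranscriptCompare itself: the tokenization, the similarity definition (share of expected words found in the transcript), and reuse of the 80% threshold are all assumptions chosen for illustration.

```python
import re
from typing import Dict, List


def _words(text: str) -> List[str]:
    """Lowercase and split into word tokens (assumed normalization)."""
    return re.findall(r"[\w']+", text.lower())


def word_overlap_compare(expected: str, actual: str, pass_pct: float = 80.0) -> Dict[str, object]:
    """Compare transcripts by shared words only; no semantics, no word order."""
    exp, act = _words(expected), _words(actual)
    exp_set, act_set = set(exp), set(act)
    missing = [w for w in exp if w not in act_set]
    extra = [w for w in act if w not in exp_set]
    similarity = 100.0 * (len(exp) - len(missing)) / len(exp) if exp else 100.0
    return {
        "similarity_pct": similarity,
        "missing_words": missing,
        "extra_words": extra,
        "verdict": "PASS" if similarity >= pass_pct else "FAIL",
    }
```

A comparison like this treats a paraphrase as a mismatch and ignores word order, which is why results from qa_full and the standalone check can disagree on the same clip.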

Video Description `describe_video`

Provider: local (multimodal analysis model)
Endpoint: none (execDescribeVideo in describe_video.go)
MCP action: qa tool, action: "describe" → routes to describe_video. MCP maps video→in and passes fps/focus through.
Cost: ≈1 credit per 25 s of video at fps: 1, scales with fps; minimum 1 credit.
Timeout: 5m

Watches the whole video and returns a timecoded, scene-by-scene breakdown. The segments partition the video at scene changes (cuts, location changes, clear changes of action); each segment reports start_s/end_s, scene (what visually happens), speech (transcribed words, "" if none), sounds (notable SFX/ambient), and music ("" if none). Async: the call returns a job_id — poll get_status (~every 15 s; a typical run takes 1–3 minutes), then read segments and segment_count from the result.

Parameters

Name	Type	Req	Default	Notes
`video`	string	yes	—	Video URL. Sent to the model as `in`. Max duration 1 hour, and duration × fps must not exceed 3600 (fps 1 → up to 60 min, fps 5 → up to 12 min); larger inputs are rejected.
`fps`	integer	no	1	Frames sampled per second (1–5). Raise for fast-cut footage; cost scales with `fps`.
`focus`	string	no	—	Extra instruction (≤2000 chars), e.g. "focus on product shots" or an expected-shot list to check against.

Notes: the analysis model is pinned server-side (no caller override). Segment text fields are length-capped and the segment list is bounded, so very long or unusual videos return a trimmed but well-formed result.
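Since oversized inputs are rejected, it can be worth checking the documented limits before submitting a job. The sketch below is a client-side pre-check built only from the numbers in the parameter table; the function names are hypothetical and the server remains the authority on what it accepts.

```python
from typing import List

MAX_DURATION_SEC = 3600    # 1 hour
MAX_SAMPLED_FRAMES = 3600  # ceiling on duration x fps
MAX_FOCUS_CHARS = 2000


def max_duration_for_fps(fps: int) -> float:
    """Longest accepted video, in seconds, at a given sampling rate."""
    return min(MAX_DURATION_SEC, MAX_SAMPLED_FRAMES / fps)


def validate_describe_request(duration_sec: float, fps: int = 1, focus: str = "") -> List[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if not 1 <= fps <= 5:
        problems.append("fps must be between 1 and 5")
    if duration_sec > MAX_DURATION_SEC:
        problems.append("video is longer than 1 hour")
    if duration_sec * fps > MAX_SAMPLED_FRAMES:
        problems.append("duration x fps exceeds 3600")
    if len(focus) > MAX_FOCUS_CHARS:
        problems.append("focus is longer than 2000 characters")
    return problems
```

At `fps=5` the ceiling works out to 720 seconds (12 minutes), matching the table.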

`check_audio_loudness`

Provider: local (ffmpeg loudnorm)
Display name: Audio Loudness Check
Category / mode: qa_check / sync
Cost: free (cost_per_unit: 0)
Timeout: 30s
MCP action: none (internal-only; REST POST /v1/jobs/check_audio_loudness or via qa_full)
Handler: execCheckAudioLoudness → CheckAudioLoudness (check_audio.go)

Measures integrated loudness and true peak with a single ffmpeg loudnorm=print_format=json analysis pass, parses the JSON from ffmpeg stderr (input_i, input_tp, input_lra).

Param	Type	Req	Default	Notes
`in`	string	yes	—	Audio file path/URL to check (materialized & SSRF-checked by the executor).
`target_lufs`	number	no	-14	Target integrated LUFS. Cannot be set to literal 0 — handler treats 0 as "unset" and substitutes -14.
`tolerance`	number	no	3	Allowed deviation in LU. 0 → coerced to 3.
`max_true_peak`	number	no	-1	Max true peak in dBTP. 0 → coerced to -1.

Verdict: PASS if |integrated - target| <= tolerance AND true_peak <= max_true_peak, else FAIL. Metrics: lufs_integrated, true_peak_db, lra, plus echoed target_lufs/tolerance/max_true_peak.

Note: handler coerces any 0-valued numeric param to its default (see code-vs-YAML mismatches). If the loudnorm JSON block is missing from stderr the call errors instead of returning a verdict.
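The parse-then-judge flow can be sketched as follows. The JSON keys (`input_i`, `input_tp`, `input_lra`) are the ones named above and are emitted by loudnorm as strings; the function names, and the way the JSON block is located (last `{` to last `}` in stderr), are assumptions for illustration rather than the handler's actual parsing.

```python
import json
from typing import Dict


def parse_loudnorm_json(stderr_text: str) -> Dict[str, float]:
    """Extract the loudnorm measurement block from ffmpeg stderr."""
    start, end = stderr_text.rfind("{"), stderr_text.rfind("}")
    if start == -1 or end < start:
        # Mirrors the documented behaviour: no JSON block means an error, not a verdict.
        raise ValueError("loudnorm JSON block not found in ffmpeg stderr")
    raw = json.loads(stderr_text[start:end + 1])
    return {
        "lufs_integrated": float(raw["input_i"]),
        "true_peak_db": float(raw["input_tp"]),
        "lra": float(raw["input_lra"]),
    }


def loudness_verdict(metrics: Dict[str, float], target_lufs: float = 0.0,
                     tolerance: float = 0.0, max_true_peak: float = 0.0) -> str:
    """Apply the documented rule, including coercion of 0 to each default."""
    target_lufs = target_lufs or -14.0
    tolerance = tolerance or 3.0
    max_true_peak = max_true_peak or -1.0
    within = abs(metrics["lufs_integrated"] - target_lufs) <= tolerance
    return "PASS" if within and metrics["true_peak_db"] <= max_true_peak else "FAIL"
```

The `or` fallbacks make the zero-coercion visible: passing `tolerance=0` behaves exactly like omitting it, so an exact-match tolerance cannot be requested.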

`check_audio_structural`

Provider: local (ffprobe)
Display name: Audio Structural Check
Category / mode: qa_check / sync
Cost: free (cost_per_unit: 0)
Timeout: 30s
MCP action: none (internal-only; REST POST /v1/jobs/check_audio_structural or via qa_full)
Handler: execCheckAudioStructural → CheckAudioStructural (check_audio.go), via Probe (ffprobe)

Probes the file, finds the first audio stream, and checks duration and codec.

Param	Type	Req	Default	Notes
`in`	string	yes	—	Audio file path/URL to check.

Verdict: FAIL if no streams / no audio stream, OR duration < 1.0s, OR codec not in {mp3, aac, pcm_s16le, flac, vorbis, opus}; else PASS. Metrics: duration_sec, sample_rate, channels, codec, bitrate_kbps. Failing reasons listed in issues.

Note: YAML/prompt_guide name the metrics duration and bitrate; handler emits duration_sec and bitrate_kbps (= ffprobe bit_rate / 1000). Sample-rate and channel values are reported but never cause a FAIL.
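The verdict rules can be expressed against ffprobe's JSON output (as produced with `-print_format json -show_streams -show_format`). This is a sketch of the documented rules, not the handler: the function name is hypothetical, and reading the duration from the stream with a fallback to the container format is an assumption.

```python
from typing import Any, Dict

ALLOWED_CODECS = {"mp3", "aac", "pcm_s16le", "flac", "vorbis", "opus"}
MIN_DURATION_SEC = 1.0


def audio_structural_check(probe: Dict[str, Any]) -> Dict[str, Any]:
    """Judge the first audio stream of an ffprobe JSON result."""
    streams = probe.get("streams") or []
    if not streams:
        return {"verdict": "FAIL", "issues": ["no streams"], "metrics": {}}
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    if audio is None:
        return {"verdict": "FAIL", "issues": ["no audio stream"], "metrics": {}}
    # Assumption: stream duration first, container duration as a fallback.
    duration = float(audio.get("duration") or probe.get("format", {}).get("duration") or 0.0)
    codec = audio.get("codec_name", "")
    metrics = {
        "duration_sec": duration,
        "sample_rate": int(audio.get("sample_rate") or 0),  # reported, never fails
        "channels": int(audio.get("channels") or 0),        # reported, never fails
        "codec": codec,
        "bitrate_kbps": float(audio.get("bit_rate") or 0) / 1000.0,
    }
    issues = []
    if duration < MIN_DURATION_SEC:
        issues.append("duration under 1.0s")
    if codec not in ALLOWED_CODECS:
        issues.append(f"unsupported codec: {codec}")
    return {"verdict": "FAIL" if issues else "PASS", "issues": issues, "metrics": metrics}
```

Note that the codec allow-list is narrow: a `pcm_s24le` WAV fails here even though it is a valid audio file.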

`check_audio_tail`

Provider: local (ffmpeg volumedetect)
Display name: Audio Tail Check
Category / mode: qa_check / sync
Cost: free (cost_per_unit: 0)
Timeout: 30s
MCP action: none (internal-only; REST POST /v1/jobs/check_audio_tail or via qa_full)
Handler: execCheckAudioTail → CheckAudioTail (check_audio.go)

Detects an abrupt cut-off at the end of audio (the "v1 VO bug"). Splits the trailing tail_sec window in two and compares per-half RMS measured with ffmpeg volumedetect (mean_volume).

Param	Type	Req	Default	Notes
`in`	string	yes	—	Audio file path/URL to check.
`tail_sec`	number	no	1.0	Seconds of tail to analyze. `<= 0` → coerced to 1.0; clamped down to total duration if shorter.
`silence_db`	number	no	-40	RMS dB threshold below which the tail counts as silent. Cannot be set to literal 0 — 0 → coerced to -40.

Verdict: PASS if rms_second_half <= silence_db (silent) OR rms_second_half < rms_first_half * 0.7 (fading); else FAIL ("tail not fading"). Metrics: tail_sec, silence_db, rms_first_half, rms_second_half, is_silent, is_fading, total_duration.

Note: YAML prose says PASS when the second half is merely "quieter"; the handler is stricter and requires a ≥30% RMS drop (* 0.7). An unmeasurable half returns -100 dB (treated as silent → PASS).
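A literal transcription of the verdict rule makes its behaviour easier to test. The sketch below assumes both RMS values are the dB figures reported by volumedetect (as the silence_db comparison and the -100 dB sentinel suggest) and applies the formula exactly as written; the function names are hypothetical and the handler's internal units are not confirmed by this page.

```python
from typing import Dict


def effective_tail_sec(tail_sec: float, total_duration: float) -> float:
    """Apply the documented coercion and clamp for the analysis window."""
    if tail_sec <= 0:
        tail_sec = 1.0
    return min(tail_sec, total_duration)


def tail_verdict(rms_first_half: float, rms_second_half: float,
                 silence_db: float = 0.0) -> Dict[str, object]:
    """PASS when the second half is silent or fading, per the formula as written."""
    silence_db = silence_db or -40.0  # 0 is treated as unset
    is_silent = rms_second_half <= silence_db
    is_fading = rms_second_half < rms_first_half * 0.7
    return {
        "is_silent": is_silent,
        "is_fading": is_fading,
        "verdict": "PASS" if is_silent or is_fading else "FAIL",
    }
```

One consequence of reading the formula literally with negative dB values: multiplying by 0.7 moves the threshold toward 0 dB, so a first half at -20 dB gives a threshold of -14 dB, and a second half at -18 dB still counts as fading. If the handler compares dB values this way, the rule is looser than a 30% drop in linear RMS, so it is worth confirming against check_audio.go before relying on it.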

`check_motion_artifacts`

Provider: local (ffmpeg signalstats YDIF)
Display name: Motion Artifacts Check
Category / mode: qa_check / sync
Cost: free (cost_per_unit: 0)
Timeout: 2m
MCP action: none (internal-only; REST POST /v1/jobs/check_motion_artifacts or via qa_full)
Handler: execCheckMotionArtifacts → CheckMotionArtifacts (check_video.go)

Scans for frame-to-frame luminance-difference spikes that indicate glitches or unintended jump cuts. Parses YDIF from an ffmpeg signalstats=stat=tout pass, computes the mean, and flags every frame whose diff > mean * spike_factor.

Param	Type	Req	Default	Notes
`in`	string	yes	—	Video file path/URL to check.
`spike_factor`	number	no	4	A frame whose diff exceeds `mean * spike_factor` is a spike. `<= 0` → coerced to 4. Lower = stricter.

Verdict: PASS if spikes_count <= 1 (a single spike can be a legitimate transition); FAIL if > 1. Metrics: frames_checked, mean_diff, max_diff, stddev, spike_factor, spikes_count, spike_frames.

Note: the handler runs an extra mestimate+metadata=print pass whose output is discarded; only the signalstats YDIF pass is used. If no YDIF lines parse, it returns PASS with a "could not extract frame differences" warning. spike_frames are indices into the parsed YDIF list, not absolute video frame numbers.
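The spike rule operates on a plain list of per-frame YDIF values, so it is easy to reproduce offline. The sketch below follows the documented rule (mean times spike_factor, at most one spike allowed); the function name is hypothetical and the YDIF values are assumed to have been parsed already.

```python
from statistics import mean
from typing import Dict, List


def motion_spike_check(ydif: List[float], spike_factor: float = 4.0) -> Dict[str, object]:
    """Flag frames whose luminance difference exceeds mean * spike_factor."""
    if spike_factor <= 0:
        spike_factor = 4.0  # documented coercion
    if not ydif:
        # Documented behaviour: nothing parsed still yields PASS, with a warning.
        return {"verdict": "PASS", "frames_checked": 0,
                "warning": "could not extract frame differences"}
    mean_diff = mean(ydif)
    threshold = mean_diff * spike_factor
    # Indices refer to positions in the parsed list, not video frame numbers.
    spike_frames = [i for i, diff in enumerate(ydif) if diff > threshold]
    return {
        "verdict": "PASS" if len(spike_frames) <= 1 else "FAIL",
        "frames_checked": len(ydif),
        "mean_diff": mean_diff,
        "max_diff": max(ydif),
        "spikes_count": len(spike_frames),
        "spike_frames": spike_frames,
    }
```

Because the threshold is relative to the mean, a video with many hard cuts raises its own mean and can hide individual spikes; lowering `spike_factor` compensates for that.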

`overexposure_check`

Provider: local (ffmpeg signalstats BRNG)
Display name: Overexposure Check
Category / mode: qa_check / sync
Cost: free (cost_per_unit: 0)
Timeout: 2m
MCP action: none (internal-only; REST POST /v1/jobs/overexposure_check or via qa_full)
Handler: execOverexposureCheck → CheckOverexposure (overexposure.go)

Detects blown-out highlights in an image or video. Samples frames at sample_fps and reads signalstats BRNG (percent of pixels outside broadcast range) as the clipped-pixel proxy, taking the worst sampled frame.

Param	Type	Req	Default	Notes
`in`	string	yes	—	Image or video path/URL to check.
`max_clipped_pct`	number	no	3.0	Max % of clipped pixels before FAIL. `<= 0` → coerced to 3.0.
`sample_fps`	number	no	2	Frames per second to sample (video). `<= 0` → coerced to 2. Read as an int.

Verdict: PASS if worst_frame_pct <= max_clipped_pct; else FAIL (suggested fix: apply highlight_rolloff, then re-check). Metrics: worst_frame_pct, max_clipped_pct, frames_checked, max_brng.

Note: the YAML describes "clipped pixels at max luminance", but the handler measures BRNG (broadcast-range %), not a true white-clip count, so worst_frame_pct is a proxy. A histogram pass runs first, but its output is discarded. If signalstats returns no BRNG frames, the handler returns PASS with a "signalstats not available" warning (frames_checked: 0), which can mask genuine overexposure.
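The worst-frame rule and the empty-result fallback can be sketched as below. The function name is hypothetical, and the sketch assumes the per-frame BRNG values have already been parsed and are expressed as percentages; how the handler scales the raw signalstats value is not stated on this page.

```python
from typing import Dict, List


def overexposure_verdict(brng_pct_per_frame: List[float],
                         max_clipped_pct: float = 3.0) -> Dict[str, object]:
    """Judge a clip by its worst sampled frame (assumes values are percentages)."""
    if max_clipped_pct <= 0:
        max_clipped_pct = 3.0  # documented coercion
    if not brng_pct_per_frame:
        # Documented behaviour: no data still yields PASS, so callers should
        # treat frames_checked == 0 as "not actually checked".
        return {"verdict": "PASS", "frames_checked": 0,
                "warning": "signalstats not available"}
    worst = max(brng_pct_per_frame)
    return {
        "verdict": "PASS" if worst <= max_clipped_pct else "FAIL",
        "worst_frame_pct": worst,
        "max_clipped_pct": max_clipped_pct,
        "frames_checked": len(brng_pct_per_frame),
    }
```

A caller that wants to avoid the masking problem can check `frames_checked` before trusting a PASS.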
