
QA checks

Quality checks for generated media — our own implementations (local ffmpeg, with a vision/STT call for some).

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

Full QA Pipeline qa_full

  • Provider: local (ffmpeg + vision: google/gemini-2.5-flash, optional speech-to-text)
  • Endpoint: none (in-process pipeline, execQAFullQAPipeline in qa_pipeline.go)
  • MCP action: qa tool, action: "full" → routes to qa_full (QA_MODELS.full)
  • Cost: 1 credit per run (one vision call + optional transcription). This is an upper bound: the run is free if no vision client is configured.
  • Timeout: 5m

Runs all QA checks on a finished video in one pass. Probes the video, extracts 5 frames (10/30/50/70/90%) once, extracts audio once, then runs ffmpeg checks (overexposure, motion artifacts, audio structural/loudness/tail) plus a single multi-frame Gemini call (person consistency, visual quality, and — when a plan is given — scene-matches-plan). If STT+vision are configured and plan.vo_text is present, also runs an in-pipeline transcript word-overlap check. Returns per-check PASS/FAIL/SKIP/ERROR and an overall verdict (FAIL if any check fails).

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| video | string | yes | | Video file path. Handler reads inputs["video"]. |
| plan | object | no | | Shooting plan: SET, LIGHT, SHOT_TYPE, ACTORS_ACTION, vo_text. Presence of SET enables the scene-matches-plan sub-check; vo_text enables the transcript sub-check. |
| expected_characters | integer | no | 1 | Declared in YAML but NOT read by the handler: person-consistency always runs across all frames regardless. Inert. |

Mismatch notes: vision model is hard-coded to google/gemini-2.5-flash (no override field). The transcript sub-check uses simpleTranscriptCompare (word overlap, no second LLM call), unlike standalone check_transcript. Audio checks emit SKIP if the video has no audio track.
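
As a rough illustration of the sampling and verdict rules described above, the sketch below computes the five frame timestamps and folds per-check results into an overall verdict. The function names are hypothetical and are not taken from qa_pipeline.go. How an ERROR result affects the overall verdict is not stated here, so the sketch assumes only FAIL counts as failing.

```python
from typing import Dict, List

# Frame positions named in the description: 10/30/50/70/90% of the duration.
FRAME_POSITIONS = (0.10, 0.30, 0.50, 0.70, 0.90)


def frame_timestamps(duration_sec: float) -> List[float]:
    """Return the timestamps (seconds) at which the five frames would be taken."""
    return [round(duration_sec * p, 3) for p in FRAME_POSITIONS]


def overall_verdict(checks: Dict[str, str]) -> str:
    """FAIL if any check reports FAIL; SKIP and ERROR are assumed not to fail the run."""
    return "FAIL" if any(v == "FAIL" for v in checks.values()) else "PASS"


print(frame_timestamps(10.0))
print(overall_verdict({"overexposure": "PASS", "audio_tail": "SKIP"}))
```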


Same Person Check check_same_person

  • Provider: local (vision: google/gemini-2.5-flash)
  • Endpoint: none (execCheckSamePerson → CheckSamePerson in check_vision.go)
  • MCP action: qa tool, action: "person" → routes to check_same_person. The MCP layer maps image1 → ref and image2 → test.
  • Cost: 1 credit (one vision call)
  • Timeout: 30s

Compares facial features between a reference image and a test image (or video — mid-frame auto-extracted via extractMidFrame). Sends both to Gemini with VisionCheckMulti. Returns same_person (bool), confidence (0–100), differences (list), and verdict. PASS requires same_person == true AND confidence >= min_confidence.

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| ref | string | yes | | Reference image URL (persona_ref). Passed to the API as-is (no base64 conversion). |
| test | string | yes | | Test image path/URL, or a video (mid-frame extracted, ext in .mp4/.mov/.avi/.mkv/.webm). |
| min_confidence | integer | no | 85 | Min confidence (0–100) for PASS. Handler re-clamps to 85 if <= 0. |
| model | string | no | google/gemini-2.5-flash | Vision model override. |

Mismatch notes: YAML/handler fields match exactly. Errors if the vision client is not configured on the server, or if ref/test is empty.
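
The PASS rule and the min_confidence re-clamp can be written out as a small function. This is a sketch of the documented rule only; same_person_verdict is a hypothetical name, not the handler's.

```python
def same_person_verdict(same_person: bool, confidence: int, min_confidence: int = 85) -> str:
    """PASS requires same_person == true AND confidence >= min_confidence."""
    # Documented behavior: a min_confidence of 0 or less falls back to 85.
    if min_confidence <= 0:
        min_confidence = 85
    return "PASS" if same_person and confidence >= min_confidence else "FAIL"


print(same_person_verdict(True, 90))
print(same_person_verdict(True, 80, min_confidence=0))
```

One consequence of the re-clamp is that min_confidence cannot be used to disable the confidence gate: passing 0 behaves the same as passing 85.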


Scene Matches Plan Check check_scene_matches_plan

  • Provider: local (vision: google/gemini-2.5-flash)
  • Endpoint: none (execCheckSceneMatchesPlan → CheckSceneMatchesPlan in check_vision.go)
  • MCP action: qa tool, action: "scene" → routes to check_scene_matches_plan. MCP maps video → in and passes plan through. Both video and plan are required at the MCP layer.
  • Cost: 1 credit
  • Timeout: 30s

Checks each shooting-plan field (SET / LIGHT / SHOT_TYPE / ACTORS_ACTION) against the image. For video input, extracts the mid-frame. Sends the plan as JSON + the image to Gemini (VisionCheck). Returns per-field {verdict, reason} under fields, plus overall verdict (FAIL if any field fails; the model is instructed to only judge fields present in the plan).

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Image or video path to check. Handler reads inputs["in"]. |
| plan | object | yes | | Plan object with SET, LIGHT, SHOT_TYPE, ACTORS_ACTION. Handler errors if nil. |
| model | string | no | google/gemini-2.5-flash | Vision model override. |

Mismatch notes: YAML/handler fields match. Note the field name is in (not video/image); the MCP scene action takes video and remaps it.
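
To make the video-to-in remap and the per-field verdict rule concrete, here is a minimal sketch. Both function names are hypothetical, and the shape of the fields mapping simply follows the {verdict, reason} structure described above.

```python
from typing import Dict


def remap_scene_args(mcp_args: dict) -> dict:
    """Mimic the MCP scene action: both video and plan are required, video becomes in."""
    if not mcp_args.get("video") or mcp_args.get("plan") is None:
        raise ValueError("scene action requires both video and plan")
    return {"in": mcp_args["video"], "plan": mcp_args["plan"]}


def scene_verdict(fields: Dict[str, dict]) -> str:
    """Overall verdict is FAIL if any judged field fails."""
    return "FAIL" if any(f["verdict"] == "FAIL" for f in fields.values()) else "PASS"


inputs = remap_scene_args({"video": "clip.mp4", "plan": {"SET": "kitchen", "LIGHT": "warm"}})
print(inputs)
print(scene_verdict({
    "SET": {"verdict": "PASS", "reason": "kitchen visible"},
    "LIGHT": {"verdict": "FAIL", "reason": "cold daylight"},
}))
```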


Image Description Check check_image_description

  • Provider: local (vision: google/gemini-2.5-flash)
  • Endpoint: none (execCheckImageDescription → CheckImageDescription in check_vision.go)
  • MCP action: qa tool, action: "image" → routes to check_image_description. MCP maps image_url → in and passes description through.
  • Cost: 1 credit
  • Timeout: 30s

Sends an image + expected description to Gemini; the model judges whether the image matches. Local files are read and base64-encoded as a data:image/png URI; http-prefixed inputs are passed as-is. Uses structured output (VisionCheckStructured with a verdict/match/reason/details schema) and falls back to unstructured VisionCheck on error. Returns verdict (PASS/FAIL), match (bool), reason, and details (found/missing elements).

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Image path (local) or URL. |
| description | string | yes | | Expected description text. |
| model | string | no | google/gemini-2.5-flash | Vision model override. |

Mismatch notes: YAML/handler fields match. Caveat: non-http paths are always encoded as image/png regardless of real extension — a .jpg is still sent with a PNG MIME label (works with Gemini, but technically mislabeled).
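
The input handling described above (http inputs passed through, everything else read and labeled as PNG) might look roughly like the following. to_image_url is a hypothetical helper, and the ";base64," separator is the standard data-URI form rather than something quoted from the handler.

```python
import base64
from pathlib import Path


def to_image_url(src: str) -> str:
    """Return the value that would be sent to the vision model for a given input."""
    # http-prefixed inputs are passed as-is.
    if src.startswith("http"):
        return src
    # Local files are base64-encoded and always labeled image/png,
    # regardless of the real extension (the caveat noted above).
    payload = base64.b64encode(Path(src).read_bytes()).decode("ascii")
    return "data:image/png;base64," + payload


print(to_image_url("https://example.com/frame.jpg"))
```

A stricter variant could derive the MIME type from the file extension, but that would be a change from the behavior documented here.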


Voice Consistency Check check_voice_consistency

  • Provider: local (vision/audio: google/gemini-2.5-flash)
  • Endpoint: none (execCheckVoiceConsistency → CheckVoiceConsistency in check_audio.go)
  • MCP action: qa tool, action: "voice" → routes to check_voice_consistency. MCP maps audio → in.
  • Cost: 1 credit
  • Timeout: 30s

Extracts N short (~3s) audio segments evenly across the file with ffmpeg, base64-encodes them as data:audio/mpeg URIs, and sends all segments to Gemini in one structured call to judge whether the same speaker (pitch, timbre, accent, style, gender, age impression) is present throughout. Returns verdict (PASS/FAIL), same_speaker (bool), issues (list).

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Audio file (mp3/wav/aac). |
| segments | integer | no | 3 | Number of segments to compare. Handler overrides only when > 0; internally re-clamps <= 1 to 3. |
| model | string | no | google/gemini-2.5-flash | Model override. |

Mismatch notes: Undocumented short-circuit — audio under 2.0s returns PASS immediately with note: "audio too short to compare segments" (no API call). Needs ≥2 extractable segments or it errors.
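
A sketch of how the short-circuit and segment placement could fit together is shown below. The exact offsets the handler uses are not documented here: evenly spaced start times and a fixed 3-second segment length are assumptions, and both function names are hypothetical.

```python
from typing import List, Optional


def short_audio_result(duration_sec: float) -> Optional[dict]:
    """Audio under 2.0s returns PASS immediately, with no API call."""
    if duration_sec < 2.0:
        return {"verdict": "PASS", "note": "audio too short to compare segments"}
    return None


def segment_starts(duration_sec: float, segments: int = 3, seg_len: float = 3.0) -> List[float]:
    """Evenly spaced segment start times (assumed layout, not the handler's exact math)."""
    # Documented re-clamp: a segment count of 1 or less becomes 3.
    if segments <= 1:
        segments = 3
    last_start = max(duration_sec - seg_len, 0.0)
    return [round(last_start * i / (segments - 1), 3) for i in range(segments)]


print(short_audio_result(1.5))
print(segment_starts(12.0))
```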


Transcript Check check_transcript

  • Provider: local (vision: google/gemini-2.5-flash + speech-to-text)
  • Endpoint: none (execCheckTranscript → CheckTranscriptMatchesPlan in check_vision.go)
  • MCP action: qa tool, action: "transcript" → accepts a video OR a pure audio URL plus an optional ISO-639-1 language hint and an optional expected_text. Omit expected_text for transcription-only mode (no compare, verdict PASS).
  • Cost: 1 credit (transcription + comparison)
  • Timeout: 2m

Pipeline: extract audio (ffmpeg → mp3; skipped when the input is already audio) → transcribe via our STT step → compare to expected text via an LLM call. If expected_text is omitted, it runs in transcription-only mode: no comparison, verdict PASS. Returns actual_transcript, duration_sec, segments ([{start_s, end_s, text}]), segment_count, and — when comparing — similarity_pct, missing_words, extra_words, verdict. PASS at similarity >= 80%. On any LLM/parse error it falls back to simpleTranscriptCompare (word overlap).

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| video | string | yes | | Video or pure audio file path/URL; audio extracted automatically when a video is given. |
| expected_text | string | no | | Expected voiceover text. Optional: omit for transcription-only mode (no compare, verdict PASS). |
| language | string | no | | ISO-639-1 code passed to the STT step (improves accuracy). |
| vision_model | string | no | google/gemini-2.5-flash | LLM for semantic comparison. |

Mismatch notes: The standalone check does a real LLM comparison (vision.VisionCheck), whereas the same check inside qa_full uses word-overlap only — the two paths differ.
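
For intuition about the word-overlap path, here is one plausible way such a comparison could work. It is not the actual simpleTranscriptCompare: the tokenization (lowercased ASCII words), the overlap formula, and the reuse of the 80% threshold for the fallback are all assumptions made for illustration.

```python
import re


def word_overlap_pct(expected: str, actual: str) -> float:
    """Share of expected words that also appear in the transcript, as a percentage."""
    expected_words = re.findall(r"[a-z0-9']+", expected.lower())
    actual_words = set(re.findall(r"[a-z0-9']+", actual.lower()))
    if not expected_words:
        return 100.0
    hits = sum(1 for w in expected_words if w in actual_words)
    return 100.0 * hits / len(expected_words)


def transcript_verdict(similarity_pct: float) -> str:
    """PASS at similarity >= 80%."""
    return "PASS" if similarity_pct >= 80 else "FAIL"


pct = word_overlap_pct("Hello brave new world", "hello new world!")
print(pct, transcript_verdict(pct))
```

A purely lexical measure like this treats paraphrases as misses, which is one reason the LLM comparison and the word-overlap path can disagree on the same clip.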


Video Description describe_video

  • Provider: local (multimodal analysis model)
  • Endpoint: none (execDescribeVideo in describe_video.go)
  • MCP action: qa tool, action: "describe" → routes to describe_video. MCP maps video → in and passes fps/focus through.
  • Cost: ≈1 credit per 25 s of video at fps: 1, scales with fps; minimum 1 credit.
  • Timeout: 5m

Watches the whole video and returns a timecoded, scene-by-scene breakdown. The segments partition the video at scene changes (cuts, location changes, clear changes of action); each segment reports start_s/end_s, scene (what visually happens), speech (transcribed words, "" if none), sounds (notable SFX/ambient), and music ("" if none). Async: the call returns a job_id — poll get_status (~every 15 s; a typical run takes 1–3 minutes), then read segments and segment_count from the result.

Parameters

| Name | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| video | string | yes | | Video URL. Sent to the model as in. Max duration 1 hour, and duration × fps must not exceed 3600 (fps 1 → up to 60 min, fps 5 → up to 12 min); larger inputs are rejected. |
| fps | integer | no | 1 | Frames sampled per second (1–5). Raise for fast-cut footage; cost scales with fps. |
| focus | string | no | | Extra instruction (≤2000 chars), e.g. "focus on product shots" or an expected-shot list to check against. |

Notes: the analysis model is pinned server-side (no caller override). Segment text fields are length-capped and the segment list is bounded, so very long or unusual videos return a trimmed but well-formed result.
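
The input limits and the cost rule can be checked client-side before submitting a job. The sketch below encodes the documented limits; the credit estimate assumes cost scales linearly with fps and leaves rounding out, since neither detail is spelled out here. Function names are hypothetical.

```python
def validate_describe_request(duration_sec: float, fps: int = 1, focus: str = "") -> None:
    """Raise ValueError if the request would be rejected under the documented limits."""
    if not 1 <= fps <= 5:
        raise ValueError("fps must be between 1 and 5")
    if duration_sec > 3600:
        raise ValueError("video longer than 1 hour")
    if duration_sec * fps > 3600:
        raise ValueError("duration x fps exceeds 3600")
    if len(focus) > 2000:
        raise ValueError("focus longer than 2000 chars")


def estimate_credits(duration_sec: float, fps: int = 1) -> float:
    """About 1 credit per 25 s at fps 1, minimum 1 (linear fps scaling is assumed)."""
    return max(1.0, duration_sec * fps / 25.0)


validate_describe_request(720, fps=5)   # 12 minutes at fps 5 is the documented ceiling
print(estimate_credits(100, fps=2))
```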

check_audio_loudness

  • Provider: local (ffmpeg loudnorm)
  • Display name: Audio Loudness Check
  • Category / mode: qa_check / sync
  • Cost: free (cost_per_unit: 0)
  • Timeout: 30s
  • MCP action: none (internal-only; REST POST /v1/jobs/check_audio_loudness or via qa_full)
  • Handler: execCheckAudioLoudness → CheckAudioLoudness (check_audio.go)

Measures integrated loudness and true peak with a single ffmpeg loudnorm=print_format=json analysis pass, parses the JSON from ffmpeg stderr (input_i, input_tp, input_lra).

| Param | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Audio file path/URL to check (materialized & SSRF-checked by the executor). |
| target_lufs | number | no | -14 | Target integrated LUFS. Cannot be set to literal 0: the handler treats 0 as "unset" and substitutes -14. |
| tolerance | number | no | 3 | Allowed deviation in LU. 0 → coerced to 3. |
| max_true_peak | number | no | -1 | Max true peak in dBTP. 0 → coerced to -1. |

Verdict: PASS if |integrated - target| <= tolerance AND true_peak <= max_true_peak, else FAIL. Metrics: lufs_integrated, true_peak_db, lra, plus echoed target_lufs/tolerance/max_true_peak.

Note: handler coerces any 0-valued numeric param to its default (see code-vs-YAML mismatches). If the loudnorm JSON block is missing from stderr the call errors instead of returning a verdict.
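
The parse step and the verdict rule, including the zero-to-default coercion, can be sketched as follows. The function names are hypothetical; the input_i/input_tp/input_lra keys are the ones named above, and ffmpeg's loudnorm prints them as quoted strings.

```python
import json


def parse_loudnorm(stderr: str) -> dict:
    """Pull the trailing loudnorm JSON block out of ffmpeg stderr."""
    start, end = stderr.rfind("{"), stderr.rfind("}")
    if start == -1 or end < start:
        raise ValueError("loudnorm JSON block not found in stderr")
    raw = json.loads(stderr[start:end + 1])
    return {
        "lufs_integrated": float(raw["input_i"]),
        "true_peak_db": float(raw["input_tp"]),
        "lra": float(raw["input_lra"]),
    }


def loudness_verdict(lufs_integrated: float, true_peak_db: float,
                     target_lufs: float = 0, tolerance: float = 0,
                     max_true_peak: float = 0) -> str:
    """PASS if |integrated - target| <= tolerance AND true_peak <= max_true_peak."""
    # Documented quirk: a literal 0 is treated as "unset" and replaced by the default.
    target = target_lufs if target_lufs != 0 else -14
    tol = tolerance if tolerance != 0 else 3
    peak_max = max_true_peak if max_true_peak != 0 else -1
    ok = abs(lufs_integrated - target) <= tol and true_peak_db <= peak_max
    return "PASS" if ok else "FAIL"


sample = '[Parsed_loudnorm_0]\n{\n\t"input_i" : "-15.20",\n\t"input_tp" : "-1.50",\n\t"input_lra" : "4.30"\n}\n'
metrics = parse_loudnorm(sample)
print(metrics, loudness_verdict(metrics["lufs_integrated"], metrics["true_peak_db"]))
```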


check_audio_structural

  • Provider: local (ffprobe)
  • Display name: Audio Structural Check
  • Category / mode: qa_check / sync
  • Cost: free (cost_per_unit: 0)
  • Timeout: 30s
  • MCP action: none (internal-only; REST POST /v1/jobs/check_audio_structural or via qa_full)
  • Handler: execCheckAudioStructural → CheckAudioStructural (check_audio.go), via Probe (ffprobe)

Probes the file, finds the first audio stream, and checks duration and codec.

| Param | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Audio file path/URL to check. |

Verdict: FAIL if no streams / no audio stream, OR duration < 1.0s, OR codec not in {mp3, aac, pcm_s16le, flac, vorbis, opus}; else PASS. Metrics: duration_sec, sample_rate, channels, codec, bitrate_kbps. Failing reasons listed in issues.

Note: YAML/prompt_guide name the metrics duration and bitrate; handler emits duration_sec and bitrate_kbps (= ffprobe bit_rate / 1000). Sample-rate and channel values are reported but never cause a FAIL.
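
The verdict rule can be expressed against ffprobe-style stream records. This is a sketch under assumptions: structural_verdict is a hypothetical name, the input is a list of stream dicts as ffprobe's JSON output provides them, and reading the duration from the audio stream (rather than the container) is a guess about where the handler gets it.

```python
ALLOWED_CODECS = {"mp3", "aac", "pcm_s16le", "flac", "vorbis", "opus"}


def structural_verdict(streams: list) -> dict:
    """Apply the documented FAIL conditions to the first audio stream."""
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    if audio is None:
        return {"verdict": "FAIL", "issues": ["no audio stream"]}
    duration = float(audio.get("duration", 0) or 0)
    codec = audio.get("codec_name", "")
    issues = []
    if duration < 1.0:
        issues.append(f"duration {duration:.2f}s is under 1.0s")
    if codec not in ALLOWED_CODECS:
        issues.append(f"codec {codec!r} not allowed")
    return {
        "verdict": "FAIL" if issues else "PASS",
        "issues": issues,
        "duration_sec": duration,
        "codec": codec,
        # Reported for information only; these never cause a FAIL.
        "sample_rate": int(audio.get("sample_rate", 0) or 0),
        "channels": int(audio.get("channels", 0) or 0),
        "bitrate_kbps": float(audio.get("bit_rate", 0) or 0) / 1000,
    }


print(structural_verdict([{"codec_type": "audio", "codec_name": "aac",
                           "duration": "12.5", "bit_rate": "128000",
                           "sample_rate": "44100", "channels": 2}]))
```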


check_audio_tail

  • Provider: local (ffmpeg volumedetect)
  • Display name: Audio Tail Check
  • Category / mode: qa_check / sync
  • Cost: free (cost_per_unit: 0)
  • Timeout: 30s
  • MCP action: none (internal-only; REST POST /v1/jobs/check_audio_tail or via qa_full)
  • Handler: execCheckAudioTail → CheckAudioTail (check_audio.go)

Detects an abrupt cut-off at the end of audio (the "v1 VO bug"). Splits the trailing tail_sec window in two and compares per-half RMS measured with ffmpeg volumedetect (mean_volume).

| Param | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Audio file path/URL to check. |
| tail_sec | number | no | 1.0 | Seconds of tail to analyze. <= 0 → coerced to 1.0; clamped down to total duration if shorter. |
| silence_db | number | no | -40 | RMS dB threshold below which the tail counts as silent. Cannot be set to literal 0: 0 → coerced to -40. |

Verdict: PASS if rms_second_half <= silence_db (silent) OR rms_second_half < rms_first_half * 0.7 (fading); else FAIL ("tail not fading"). Metrics: tail_sec, silence_db, rms_first_half, rms_second_half, is_silent, is_fading, total_duration.

Note: YAML prose says PASS when the second half is merely "quieter"; the handler is stricter and requires a ≥30% RMS drop (* 0.7). An unmeasurable half returns -100 dB (treated as silent → PASS).
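
Written out literally, the verdict rule looks like the sketch below. tail_verdict is a hypothetical name, and the comparison is applied to the values exactly as stated in the rule; whether the handler converts the dB readings to linear amplitude before multiplying by 0.7 is not stated here.

```python
def tail_verdict(rms_first_half: float, rms_second_half: float, silence_db: float = 0) -> dict:
    """PASS if the second half is silent OR fading, per the documented rule."""
    # Documented quirk: a literal 0 threshold is coerced to -40.
    if silence_db == 0:
        silence_db = -40
    is_silent = rms_second_half <= silence_db
    is_fading = rms_second_half < rms_first_half * 0.7
    return {
        "verdict": "PASS" if (is_silent or is_fading) else "FAIL",
        "is_silent": is_silent,
        "is_fading": is_fading,
    }


print(tail_verdict(-20.0, -45.0))   # silent tail
print(tail_verdict(-20.0, -10.0))   # tail louder than the first half
```

When both inputs are negative dB values, multiplying the first half by 0.7 moves the threshold toward 0, so it is worth testing borderline clips against the real endpoint rather than relying on this sketch.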


check_motion_artifacts

  • Provider: local (ffmpeg signalstats YDIF)
  • Display name: Motion Artifacts Check
  • Category / mode: qa_check / sync
  • Cost: free (cost_per_unit: 0)
  • Timeout: 2m
  • MCP action: none (internal-only; REST POST /v1/jobs/check_motion_artifacts or via qa_full)
  • Handler: execCheckMotionArtifacts → CheckMotionArtifacts (check_video.go)

Scans for frame-to-frame luminance-difference spikes that indicate glitches or unintended jump cuts. Parses YDIF from an ffmpeg signalstats=stat=tout pass, computes mean, and flags frames where diff > mean * spike_factor.

| Param | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Video file path/URL to check. |
| spike_factor | number | no | 4 | A frame whose diff exceeds mean * spike_factor is a spike. <= 0 → coerced to 4. Lower = stricter. |

Verdict: PASS if spikes_count <= 1 (a single spike can be a legitimate transition); FAIL if > 1. Metrics: frames_checked, mean_diff, max_diff, stddev, spike_factor, spikes_count, spike_frames.

Note: the handler runs an extra mestimate+metadata=print pass whose output is discarded — only the signalstats YDIF pass is used. If no YDIF lines parse, it returns PASS with a could not extract frame differences warning. spike_frames are indices into the parsed YDIF list, not absolute video frame numbers.
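
The spike rule is simple enough to reproduce on a list of already-parsed YDIF values. The sketch below assumes the parsing has been done elsewhere; find_spikes is a hypothetical name.

```python
from statistics import mean
from typing import List


def find_spikes(ydif: List[float], spike_factor: float = 4.0) -> dict:
    """Flag entries where diff > mean * spike_factor; more than one spike is a FAIL."""
    if spike_factor <= 0:
        spike_factor = 4.0
    if not ydif:
        # Documented behavior: nothing parsed still yields PASS, with a warning.
        return {"verdict": "PASS", "warning": "could not extract frame differences",
                "frames_checked": 0, "spikes_count": 0, "spike_frames": []}
    mean_diff = mean(ydif)
    # Indices refer to positions in the parsed list, not absolute frame numbers.
    spike_frames = [i for i, d in enumerate(ydif) if d > mean_diff * spike_factor]
    return {
        "verdict": "PASS" if len(spike_frames) <= 1 else "FAIL",
        "frames_checked": len(ydif),
        "mean_diff": mean_diff,
        "max_diff": max(ydif),
        "spikes_count": len(spike_frames),
        "spike_frames": spike_frames,
    }


print(find_spikes([1.0] * 20 + [50.0, 50.0]))
```

Because the threshold is relative to the mean, a clip with many large differences raises its own bar, so uniformly noisy footage can pass while a single pair of hard cuts fails.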


overexposure_check

  • Provider: local (ffmpeg signalstats BRNG)
  • Display name: Overexposure Check
  • Category / mode: qa_check / sync
  • Cost: free (cost_per_unit: 0)
  • Timeout: 2m
  • MCP action: none (internal-only; REST POST /v1/jobs/overexposure_check or via qa_full)
  • Handler: execOverexposureCheck → CheckOverexposure (overexposure.go)

Detects blown-out highlights in an image or video. Samples frames at sample_fps and reads signalstats BRNG (percent of pixels outside broadcast range) as the clipped-pixel proxy, taking the worst sampled frame.

| Param | Type | Req | Default | Notes |
| --- | --- | --- | --- | --- |
| in | string | yes | | Image or video path/URL to check. |
| max_clipped_pct | number | no | 3.0 | Max % of clipped pixels before FAIL. <= 0 → coerced to 3.0. |
| sample_fps | number | no | 2 | Frames per second to sample (video). <= 0 → coerced to 2. Read as an int. |

Verdict: PASS if worst_frame_pct <= max_clipped_pct; else FAIL (suggested fix: apply highlight_rolloff, then re-check). Metrics: worst_frame_pct, max_clipped_pct, frames_checked, max_brng.

Note: the YAML describes "clipped pixels at max luminance", but the handler measures BRNG (broadcast-range %), not a true white-clip count — worst_frame_pct is a proxy. A discarded histogram pass runs first. If signalstats returns no BRNG frames, the handler returns PASS with a signalstats not available warning (frames_checked: 0), which can mask genuine overexposure.
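
The worst-frame rule and the empty-result fallback can be sketched as below. overexposure_verdict is a hypothetical name, and the input is assumed to be per-frame BRNG readings already expressed as percentages; how the handler converts raw signalstats output to percent is not covered here.

```python
from typing import List


def overexposure_verdict(brng_pcts: List[float], max_clipped_pct: float = 3.0) -> dict:
    """PASS if the worst sampled frame is at or under max_clipped_pct."""
    if max_clipped_pct <= 0:
        max_clipped_pct = 3.0
    if not brng_pcts:
        # Documented caveat: no readings still yields PASS, which can hide real problems.
        return {"verdict": "PASS", "warning": "signalstats not available", "frames_checked": 0}
    worst = max(brng_pcts)
    return {
        "verdict": "PASS" if worst <= max_clipped_pct else "FAIL",
        "worst_frame_pct": worst,
        "max_clipped_pct": max_clipped_pct,
        "frames_checked": len(brng_pcts),
    }


print(overexposure_verdict([0.4, 1.2, 5.8]))
```

Callers that want a conservative gate may prefer to treat a result with frames_checked equal to 0 as inconclusive rather than as a pass.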
