
Video processing & assembly

Auto-captions and lipsync plus local ffmpeg pipelines (free, our implementation).

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

Auto Subtitles captions_auto

Automatically transcribe a video's audio and burn in karaoke-style subtitles with word-level highlighting, customizable Google Fonts, colors, and animation.

Call it via: video tool, action: "captions" (MCP) · raw: POST /v1/jobs/captions_auto

Cost: 6 cr per minute of video
Mode / timeout: webhook / 10m

Parameters — the model's input schema:

Param | Type | Required | Default | Allowed / range | Description
video_url | string | | | | URL of the video file to add automatic subtitles to (max 100 MB).
language | string | | en | 2-letter code (en, es, fr, de, it, pt, nl, ja, zh, ko, …) or 3-letter ISO code (eng, spa, fra, …) | Language code for transcription.
font_name | string | | Montserrat | any Google Font name (e.g. Poppins, Bebas Neue, Oswald, Inter, Roboto) | Font from fonts.google.com.
font_size | integer | | 100 | 20–150 | Font size in pixels (TikTok style uses larger text).
font_weight | string | | bold | normal, bold, black | Font weight.
font_color | string | | white | white, black, red, green, blue, yellow, orange, purple, pink, brown, gray, cyan, magenta | Subtitle text color for non-active words.
highlight_color | string | | purple | same 13 colors as font_color | Color for the word currently being spoken (karaoke-style highlight).
stroke_width | integer | | 3 | 0–10 | Text stroke/outline width in pixels (0 = no stroke).
stroke_color | string | | black | same 13 colors as font_color | Text stroke/outline color.
background_color | string | | none | the 13 colors above plus none, transparent | Background color behind text.
background_opacity | number | | 0 | 0.0–1.0 | Background opacity (0 = transparent, 1 = opaque).
position | string | | bottom | top, center, bottom | Vertical position of subtitles.
y_offset | integer | | 75 | -200–200 | Vertical offset in pixels (positive = down, negative = up).
words_per_subtitle | integer | | 3 | 1–12 | Max words per subtitle segment (1 = single word, 8–12 = full sentences).
enable_animation | boolean | | true | true / false | Bounce-style entrance animation for subtitles.

Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder, no real generation). This model has no format/size mapping (format_field is empty).

Limits: video_url max file size 100 MB. Accepted input formats: mp4, mov, webm, m4v, gif. Cost is metered at 6 cr per minute of video. Transcription is via ElevenLabs speech-to-text.
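As a rough illustration of the schema above, the following sketch builds a captions_auto request body and checks the documented ranges before anything is sent. The helper name build_captions_payload and its error messages are hypothetical; only the parameter names, ranges, and enums come from the table. Options it does not recognise (language, font_name, enable_animation, mock) are passed through unchecked.

```python
# Hypothetical client-side validator; parameter names and ranges are taken
# from the captions_auto table, everything else is illustrative.
COLORS = {
    "white", "black", "red", "green", "blue", "yellow", "orange",
    "purple", "pink", "brown", "gray", "cyan", "magenta",
}
INT_RANGES = {
    "font_size": (20, 150),
    "stroke_width": (0, 10),
    "y_offset": (-200, 200),
    "words_per_subtitle": (1, 12),
}
ENUMS = {
    "font_weight": {"normal", "bold", "black"},
    "position": {"top", "center", "bottom"},
    "font_color": COLORS,
    "highlight_color": COLORS,
    "stroke_color": COLORS,
    "background_color": COLORS | {"none", "transparent"},
}


def build_captions_payload(video_url, out, **options):
    """Return a request body dict, raising ValueError on out-of-range options."""
    payload = {"video_url": video_url, "out": out}
    for name, value in options.items():
        if name in INT_RANGES:
            low, high = INT_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be in {low}..{high}, got {value}")
        elif name in ENUMS and value not in ENUMS[name]:
            raise ValueError(f"{name} must be one of {sorted(ENUMS[name])}")
        elif name == "background_opacity" and not 0.0 <= value <= 1.0:
            raise ValueError("background_opacity must be in 0.0..1.0")
        payload[name] = value
    return payload


body = build_captions_payload(
    "https://example.com/clip.mp4", "out/captioned.mp4",
    font_size=90, highlight_color="yellow", words_per_subtitle=1, mock=True,
)
```

The resulting dict would be the JSON body of POST /v1/jobs/captions_auto; authentication and the base URL are outside what this page documents, so they are left out.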

Full Video Assembly video_assemble_full

Category: video_process
Mode: sync
Timeout: 10m
Cost: free (cost_per_unit: 0)
MCP action: video(assemble) (worker video.ts → kind video_assemble_full)

One-call complete assembly: concatenates clips with visual transitions (xfade), mixes audio layers (VO / music / ambient SFX / transition SFX / intro SFX / end SFX), and applies intro fade + ending preset. Replaces assemble_clips + audio_mix in a single job. Implemented by VideoAssembleFull (video_assemble_full.go), dispatched by execVideoAssembleFull. Pre-validates that VO fits inside the assembled duration (hard error if VO is >0.5s longer). When the VO and video durations diverge by more than 3s, the job result gains a warnings array flagging the mismatch.
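The two duration rules in the paragraph above can be sketched as a pre-flight check. This is a minimal approximation, not the handler's code: the function name check_vo_fit is hypothetical, and it ignores vo_offset_sec because the text does not say whether the offset counts toward the VO length.

```python
# Sketch of the documented VO checks: hard error above +0.5 s, warning when
# the two durations differ by more than 3 s. Thresholds come from the text.
VO_OVERRUN_TOLERANCE = 0.5
MISMATCH_WARNING = 3.0


def check_vo_fit(vo_duration, video_duration):
    """Return a list of warnings; raise ValueError if the VO overruns the video."""
    if vo_duration > video_duration + VO_OVERRUN_TOLERANCE:
        raise ValueError(
            f"VO ({vo_duration:.2f}s) is longer than the assembled video "
            f"({video_duration:.2f}s) by more than {VO_OVERRUN_TOLERANCE}s"
        )
    warnings = []
    if abs(video_duration - vo_duration) > MISMATCH_WARNING:
        warnings.append(
            f"VO and video durations differ by {abs(video_duration - vo_duration):.2f}s"
        )
    return warnings
```

Because an overrun of more than 0.5 s already fails, the warning in practice only fires when the VO is much shorter than the video.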

Parameters (from input_schema, cross-checked against executor.go/video_assemble_full.go):

Param | Type | Req | Default | Notes
clips | array<object> | yes | | Ordered. Each {path, transition, transition_sfx}.
clips[].path | string | yes | | Clip path.
clips[].transition | string | no | cut | Visual transition INTO this clip. YAML enum: cut, dissolve, fadeblack, fadewhite, wipeleft, wiperight, smoothleft, blur, flash, distance, circlecrop. Caveat: the underlying AssembleClips only implements cut→concat, dissolve→xfade fade, wipe→wipeleft; every other value falls through to a plain fade xfade. So fadeblack/blur/flash/etc. currently render as a crossfade, not their named effect.
clips[].transition_sfx | string | no | | SFX path played centered on this cut (-0.15s lead, volume 0.7).
out | string | yes | | Output video path.
xfade_duration | number | no | 0.2 | Visual transition duration (s).
intro | object | no | | {fade_in, fade_in_duration, sfx}.
intro.fade_in | bool | no | false | Hard start unless true.
intro.fade_in_duration | number | no | 0.3 |
intro.sfx | string | no | | Intro whoosh (volume 0.7).
vo | string | no | | Voiceover path (0 dB by default).
vo_level | number | no | 0 | VO volume (dB).
vo_offset_sec | number | no | 0 (min 0) | Delay before VO starts, used to align speech with a later clip. Negative is rejected.
music | string | no | | Music bed path.
music_level | number | no | -24 | Music volume (dB); handler defaults to −24 if 0.
sfx_ambient | string | no | | Ambient SFX path.
sfx_level | number | no | -18 | Handler defaults to −18 if 0.
ending | object | no | | {type, end_sfx, video_fade, music_fade_start, end_sfx_start, black_tail}.
ending.type | string | no | social | Preset enum: social / cinematic / loop. social: fade 0.3s, music fade −0.5s, end_sfx −0.3s. cinematic: fade 1.0s, music −2.0s, sfx −1.0s, 0.5s black tail. loop: no fades/tail. Per-field overrides win over the preset.

Undocumented input: the handler also reads a top-level ending_type string (executor.go:358) before merging ending.type. Not declared in the YAML; nested ending.type overrides it. Prefer the documented nested form.
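The precedence described above (top-level ending_type, then nested ending.type, then per-field overrides) could look roughly like the sketch below. Several things here are assumptions rather than documented behaviour: the mapping of the preset numbers onto the field names video_fade, music_fade_start, end_sfx_start and black_tail, the 0.0 black tail for social, and the use of None and 0.0 to represent "no fades/tail" for loop. The function name resolve_ending is hypothetical.

```python
# Assumed preset table: numbers come from the ending.type notes, the field
# mapping and the "loop" representation are guesses made for illustration.
ENDING_PRESETS = {
    "social": {"video_fade": 0.3, "music_fade_start": -0.5,
               "end_sfx_start": -0.3, "black_tail": 0.0},
    "cinematic": {"video_fade": 1.0, "music_fade_start": -2.0,
                  "end_sfx_start": -1.0, "black_tail": 0.5},
    "loop": {"video_fade": 0.0, "music_fade_start": None,
             "end_sfx_start": None, "black_tail": 0.0},
}


def resolve_ending(inputs):
    """Merge preset and overrides: ending.type beats ending_type, fields beat both."""
    ending = inputs.get("ending") or {}
    ending_type = ending.get("type") or inputs.get("ending_type") or "social"
    if ending_type not in ENDING_PRESETS:
        raise ValueError(f"unknown ending type {ending_type!r}")
    resolved = dict(ENDING_PRESETS[ending_type])
    for field in resolved:
        if field in ending:  # per-field override wins over the preset
            resolved[field] = ending[field]
    resolved["type"] = ending_type
    return resolved
```

For example, resolve_ending({"ending_type": "loop", "ending": {"type": "cinematic", "video_fade": 0.8}}) picks the cinematic preset and then replaces only video_fade.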

Output: { ok, outputs:{video, local_path}, metrics:{num_clips, video_duration, output_duration, ending_type, video_fade, music_fade_start, black_tail, xfade_duration, audio_layers}, warnings[] }. The warnings array is present when the VO/video durations diverge by more than 3s.


Assemble Clips assemble_clips

Category: video_process
Mode: sync
Timeout: 5m
Cost: free (cost_per_unit: 0)
MCP action: none (internal/REST only). No MCP action maps here; video(assemble) routes to video_assemble_full. Reachable only via direct POST /v1/jobs/assemble_clips or as a building block of video_assemble_full. (proxy.ts maps it to video/assemble for error-hint purposes only.)

Concatenate clips in array order. If all transitions are cut/hold/match-cut, uses the concat demuxer with stream copy (fast, no re-encode); if any dissolve/wipe is present, re-encodes via the xfade filter (libx264, CRF 19). Clips lacking an audio track get a silent track injected first (ensureAudioTrack). Implemented by AssembleClips (assemble_clips.go), dispatched by execAssembleClips.

Parameters (from input_schema, cross-checked against assemble_clips.go):

Param | Type | Req | Default | Notes
clips | array<object> | yes | | Ordered {path, trans_in, duration}.
clips[].path | string | yes | | Clip path. Rejected if it contains ', newline, or CR (concat-list injection guard).
clips[].trans_in | string | no | cut | Transition INTO this clip (first clip's is ignored). YAML enum: cut, dissolve, wipe, match-cut, j-cut, l-cut, hold. Handler: cut/hold/match-cut → stream-copy concat; dissolve → xfade fade; wipe → xfade wipeleft; any other value (incl. j-cut/l-cut) → default fade xfade (plain crossfade, no audio lead/lag).
clips[].duration | number | no | | Clip duration override in seconds (0 = full clip). Handler reads m["duration"].
out | string | yes | | Output video path.
xfade_duration | number | no | 0.1 | Dissolve/wipe duration (s); handler clamps ≤0 to 0.1.

Duration caveat (documented in YAML): each dissolve/wipe shortens total output by xfade_duration. Plan VO length against the assembled duration, not the raw clip sum.
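The arithmetic behind this caveat can be sketched as follows. The helper plan_assembly and the per-clip seconds key are hypothetical (seconds stands for each clip's real length, which the caller would have to probe); the sketch also assumes that every transition other than cut/hold/match-cut costs one xfade_duration, and that cuts cost nothing.

```python
# Estimate the method and output length for an assemble_clips request.
STREAM_COPY_TRANSITIONS = {"cut", "hold", "match-cut"}


def plan_assembly(clips, xfade_duration=0.1):
    """Return (method, estimated_duration_sec) for an ordered clip list."""
    if xfade_duration <= 0:
        xfade_duration = 0.1  # mirrors the documented clamp
    # The first clip's trans_in is ignored, so only clips[1:] can crossfade.
    crossfades = sum(
        1 for clip in clips[1:]
        if clip.get("trans_in", "cut") not in STREAM_COPY_TRANSITIONS
    )
    total = sum(clip["seconds"] for clip in clips) - crossfades * xfade_duration
    method = "xfade_filter" if crossfades else "concat_demuxer"
    return method, total


clips = [
    {"path": "a.mp4", "seconds": 4.0},
    {"path": "b.mp4", "seconds": 4.0, "trans_in": "dissolve"},
    {"path": "c.mp4", "seconds": 4.0, "trans_in": "wipe"},
]
method, duration = plan_assembly(clips, xfade_duration=0.5)
```

With three 4-second clips and two 0.5-second crossfades the estimate is 11 seconds rather than 12, which is the length a voiceover should be planned against.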

Output: { ok, outputs:{video, local_path}, metrics:{num_clips, total_duration_sec, transitions_applied, method:"concat_demuxer"|"xfade_filter", ...} }.


Video + Audio Mix video_audio_mix

Category: video_process
Mode: sync
Timeout: 5m
Cost: free (cost_per_unit: 0)
MCP action: video(mix_audio) (worker video.ts → kind video_audio_mix). MCP exposes only tracks: string[], which the worker expands into layers: the FIRST track becomes the VO (level: 0, label: "vo"), the rest are mixed at -24 dB (label: "track2"…), all with start_sec: 0. Custom per-layer level/start_sec/label and keep_original_audio are reachable via direct REST /v1/jobs/video_audio_mix.
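The tracks-to-layers expansion described for the MCP action can be illustrated with a short sketch. It is written in Python for readability and is not the worker's code (the worker lives in video.ts); the function name expand_tracks is hypothetical.

```python
# Expand an MCP-style tracks list into the layers array that
# video_audio_mix expects, following the rule described above.
def expand_tracks(tracks):
    """First track becomes the VO at 0 dB; the rest are mixed at -24 dB."""
    layers = []
    for index, path in enumerate(tracks):
        is_vo = index == 0
        layers.append({
            "path": path,
            "level": 0 if is_vo else -24,
            "start_sec": 0,
            "label": "vo" if is_vo else f"track{index + 1}",
        })
    return layers


layers = expand_tracks(["vo.wav", "music.mp3", "ambience.wav"])
```

Since the first track is always labelled "vo", it is also the one subject to the VO length check, so track order matters when calling through MCP.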

Overlay audio layers (VO, music, SFX) onto a video with per-layer dB level and start offset, then amix them. Video stream is copied (-c:v copy); audio re-encoded AAC 192k; output trimmed to the video length. Implemented by AudioMix (audio_mix.go), dispatched by execAudioMix.

Parameters (from input_schema, cross-checked against audio_mix.go):

Param | Type | Req | Default | Notes
video | string | yes | | Input video. (MCP mix_audio maps video_url → video.)
out | string | yes | | Output video path.
layers | array<object> | yes | | Each {path, level, start_sec, label}.
layers[].path | string | yes | | Audio path.
layers[].level | number | no | 0 | dB (0 = original, −24 = background). Converted to linear via exact 10^(dB/20).
layers[].start_sec | number | no | 0 | Offset from video start; >0 adds adelay.
layers[].label | string | yes | | Reporting label. Semantically special: label:"vo" triggers a hard error if VO is longer than video (+0.5s) and a tight-timing warning within 0.5s; label:"music" only warns when it exceeds video.
keep_original_audio | bool | no | false | If true, mixes the video's existing [0:a] in too.
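For a sense of scale, the dB-to-linear conversion named in the layers[].level row works out as below. The function name db_to_linear is hypothetical; the formula is the one the table states.

```python
# Convert a layer level in dB to the linear amplitude factor 10^(dB/20).
def db_to_linear(level_db):
    return 10 ** (level_db / 20)


vo_gain = db_to_linear(0)       # unchanged amplitude
music_gain = db_to_linear(-24)  # roughly 0.063 of the original amplitude
```

So the −24 dB used for background tracks leaves a little over 6% of the original amplitude, and every −20 dB divides the amplitude by ten.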

Output: { ok, outputs:{video, local_path}, metrics:{video_duration_sec, output_duration_sec, layers[], keep_original_audio, warnings[]} }.


Audio Mix audio_mix

Category: video_process
Mode: sync
Timeout: 5m
Cost: free (cost_per_unit: 0)
MCP action: none (deprecated alias). Registered in executor.go as "audio_mix": e.execAudioMix with the comment "deprecated name, alias for video_audio_mix". Identical YAML and identical handler to video_audio_mix. Not present in any worker action map; reachable only via direct POST /v1/jobs/audio_mix. Prefer video_audio_mix.

Functionally identical to video_audio_mix above — same AudioMix (audio_mix.go) handler, same parameters (video, out, layers[]{path,level,start_sec,label}, keep_original_audio), same output. Kept for backward compatibility of the old name only. See video_audio_mix for the full parameter table and the label:"vo"/"music" validation behaviour.

Doc note: two YAML files (audio_mix.yaml, video_audio_mix.yaml) document a single implementation. Despite the name, this operates on a video input (requires video + layers), not audio-only mixing — audio-only mixing is the separate audio_only_mix model.


Structural Export structural_export

Category: video_process
Mode: sync
Timeout: 5m
Cost: free (cost_per_unit: 0)
MCP action: none (internal/pipeline only). No worker action maps here; reachable via direct POST /v1/jobs/structural_export or as a final encode step in the pipeline.

Final platform-specific structural encode — scale + letterbox-pad to target resolution and re-encode (libx264 -preset slow, +faststart). No creative/color filters. Apply after upscale and caption burn-in. Implemented by StructuralExport (structural_export.go), dispatched by execStructuralExport.

Parameters (from input_schema, cross-checked against structural_export.go):

Param | Type | Req | Default | Notes
in | string | yes | | Input video path. Handler reads inputs["in"].
out | string | yes | | Output video path.
platform | string | yes (handler errors if empty) | shorts (YAML default) | Preset enum. tiktok/reels/shorts → 1080×1920, 30fps, CRF 19, AAC 192k. youtube-long → 1920×1080, 24fps, CRF 18, AAC 192k. ads → 1080×1920, 30fps, CRF 17, AAC 256k. Unknown value → error listing valid platforms.
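The platform presets in the table can be pictured as a lookup, sketched below. The preset numbers come from the table; the helper names resolve_platform and letterbox_filter are hypothetical, and the filter string is a common ffmpeg scale-then-pad idiom assumed for illustration, since the handler's actual filter graph is not shown here.

```python
# Preset lookup for structural_export; values copied from the platform row.
_VERTICAL = {"width": 1080, "height": 1920, "fps": 30, "crf": 19, "audio_bitrate": "192k"}
PLATFORM_PRESETS = {
    "tiktok": _VERTICAL,
    "reels": _VERTICAL,
    "shorts": _VERTICAL,
    "youtube-long": {"width": 1920, "height": 1080, "fps": 24, "crf": 18, "audio_bitrate": "192k"},
    "ads": {"width": 1080, "height": 1920, "fps": 30, "crf": 17, "audio_bitrate": "256k"},
}


def resolve_platform(platform):
    """Return the preset dict, or raise ValueError listing valid platforms."""
    if not platform:
        raise ValueError("platform is required")
    if platform not in PLATFORM_PRESETS:
        valid = ", ".join(sorted(PLATFORM_PRESETS))
        raise ValueError(f"unknown platform {platform!r}; valid: {valid}")
    return PLATFORM_PRESETS[platform]


def letterbox_filter(preset):
    """Assumed -vf value: fit inside the target frame, then pad to centre it."""
    w, h = preset["width"], preset["height"]
    return (
        f"scale={w}:{h}:force_original_aspect_ratio=decrease,"
        f"pad={w}:{h}:(ow-iw)/2:(oh-ih)/2"
    )


vf = letterbox_filter(resolve_platform("youtube-long"))
```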

Output: { ok, outputs:{video, local_path}, metrics:{platform, resolution, fps, crf, total_duration_sec} }.


Highlight Rolloff highlight_rolloff

Category: video_process
Mode: sync
Timeout: 5m
Cost: free (cost_per_unit: 0)
MCP action: none (internal/QA-pipeline only). No worker action maps here; reachable via direct POST /v1/jobs/highlight_rolloff or the QA/fix pipeline. Intended to run only when overexposure_check fails.

Surgical overexposure fix: compresses highlights via a fixed curves filter (all='0/0 0.85/0.85 1/0.92' — values above 85% rolled off to max 92%), audio stream-copied. After encoding it automatically re-runs the overexposure check (3% clipped threshold, 2 fps sampling) and returns the post-fix verdict. This is the only sanctioned creative color operation in the pipeline. Implemented by HighlightRolloff (highlight_rolloff.go), dispatched by execHighlightRolloff.

Parameters (from input_schema, cross-checked against highlight_rolloff.go):

Param | Type | Req | Default | Notes
in | string | yes | | Input video path. Handler reads inputs["in"].
out | string | yes | | Output video path.

No tunable parameters — the curve and the post-check thresholds are hardcoded.
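To make the hardcoded curve concrete, the sketch below maps a normalised pixel value through the three control points 0/0, 0.85/0.85 and 1/0.92. It interpolates linearly between the points, which is an approximation: ffmpeg's curves filter fits a smooth curve through them, so real mid-segment values will differ slightly. The function name rolloff is hypothetical.

```python
# Piecewise-linear approximation of curves all='0/0 0.85/0.85 1/0.92'.
CURVE_POINTS = [(0.0, 0.0), (0.85, 0.85), (1.0, 0.92)]


def rolloff(value):
    """Map a 0..1 input level to its approximate output level."""
    value = min(max(value, 0.0), 1.0)
    for (x0, y0), (x1, y1) in zip(CURVE_POINTS, CURVE_POINTS[1:]):
        if value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    return CURVE_POINTS[-1][1]
```

Inputs up to 0.85 pass through essentially unchanged, while full white (1.0) lands at about 0.92, which is the compression the description refers to.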

Output: { ok, outputs:{video, local_path}, metrics:{filter, total_duration_sec, post_check, post_verdict} }. Per the YAML guidance, if the source still exceeds 3% clipping after rolloff the source clips are bad and the pipeline should block to Visual Prompting — this routing is pipeline policy, the handler itself only surfaces post_verdict.

Sync Lipsync v3 lipsync_v3

sync-3, Sync.so's most powerful lipsync model, syncs mouth movement to an audio track on a talking-head video using native visual intelligence.

Call it via: video tool, action: "lipsync" (MCP) · raw: POST /v1/jobs/lipsync_v3

Cost: 1600 cr per minute of output
Mode / timeout: webhook / 15m (from our YAML)

Parameters — the model's input schema:

Param | Type | Required | Default | Allowed / range | Description
video_url | string | | | | URL of the input video (face visible).
audio_url | string | | | | URL of the input audio.
sync_mode | string (enum) | | cut_off (model); our video(lipsync) sends loop unless you pass one | cut_off, loop, bounce, silence, remap | How to handle audio/video duration mismatch. cut_off trims to the shorter input (drops the tail of longer audio); loop/bounce repeat the video (never drops speech); silence pads with silence; remap speed-adjusts.
options | object | | | nested Sync3GenerationOptions | Additional Sync.so generation options (advanced). Fields: sync_mode (overrides top-level), model_mode (lips/face/head/lipsync/emotion/talking_head), prompt (emotion: happy/sad/angry/disgusted/surprised/neutral), temperature (0–1, ignored by sync-3), active_speaker_detection (object, for multi-person videos), occlusion_detection_enabled (bool, ignored by sync-3).

Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder). No format mapping applies (our format_field is empty; sync-3 has no size/resolution field).

Limits:

  • Accepted video formats: mp4, mov, webm, m4v, gif
  • Accepted audio formats: mp3, ogg, wav, m4a, aac
  • Billing is per minute of output video at 1600 cr/min (no published hard cap on duration/resolution/file size).
