Video processing & assembly

Auto-captions and lipsync plus local ffmpeg pipelines (free, our implementation).

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

Auto Subtitles `captions_auto`

Automatically transcribe a video's audio and burn in karaoke-style subtitles with word-level highlighting, customizable Google Fonts, colors, and animation.

Call it via — video tool, action: "captions" (MCP) · raw: POST /v1/jobs/captions_auto


Cost	6 cr per minute of video
Mode / timeout	webhook / 10m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`video_url`	string	✓	—	—	URL of the video file to add automatic subtitles to (max 100 MB).
`language`	string		`en`	2-letter code (`en`, `es`, `fr`, `de`, `it`, `pt`, `nl`, `ja`, `zh`, `ko`, …) or 3-letter ISO code (`eng`, `spa`, `fra`, …)	Language code for transcription.
`font_name`	string		`Montserrat`	any Google Font name (e.g. `Poppins`, `Bebas Neue`, `Oswald`, `Inter`, `Roboto`)	Font from fonts.google.com.
`font_size`	integer		`100`	20–150	Font size in pixels (TikTok style uses larger text).
`font_weight`	string		`bold`	`normal`, `bold`, `black`	Font weight.
`font_color`	string		`white`	`white`, `black`, `red`, `green`, `blue`, `yellow`, `orange`, `purple`, `pink`, `brown`, `gray`, `cyan`, `magenta`	Subtitle text color for non-active words.
`highlight_color`	string		`purple`	same 13 colors as `font_color`	Color for the currently speaking word (karaoke-style highlight).
`stroke_width`	integer		`3`	0–10	Text stroke/outline width in pixels (0 = no stroke).
`stroke_color`	string		`black`	same 13 colors as `font_color`	Text stroke/outline color.
`background_color`	string		`none`	the 13 colors above plus `none`, `transparent`	Background color behind text.
`background_opacity`	number		`0`	0.0–1.0	Background opacity (0 = transparent, 1 = opaque).
`position`	string		`bottom`	`top`, `center`, `bottom`	Vertical position of subtitles.
`y_offset`	integer		`75`	-200–200	Vertical offset in pixels (positive = down, negative = up).
`words_per_subtitle`	integer		`3`	1–12	Max words per subtitle segment (1 = single word, 8–12 = full sentences).
`enable_animation`	boolean		`true`	true / false	Bounce-style entrance animation for subtitles.

Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder, no real generation). This model has no format/size mapping (format_field is empty).

Limits — video_url max file size 100 MB. Accepted input formats: mp4, mov, webm, m4v, gif. Cost is metered at 6 cr per minute of video. Transcription is via ElevenLabs speech-to-text.

Full Video Assembly `video_assemble_full`


Category	video_process
Mode	sync
Timeout	10m
Cost	free (`cost_per_unit: 0`)
MCP action	`video(assemble)` (worker `video.ts` → kind `video_assemble_full`)

One-call complete assembly: concatenates clips with visual transitions (xfade), mixes audio layers (VO / music / ambient SFX / transition SFX / intro SFX / end SFX), and applies intro fade + ending preset. Replaces assemble_clips + audio_mix in a single job. Implemented by VideoAssembleFull (video_assemble_full.go), dispatched by execVideoAssembleFull. Pre-validates that VO fits inside the assembled duration (hard error if VO is >0.5s longer). When the VO and video durations diverge by more than 3s, the job result gains a warnings array flagging the mismatch.

Parameters (from input_schema, cross-checked against executor.go/video_assemble_full.go):

Param	Type	Req	Default	Notes
`clips`	array<object>	yes	—	Ordered. Each `{path, transition, transition_sfx}`.
`clips[].path`	string	yes	—	Clip path.
`clips[].transition`	string	no	`cut`	Visual transition INTO this clip. YAML enum: cut, dissolve, fadeblack, fadewhite, wipeleft, wiperight, smoothleft, blur, flash, distance, circlecrop. Caveat: the underlying `AssembleClips` only implements `cut`→concat, `dissolve`→xfade fade, `wipe`→wipeleft; every other value falls through to a plain `fade` xfade. So fadeblack/blur/flash/etc. currently render as a crossfade, not their named effect.
`clips[].transition_sfx`	string	no	—	SFX path played centered on this cut (`-0.15s` lead, volume 0.7).
`out`	string	yes	—	Output video path.
`xfade_duration`	number	no	`0.2`	Visual transition duration (s).
`intro`	object	no	—	`{fade_in, fade_in_duration, sfx}`.
`intro.fade_in`	bool	no	`false`	Hard start unless true.
`intro.fade_in_duration`	number	no	`0.3`
`intro.sfx`	string	no	—	Intro whoosh (volume 0.7).
`vo`	string	no	—	Voiceover path (0 dB by default).
`vo_level`	number	no	`0`	VO volume (dB).
`vo_offset_sec`	number	no	`0` (min 0)	Delay before VO starts — align speech with a later clip. Negative is rejected.
`music`	string	no	—	Music bed path.
`music_level`	number	no	`-24`	Music volume (dB); handler defaults to −24 if 0.
`sfx_ambient`	string	no	—	Ambient SFX path.
`sfx_level`	number	no	`-18`	Handler defaults to −18 if 0.
`ending`	object	no	—	`{type, end_sfx, video_fade, music_fade_start, end_sfx_start, black_tail}`.
`ending.type`	string	no	`social`	Preset enum: social / cinematic / loop. social: fade 0.3s, music fade −0.5s, end_sfx −0.3s. cinematic: fade 1.0s, music −2.0s, sfx −1.0s, 0.5s black tail. loop: no fades/tail. Per-field overrides win over the preset.

Undocumented input: the handler also reads a top-level ending_type string (executor.go:358) before merging ending.type. Not declared in the YAML; nested ending.type overrides it. Prefer the documented nested form.

Output: { ok, outputs:{video, local_path}, metrics:{num_clips, video_duration, output_duration, ending_type, video_fade, music_fade_start, black_tail, xfade_duration, audio_layers}, warnings[] }. The warnings array is present when the VO/video durations diverge by more than 3s.

Assemble Clips `assemble_clips`


Category	video_process
Mode	sync
Timeout	5m
Cost	free (`cost_per_unit: 0`)
MCP action	none — internal/REST only. No MCP action maps here; `video(assemble)` routes to `video_assemble_full`. Reachable only via direct `POST /v1/jobs/assemble_clips` or as a building block of `video_assemble_full`. (proxy.ts maps it to `video/assemble` for error-hint purposes only.)

Concatenate clips in array order. If all transitions are cut/hold/match-cut, uses the concat demuxer with stream copy (fast, no re-encode); if any dissolve/wipe is present, re-encodes via the xfade filter (libx264, CRF 19). Clips lacking an audio track get a silent track injected first (ensureAudioTrack). Implemented by AssembleClips (assemble_clips.go), dispatched by execAssembleClips.

Parameters (from input_schema, cross-checked against assemble_clips.go):

Param	Type	Req	Default	Notes
`clips`	array<object>	yes	—	Ordered `{path, trans_in, duration}`.
`clips[].path`	string	yes	—	Clip path. Rejected if it contains `'`, newline, or CR (concat-list injection guard).
`clips[].trans_in`	string	no	`cut`	Transition INTO this clip (first clip's is ignored). YAML enum: cut, dissolve, wipe, match-cut, j-cut, l-cut, hold. Handler: cut/hold/match-cut → stream-copy concat; dissolve → xfade fade; wipe → xfade wipeleft; any other value (incl. j-cut/l-cut) → default `fade` xfade (plain crossfade, no audio lead/lag).
`clips[].duration`	number	no	—	Clip duration override in seconds (0 = full clip). Handler reads `m["duration"]`.
`out`	string	yes	—	Output video path.
`xfade_duration`	number	no	`0.1`	Dissolve/wipe duration (s); handler clamps ≤0 to 0.1.

Duration caveat (documented in YAML): each dissolve/wipe shortens total output by xfade_duration. Plan VO length against the assembled duration, not the raw clip sum.

Output: { ok, outputs:{video, local_path}, metrics:{num_clips, total_duration_sec, transitions_applied, method:"concat_demuxer"|"xfade_filter", ...} }.

Video + Audio Mix `video_audio_mix`


Category	video_process
Mode	sync
Timeout	5m
Cost	free (`cost_per_unit: 0`)
MCP action	`video(mix_audio)` (worker `video.ts` → kind `video_audio_mix`). MCP exposes only `tracks: string[]`, which the worker expands into `layers`: the FIRST track becomes the VO (`level: 0`, `label: "vo"`), the rest are mixed at `-24 dB` (`label: "track2"…`), all with `start_sec: 0`. Custom per-layer `level`/`start_sec`/`label` and `keep_original_audio` are reachable via direct REST `/v1/jobs/video_audio_mix`.

Overlay audio layers (VO, music, SFX) onto a video with per-layer dB level and start offset, then amix them. Video stream is copied (-c:v copy); audio re-encoded AAC 192k; output trimmed to the video length. Implemented by AudioMix (audio_mix.go), dispatched by execAudioMix.

Parameters (from input_schema, cross-checked against audio_mix.go):

Param	Type	Req	Default	Notes
`video`	string	yes	—	Input video. (MCP `mix_audio` maps `video_url` → `video`.)
`out`	string	yes	—	Output video path.
`layers`	array<object>	yes	—	Each `{path, level, start_sec, label}`.
`layers[].path`	string	yes	—	Audio path.
`layers[].level`	number	no	`0`	dB (0 = original, −24 = background). Converted to linear via exact `10^(dB/20)`.
`layers[].start_sec`	number	no	`0`	Offset from video start; >0 adds `adelay`.
`layers[].label`	string	yes	—	Reporting label. Semantically special: `label:"vo"` triggers a hard error if VO is longer than video (+0.5s) and a tight-timing warning within 0.5s; `label:"music"` only warns when it exceeds video.
`keep_original_audio`	bool	no	`false`	If true, mixes the video's existing `[0:a]` in too.

Output: { ok, outputs:{video, local_path}, metrics:{video_duration_sec, output_duration_sec, layers[], keep_original_audio, warnings[]} }.

Audio Mix `audio_mix`


Category	video_process
Mode	sync
Timeout	5m
Cost	free (`cost_per_unit: 0`)
MCP action	none — deprecated alias. Registered in `executor.go` as `"audio_mix": e.execAudioMix` with the comment "deprecated name, alias for video_audio_mix". Identical YAML and identical handler to `video_audio_mix`. Not present in any worker action map; reachable only via direct `POST /v1/jobs/audio_mix`. Prefer video_audio_mix.

Functionally identical to video_audio_mix above — same AudioMix (audio_mix.go) handler, same parameters (video, out, layers[]{path,level,start_sec,label}, keep_original_audio), same output. Kept for backward compatibility of the old name only. See video_audio_mix for the full parameter table and the label:"vo"/"music" validation behaviour.

Doc note: two YAML files (audio_mix.yaml, video_audio_mix.yaml) document a single implementation. Despite the name, this operates on a video input (requires video + layers), not audio-only mixing — audio-only mixing is the separate audio_only_mix model.

Structural Export `structural_export`


Category	video_process
Mode	sync
Timeout	5m
Cost	free (`cost_per_unit: 0`)
MCP action	none — internal/pipeline only. No worker action maps here; reachable via direct `POST /v1/jobs/structural_export` or as a final encode step in the pipeline.

Final platform-specific structural encode — scale + letterbox-pad to target resolution and re-encode (libx264 -preset slow, +faststart). No creative/color filters. Apply after upscale and caption burn-in. Implemented by StructuralExport (structural_export.go), dispatched by execStructuralExport.

Parameters (from input_schema, cross-checked against structural_export.go):

Param	Type	Req	Default	Notes
`in`	string	yes	—	Input video path. Handler reads `inputs["in"]`.
`out`	string	yes	—	Output video path.
`platform`	string	yes (handler errors if empty)	YAML default `shorts`	Preset enum. tiktok/reels/shorts → 1080×1920, 30fps, CRF 19, AAC 192k. youtube-long → 1920×1080, 24fps, CRF 18, AAC 192k. ads → 1080×1920, 30fps, CRF 17, AAC 256k. Unknown value → error listing valid platforms.

Output: { ok, outputs:{video, local_path}, metrics:{platform, resolution, fps, crf, total_duration_sec} }.

Highlight Rolloff `highlight_rolloff`


Category	video_process
Mode	sync
Timeout	5m
Cost	free (`cost_per_unit: 0`)
MCP action	none — internal/QA-pipeline only. No worker action maps here; reachable via direct `POST /v1/jobs/highlight_rolloff` or the QA/fix pipeline. Intended to run only when `overexposure_check` fails.

Surgical overexposure fix: compresses highlights via a fixed curves filter (all='0/0 0.85/0.85 1/0.92' — values above 85% rolled off to max 92%), audio stream-copied. After encoding it automatically re-runs the overexposure check (3% clipped threshold, 2 fps sampling) and returns the post-fix verdict. This is the only sanctioned creative color operation in the pipeline. Implemented by HighlightRolloff (highlight_rolloff.go), dispatched by execHighlightRolloff.

Parameters (from input_schema, cross-checked against highlight_rolloff.go):

Param	Type	Req	Default	Notes
`in`	string	yes	—	Input video path. Handler reads `inputs["in"]`.
`out`	string	yes	—	Output video path.

No tunable parameters — the curve and the post-check thresholds are hardcoded.

Output: { ok, outputs:{video, local_path}, metrics:{filter, total_duration_sec, post_check, post_verdict} }. Per the YAML guidance, if the source still exceeds 3% clipping after rolloff the source clips are bad and the pipeline should block to Visual Prompting — this routing is pipeline policy, the handler itself only surfaces post_verdict.

Sync Lipsync v3 `lipsync_v3`

sync-3, Sync.so's most powerful lipsync model, syncs mouth movement to an audio track on a talking-head video using native visual intelligence.

Call it via — video tool, action: "lipsync" (MCP) · raw: POST /v1/jobs/lipsync_v3


Cost	1600 cr per minute of output
Mode / timeout	webhook / 15m (from our YAML)

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`video_url`	string	✓	—	—	URL of the input video (face visible)
`audio_url`	string	✓	—	—	URL of the input audio
`sync_mode`	string (enum)		`cut_off` (model); our `video(lipsync)` sends `loop` unless you pass one	`cut_off`, `loop`, `bounce`, `silence`, `remap`	How to handle audio/video duration mismatch. `cut_off` trims to the shorter input (drops the tail of longer audio); `loop`/`bounce` repeat the video (never drops speech); `silence` pads with silence; `remap` speed-adjusts
`options`	object		—	nested `Sync3GenerationOptions`	Additional Sync.so generation options (advanced). Fields: `sync_mode` (overrides top-level), `model_mode` (`lips`/`face`/`head`/`lipsync`/`emotion`/`talking_head`), `prompt` (emotion: `happy`/`sad`/`angry`/`disgusted`/`surprised`/`neutral`), `temperature` (0–1, ignored by sync-3), `active_speaker_detection` (object, for multi-person videos), `occlusion_detection_enabled` (bool, ignored by sync-3)

Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder). No format mapping applies (our format_field is empty; sync-3 has no size/resolution field).

Limits:

Accepted video formats: mp4, mov, webm, m4v, gif
Accepted audio formats: mp3, ogg, wav, m4a, aac
Billing is per minute of output video at 1600 cr/min (no published hard cap on duration/resolution/file size).

Video processing & assembly ​

Auto Subtitles captions_auto ​

Full Video Assembly video_assemble_full ​

Assemble Clips assemble_clips ​

Video + Audio Mix video_audio_mix ​

Audio Mix audio_mix ​

Structural Export structural_export ​

Highlight Rolloff highlight_rolloff ​

Sync Lipsync v3 lipsync_v3 ​