Video processing & assembly
Auto-captions and lipsync plus local ffmpeg pipelines (free, our implementation).
Generations are charged in credits (see Credits & plans). Every generation model also accepts
mock: true for a free placeholder result.
Auto Subtitles captions_auto
Automatically transcribe a video's audio and burn in karaoke-style subtitles with word-level highlighting, customizable Google Fonts, colors, and animation.
Call it via — video tool, action: "captions" (MCP) · raw: POST /v1/jobs/captions_auto
| Cost | 6 cr per minute of video |
| Mode / timeout | webhook / 10m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | ✓ | — | — | URL of the video file to add automatic subtitles to (max 100 MB). |
| language | string | | en | 2-letter code (en, es, fr, de, it, pt, nl, ja, zh, ko, …) or 3-letter ISO code (eng, spa, fra, …) | Language code for transcription. |
| font_name | string | | Montserrat | any Google Font name (e.g. Poppins, Bebas Neue, Oswald, Inter, Roboto) | Font from fonts.google.com. |
| font_size | integer | | 100 | 20–150 | Font size in pixels (TikTok style uses larger text). |
| font_weight | string | | bold | normal, bold, black | Font weight. |
| font_color | string | | white | white, black, red, green, blue, yellow, orange, purple, pink, brown, gray, cyan, magenta | Subtitle text color for non-active words. |
| highlight_color | string | | purple | same 13 colors as font_color | Color for the currently speaking word (karaoke-style highlight). |
| stroke_width | integer | | 3 | 0–10 | Text stroke/outline width in pixels (0 = no stroke). |
| stroke_color | string | | black | same 13 colors as font_color | Text stroke/outline color. |
| background_color | string | | none | the 13 colors above plus none, transparent | Background color behind text. |
| background_opacity | number | | 0 | 0.0–1.0 | Background opacity (0 = transparent, 1 = opaque). |
| position | string | | bottom | top, center, bottom | Vertical position of subtitles. |
| y_offset | integer | | 75 | -200–200 | Vertical offset in pixels (positive = down, negative = up). |
| words_per_subtitle | integer | | 3 | 1–12 | Max words per subtitle segment (1 = single word, 8–12 = full sentences). |
| enable_animation | boolean | | true | true / false | Bounce-style entrance animation for subtitles. |
Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder, no real generation). This model has no format/size mapping (format_field is empty).
Limits — video_url max file size 100 MB. Accepted input formats: mp4, mov, webm, m4v, gif. Cost is metered at 6 cr per minute of video. Transcription is via ElevenLabs speech-to-text.
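The ranges and enums in the table above can be checked client-side before a job is submitted. The following is a minimal sketch, not the service's own validation: the helper name build_captions_input is hypothetical, and placing the wrapper param out alongside the model params in one flat dict is an assumption about the request shape.

```python
# Client-side sanity check for captions_auto inputs (illustrative sketch).
# The ranges and enums are copied from the parameter table; the helper name
# and the flat dict layout are assumptions, not the documented request format.
COLORS = {
    "white", "black", "red", "green", "blue", "yellow", "orange",
    "purple", "pink", "brown", "gray", "cyan", "magenta",
}

RANGES = {
    "font_size": (20, 150),
    "stroke_width": (0, 10),
    "background_opacity": (0.0, 1.0),
    "y_offset": (-200, 200),
    "words_per_subtitle": (1, 12),
}

ENUMS = {
    "font_weight": {"normal", "bold", "black"},
    "font_color": COLORS,
    "highlight_color": COLORS,
    "stroke_color": COLORS,
    "background_color": COLORS | {"none", "transparent"},
    "position": {"top", "center", "bottom"},
}


def build_captions_input(video_url, out, **style):
    """Return an input dict, raising ValueError on out-of-range values."""
    if not video_url:
        raise ValueError("video_url is required")
    if not out:
        raise ValueError("out is required")
    for key, value in style.items():
        if key in RANGES:
            low, high = RANGES[key]
            if not low <= value <= high:
                raise ValueError(f"{key}={value} is outside {low}..{high}")
        elif key in ENUMS and value not in ENUMS[key]:
            raise ValueError(f"{key}={value!r} is not an allowed value")
    # Keys without a documented range or enum (language, font_name,
    # enable_animation, mock) pass through unchanged.
    return {"video_url": video_url, "out": out, **style}
```

For example, build_captions_input("https://example.com/clip.mp4", "out/captioned.mp4", font_size=120, highlight_color="yellow") returns a dict ready to serialize, while font_size=10 raises ValueError.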
Full Video Assembly video_assemble_full
| Category | video_process |
| Mode | sync |
| Timeout | 10m |
| Cost | free (cost_per_unit: 0) |
| MCP action | video(assemble) (worker video.ts → kind video_assemble_full) |
One-call complete assembly: concatenates clips with visual transitions (xfade), mixes audio layers (VO / music / ambient SFX / transition SFX / intro SFX / end SFX), and applies intro fade + ending preset. Replaces assemble_clips + audio_mix in a single job. Implemented by VideoAssembleFull (video_assemble_full.go), dispatched by execVideoAssembleFull. Pre-validates that VO fits inside the assembled duration (hard error if VO is >0.5s longer). When the VO and video durations diverge by more than 3s, the job result gains a warnings array flagging the mismatch.
Parameters (from input_schema, cross-checked against executor.go/video_assemble_full.go):
| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
clips | array<object> | yes | — | Ordered. Each {path, transition, transition_sfx}. |
clips[].path | string | yes | — | Clip path. |
clips[].transition | string | no | cut | Visual transition INTO this clip. YAML enum: cut, dissolve, fadeblack, fadewhite, wipeleft, wiperight, smoothleft, blur, flash, distance, circlecrop. Caveat: the underlying AssembleClips only implements cut→concat, dissolve→xfade fade, wipe→wipeleft; every other value falls through to a plain fade xfade. So fadeblack/blur/flash/etc. currently render as a crossfade, not their named effect. |
clips[].transition_sfx | string | no | — | SFX path played centered on this cut (-0.15s lead, volume 0.7). |
out | string | yes | — | Output video path. |
xfade_duration | number | no | 0.2 | Visual transition duration (s). |
intro | object | no | — | {fade_in, fade_in_duration, sfx}. |
intro.fade_in | bool | no | false | Hard start unless true. |
intro.fade_in_duration | number | no | 0.3 | |
intro.sfx | string | no | — | Intro whoosh (volume 0.7). |
vo | string | no | — | Voiceover path (0 dB by default). |
vo_level | number | no | 0 | VO volume (dB). |
vo_offset_sec | number | no | 0 (min 0) | Delay before VO starts — align speech with a later clip. Negative is rejected. |
music | string | no | — | Music bed path. |
music_level | number | no | -24 | Music volume (dB); handler defaults to −24 if 0. |
sfx_ambient | string | no | — | Ambient SFX path. |
sfx_level | number | no | -18 | Handler defaults to −18 if 0. |
ending | object | no | — | {type, end_sfx, video_fade, music_fade_start, end_sfx_start, black_tail}. |
ending.type | string | no | social | Preset enum: social / cinematic / loop. social: fade 0.3s, music fade −0.5s, end_sfx −0.3s. cinematic: fade 1.0s, music −2.0s, sfx −1.0s, 0.5s black tail. loop: no fades/tail. Per-field overrides win over the preset. |
Undocumented input: the handler also reads a top-level
ending_type string (executor.go:358) before merging ending.type. Not declared in the YAML; nested ending.type overrides it. Prefer the documented nested form.
Output: { ok, outputs:{video, local_path}, metrics:{num_clips, video_duration, output_duration, ending_type, video_fade, music_fade_start, black_tail, xfade_duration, audio_layers}, warnings[] }. The warnings array is present when the VO/video durations diverge by more than 3s.
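The preset-plus-override behaviour of ending can be pictured as a small merge. This is a sketch under stated assumptions rather than the handler's code: resolve_ending is a hypothetical name, the social black tail is assumed to be 0 because the table does not state one, loop is modelled as all zeros to represent "no fades/tail", and rejecting an unknown type is also an assumption.

```python
# Illustrative merge of ending presets with per-field overrides.
# Preset numbers come from the ending.type row; values marked "assumed"
# are not stated in the documentation.
PRESETS = {
    "social": {
        "video_fade": 0.3,
        "music_fade_start": -0.5,
        "end_sfx_start": -0.3,
        "black_tail": 0.0,  # assumed: no tail is documented for social
    },
    "cinematic": {
        "video_fade": 1.0,
        "music_fade_start": -2.0,
        "end_sfx_start": -1.0,
        "black_tail": 0.5,
    },
    "loop": {  # assumed representation of "no fades/tail"
        "video_fade": 0.0,
        "music_fade_start": 0.0,
        "end_sfx_start": 0.0,
        "black_tail": 0.0,
    },
}


def resolve_ending(inputs):
    """Pick a preset, then let explicitly supplied fields win over it."""
    ending = inputs.get("ending") or {}
    # Nested ending.type overrides the undocumented top-level ending_type;
    # the documented default is "social".
    ending_type = ending.get("type") or inputs.get("ending_type") or "social"
    if ending_type not in PRESETS:
        raise ValueError(f"unknown ending type: {ending_type!r}")
    resolved = dict(PRESETS[ending_type])
    for field in list(resolved):
        if field in ending:
            resolved[field] = ending[field]
    resolved["type"] = ending_type
    return resolved
```

With this sketch, {"ending": {"type": "cinematic", "black_tail": 0}} keeps the 1.0s cinematic fade but drops the black tail, which is the "per-field overrides win" rule in miniature.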
Assemble Clips assemble_clips
| Category | video_process |
| Mode | sync |
| Timeout | 5m |
| Cost | free (cost_per_unit: 0) |
| MCP action | none — internal/REST only. No MCP action maps here; video(assemble) routes to video_assemble_full. Reachable only via direct POST /v1/jobs/assemble_clips or as a building block of video_assemble_full. (proxy.ts maps it to video/assemble for error-hint purposes only.) |
Concatenate clips in array order. If all transitions are cut/hold/match-cut, uses the concat demuxer with stream copy (fast, no re-encode); if any dissolve/wipe is present, re-encodes via the xfade filter (libx264, CRF 19). Clips lacking an audio track get a silent track injected first (ensureAudioTrack). Implemented by AssembleClips (assemble_clips.go), dispatched by execAssembleClips.
Parameters (from input_schema, cross-checked against assemble_clips.go):
| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| clips | array<object> | yes | — | Ordered {path, trans_in, duration}. |
| clips[].path | string | yes | — | Clip path. Rejected if it contains ', newline, or CR (concat-list injection guard). |
| clips[].trans_in | string | no | cut | Transition INTO this clip (first clip's is ignored). YAML enum: cut, dissolve, wipe, match-cut, j-cut, l-cut, hold. Handler: cut/hold/match-cut → stream-copy concat; dissolve → xfade fade; wipe → xfade wipeleft; any other value (incl. j-cut/l-cut) → default fade xfade (plain crossfade, no audio lead/lag). |
| clips[].duration | number | no | — | Clip duration override in seconds (0 = full clip). Handler reads m["duration"]. |
| out | string | yes | — | Output video path. |
| xfade_duration | number | no | 0.1 | Dissolve/wipe duration (s); handler clamps ≤0 to 0.1. |
Duration caveat (documented in YAML): each dissolve/wipe shortens total output by
xfade_duration. Plan VO length against the assembled duration, not the raw clip sum.
Output: { ok, outputs:{video, local_path}, metrics:{num_clips, total_duration_sec, transitions_applied, method:"concat_demuxer"|"xfade_filter", ...} }.
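The duration caveat and the concat-versus-xfade choice can be estimated before submitting a job. The sketch below is a planning aid, not the handler: plan_assembly and its durations argument are hypothetical (the real handler probes the files itself), and it assumes every transition that is not stream-copied costs exactly one xfade_duration of overlap.

```python
# Estimate the assembled duration and encode path for assemble_clips.
# Transition classes follow the trans_in row of the parameter table.
STREAM_COPY = {"cut", "hold", "match-cut"}


def plan_assembly(clips, durations, xfade_duration=0.1):
    """clips: list of {path, trans_in}; durations: per-clip lengths in seconds."""
    if len(clips) != len(durations):
        raise ValueError("one duration per clip is required")
    if xfade_duration <= 0:
        xfade_duration = 0.1  # documented clamp
    # The first clip's trans_in is ignored, so only clips[1:] count.
    transitions = [clip.get("trans_in", "cut") for clip in clips[1:]]
    # Assumption: dissolve, wipe, and every fall-through value (j-cut, l-cut)
    # are rendered as an xfade and each overlaps by xfade_duration.
    xfade_count = sum(1 for t in transitions if t not in STREAM_COPY)
    method = "xfade_filter" if xfade_count else "concat_demuxer"
    total = sum(durations) - xfade_count * xfade_duration
    return {
        "method": method,
        "xfade_count": xfade_count,
        "estimated_duration_sec": round(total, 3),
    }
```

Three clips of 4s, 5s, and 3s with a single 0.5s dissolve come out at 11.5s instead of 12s, which is the length the VO should be planned against.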
Video + Audio Mix video_audio_mix
| Category | video_process |
| Mode | sync |
| Timeout | 5m |
| Cost | free (cost_per_unit: 0) |
| MCP action | video(mix_audio) (worker video.ts → kind video_audio_mix). MCP exposes only tracks: string[], which the worker expands into layers: the FIRST track becomes the VO (level: 0, label: "vo"), the rest are mixed at -24 dB (label: "track2"…), all with start_sec: 0. Custom per-layer level/start_sec/label and keep_original_audio are reachable via direct REST /v1/jobs/video_audio_mix. |
Overlay audio layers (VO, music, SFX) onto a video with per-layer dB level and start offset, then amix them. Video stream is copied (-c:v copy); audio re-encoded AAC 192k; output trimmed to the video length. Implemented by AudioMix (audio_mix.go), dispatched by execAudioMix.
Parameters (from input_schema, cross-checked against audio_mix.go):
| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| video | string | yes | — | Input video. (MCP mix_audio maps video_url → video.) |
| out | string | yes | — | Output video path. |
| layers | array<object> | yes | — | Each {path, level, start_sec, label}. |
| layers[].path | string | yes | — | Audio path. |
| layers[].level | number | no | 0 | dB (0 = original, −24 = background). Converted to linear via exact 10^(dB/20). |
| layers[].start_sec | number | no | 0 | Offset from video start; >0 adds adelay. |
| layers[].label | string | yes | — | Reporting label. Semantically special: label:"vo" triggers a hard error if VO is longer than video (+0.5s) and a tight-timing warning within 0.5s; label:"music" only warns when it exceeds video. |
| keep_original_audio | bool | no | false | If true, mixes the video's existing [0:a] in too. |
Output: { ok, outputs:{video, local_path}, metrics:{video_duration_sec, output_duration_sec, layers[], keep_original_audio, warnings[]} }.
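Two rules in this section are easy to show in code: the expansion of the MCP tracks list into layers, and the dB-to-linear conversion. This is a sketch of the documented behaviour, not the worker or handler source; the function names expand_tracks and db_to_linear are hypothetical.

```python
# Mirror of the documented MCP tracks -> layers expansion, plus the
# 10^(dB/20) gain conversion quoted in the layers[].level row.
def db_to_linear(level_db):
    """Convert a dB level to a linear amplitude factor."""
    return 10 ** (level_db / 20)


def expand_tracks(tracks):
    """First track becomes the VO at 0 dB; the rest are mixed at -24 dB."""
    layers = []
    for index, path in enumerate(tracks):
        is_vo = index == 0
        layers.append({
            "path": path,
            "level": 0 if is_vo else -24,
            "start_sec": 0,
            # Labels continue as track2, track3, ... after the VO.
            "label": "vo" if is_vo else f"track{index + 1}",
        })
    return layers
```

A level of −24 dB corresponds to a linear factor of roughly 0.063, so a background track contributes about 6% of its original amplitude. Anything beyond these defaults (custom levels, start offsets, keep_original_audio) needs the direct REST call.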
Audio Mix audio_mix
| Category | video_process |
| Mode | sync |
| Timeout | 5m |
| Cost | free (cost_per_unit: 0) |
| MCP action | none — deprecated alias. Registered in executor.go as "audio_mix": e.execAudioMix with the comment "deprecated name, alias for video_audio_mix". Identical YAML and identical handler to video_audio_mix. Not present in any worker action map; reachable only via direct POST /v1/jobs/audio_mix. Prefer video_audio_mix. |
Functionally identical to video_audio_mix above — same AudioMix (audio_mix.go) handler, same parameters (video, out, layers[]{path,level,start_sec,label}, keep_original_audio), same output. Kept for backward compatibility of the old name only. See video_audio_mix for the full parameter table and the label:"vo"/"music" validation behaviour.
Doc note: two YAML files (audio_mix.yaml, video_audio_mix.yaml) document a single implementation. Despite the name, this operates on a video input (requires video + layers), not audio-only mixing; audio-only mixing is the separate audio_only_mix model.
Structural Export structural_export
| Category | video_process |
| Mode | sync |
| Timeout | 5m |
| Cost | free (cost_per_unit: 0) |
| MCP action | none — internal/pipeline only. No worker action maps here; reachable via direct POST /v1/jobs/structural_export or as a final encode step in the pipeline. |
Final platform-specific structural encode — scale + letterbox-pad to target resolution and re-encode (libx264 -preset slow, +faststart). No creative/color filters. Apply after upscale and caption burn-in. Implemented by StructuralExport (structural_export.go), dispatched by execStructuralExport.
Parameters (from input_schema, cross-checked against structural_export.go):
| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| in | string | yes | — | Input video path. Handler reads inputs["in"]. |
| out | string | yes | — | Output video path. |
| platform | string | yes (handler errors if empty) | YAML default shorts | Preset enum. tiktok/reels/shorts → 1080×1920, 30fps, CRF 19, AAC 192k. youtube-long → 1920×1080, 24fps, CRF 18, AAC 192k. ads → 1080×1920, 30fps, CRF 17, AAC 256k. Unknown value → error listing valid platforms. |
Output: { ok, outputs:{video, local_path}, metrics:{platform, resolution, fps, crf, total_duration_sec} }.
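The platform presets amount to a lookup table. The sketch below restates them in Python for quick reference; lookup_platform and letterbox_filter are hypothetical helpers, and the scale/pad string is a common ffmpeg letterboxing idiom shown for illustration, not necessarily the filter string the handler builds.

```python
# Platform presets as documented in the platform row (resolution, fps, CRF,
# AAC bitrate). Everything else here is an illustrative assumption.
VERTICAL = {"resolution": (1080, 1920), "fps": 30, "crf": 19, "audio_kbps": 192}

PLATFORMS = {
    "tiktok": VERTICAL,
    "reels": VERTICAL,
    "shorts": VERTICAL,
    "youtube-long": {"resolution": (1920, 1080), "fps": 24, "crf": 18, "audio_kbps": 192},
    "ads": {"resolution": (1080, 1920), "fps": 30, "crf": 17, "audio_kbps": 256},
}


def lookup_platform(platform):
    """Return the preset, erroring on empty or unknown values."""
    if not platform:
        raise ValueError("platform is required")
    if platform not in PLATFORMS:
        valid = ", ".join(sorted(PLATFORMS))
        raise ValueError(f"unknown platform {platform!r}; valid: {valid}")
    return PLATFORMS[platform]


def letterbox_filter(width, height):
    """Scale to fit inside width x height, then pad to the exact size."""
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2"
    )
```

For instance, letterbox_filter(*lookup_platform("youtube-long")["resolution"]) yields a filter that fits the source into 1920×1080 and centers it on a padded canvas.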
Highlight Rolloff highlight_rolloff
| Category | video_process |
| Mode | sync |
| Timeout | 5m |
| Cost | free (cost_per_unit: 0) |
| MCP action | none — internal/QA-pipeline only. No worker action maps here; reachable via direct POST /v1/jobs/highlight_rolloff or the QA/fix pipeline. Intended to run only when overexposure_check fails. |
Surgical overexposure fix: compresses highlights via a fixed curves filter (all='0/0 0.85/0.85 1/0.92' — values above 85% rolled off to max 92%), audio stream-copied. After encoding it automatically re-runs the overexposure check (3% clipped threshold, 2 fps sampling) and returns the post-fix verdict. This is the only sanctioned creative color operation in the pipeline. Implemented by HighlightRolloff (highlight_rolloff.go), dispatched by execHighlightRolloff.
Parameters (from input_schema, cross-checked against highlight_rolloff.go):
| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| in | string | yes | — | Input video path. Handler reads inputs["in"]. |
| out | string | yes | — | Output video path. |
No tunable parameters — the curve and the post-check thresholds are hardcoded.
Output: { ok, outputs:{video, local_path}, metrics:{filter, total_duration_sec, post_check, post_verdict} }. Per the YAML guidance, if the source still exceeds 3% clipping after rolloff the source clips are bad and the pipeline should block to Visual Prompting — this routing is pipeline policy, the handler itself only surfaces post_verdict.
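To reproduce the rolloff locally, the documented curve can be passed to ffmpeg's curves filter. The sketch below only builds an argument list: rolloff_args is a hypothetical helper, the curve string and the audio stream copy come from the description above, and the video encoder settings are left at ffmpeg defaults because the handler's exact flags are not documented here.

```python
# Build an ffmpeg argument list that applies the documented highlight curve.
# Only the curve points and "-c:a copy" are taken from the documentation.
ROLLOFF_CURVE = "0/0 0.85/0.85 1/0.92"


def rolloff_args(src, dst):
    """Return argv for a local approximation of highlight_rolloff."""
    return [
        "ffmpeg",
        "-i", src,
        # Values above 85% are rolled off so that full white maps to 92%.
        "-vf", f"curves=all='{ROLLOFF_CURVE}'",
        # Audio is stream-copied, as described for the handler.
        "-c:a", "copy",
        dst,
    ]
```

The list can be handed to subprocess.run(rolloff_args("in.mp4", "out.mp4"), check=True) on a machine with ffmpeg installed. The automatic post-fix overexposure check is not part of this sketch.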
Sync Lipsync v3 lipsync_v3
sync-3, Sync.so's most powerful lipsync model, syncs mouth movement to an audio track on a talking-head video using native visual intelligence.
Call it via — video tool, action: "lipsync" (MCP) · raw: POST /v1/jobs/lipsync_v3
| Cost | 1600 cr per minute of output |
| Mode / timeout | webhook / 15m (from our YAML) |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | ✓ | — | — | URL of the input video (face visible) |
| audio_url | string | ✓ | — | — | URL of the input audio |
| sync_mode | string (enum) | | cut_off (model); our video(lipsync) sends loop unless you pass one | cut_off, loop, bounce, silence, remap | How to handle audio/video duration mismatch. cut_off trims to the shorter input (drops the tail of longer audio); loop/bounce repeat the video (never drops speech); silence pads with silence; remap speed-adjusts |
| options | object | | — | nested Sync3GenerationOptions | Additional Sync.so generation options (advanced). Fields: sync_mode (overrides top-level), model_mode (lips/face/head/lipsync/emotion/talking_head), prompt (emotion: happy/sad/angry/disgusted/surprised/neutral), temperature (0–1, ignored by sync-3), active_speaker_detection (object, for multi-person videos), occlusion_detection_enabled (bool, ignored by sync-3) |
Our wrapper params (not part of the model schema): out (required — workdir-relative output path) and mock (optional — test placeholder). No format mapping applies (our format_field is empty; sync-3 has no size/resolution field).
Limits:
- Accepted video formats: mp4, mov, webm, m4v, gif
- Accepted audio formats: mp3, ogg, wav, m4a, aac
- Billing is per minute of output video at 1600 cr/min (no published hard cap on duration/resolution/file size).
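Because lipsync is billed per minute of output, sync_mode affects cost. The sketch below gives a rough pre-flight estimate for the cut_off case only; estimate_credits and cut_off_duration are hypothetical helpers, and straight pro-rata billing with no rounding is an assumption, since the rounding rule is not documented here.

```python
# Rough credit estimate for per-minute billed models.
# Rates are the documented ones; pro-rata billing is an assumption.
LIPSYNC_V3_CR_PER_MIN = 1600
CAPTIONS_AUTO_CR_PER_MIN = 6


def cut_off_duration(video_sec, audio_sec):
    """cut_off trims to the shorter input, so output is the minimum."""
    return min(video_sec, audio_sec)


def estimate_credits(seconds, credits_per_minute):
    """Unrounded pro-rata estimate; actual billing may round differently."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return seconds / 60 * credits_per_minute
```

A 45s video with 30s of audio in cut_off mode produces 30s of output, or about 800 cr under these assumptions. With loop or bounce the video repeats so that no speech is dropped, so the output, and the bill, can follow the longer audio instead.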