# Video processing & assembly

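Auto-captions and lipsync are credit-billed generation models; the local ffmpeg pipelines (assembly, audio mix, structural export, highlight rolloff) are our own implementation and free.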

> Generations are charged in credits (see [Credits & plans](/guide/billing)). Every generation model also accepts `mock: true` for a free placeholder result.

### Auto Subtitles `captions_auto`

Automatically transcribe a video's audio and burn in karaoke-style subtitles with word-level highlighting, customizable Google Fonts, colors, and animation.

**Call it via** — `video` tool, `action: "captions"` (MCP) · raw: `POST /v1/jobs/captions_auto`

| | |
|---|---|
| **Cost** | 6 cr per minute of video |
| **Mode / timeout** | webhook / 10m |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `video_url` | string | ✓ | — | — | URL of the video file to add automatic subtitles to (max 100 MB). |
| `language` | string | | `en` | 2-letter code (`en`, `es`, `fr`, `de`, `it`, `pt`, `nl`, `ja`, `zh`, `ko`, …) or 3-letter ISO code (`eng`, `spa`, `fra`, …) | Language code for transcription. |
| `font_name` | string | | `Montserrat` | any Google Font name (e.g. `Poppins`, `Bebas Neue`, `Oswald`, `Inter`, `Roboto`) | Font from fonts.google.com. |
| `font_size` | integer | | `100` | 20–150 | Font size in pixels (TikTok style uses larger text). |
| `font_weight` | string | | `bold` | `normal`, `bold`, `black` | Font weight. |
| `font_color` | string | | `white` | `white`, `black`, `red`, `green`, `blue`, `yellow`, `orange`, `purple`, `pink`, `brown`, `gray`, `cyan`, `magenta` | Subtitle text color for non-active words. |
| `highlight_color` | string | | `purple` | same 13 colors as `font_color` | Color for the currently speaking word (karaoke-style highlight). |
| `stroke_width` | integer | | `3` | 0–10 | Text stroke/outline width in pixels (0 = no stroke). |
| `stroke_color` | string | | `black` | same 13 colors as `font_color` | Text stroke/outline color. |
| `background_color` | string | | `none` | the 13 colors above plus `none`, `transparent` | Background color behind text. |
| `background_opacity` | number | | `0` | 0.0–1.0 | Background opacity (0 = transparent, 1 = opaque). |
| `position` | string | | `bottom` | `top`, `center`, `bottom` | Vertical position of subtitles. |
| `y_offset` | integer | | `75` | -200–200 | Vertical offset in pixels (positive = down, negative = up). |
| `words_per_subtitle` | integer | | `3` | 1–12 | Max words per subtitle segment (1 = single word, 8–12 = full sentences). |
| `enable_animation` | boolean | | `true` | true / false | Bounce-style entrance animation for subtitles. |

Our wrapper params (not part of the model schema): `out` (required — workdir-relative output path) and `mock` (optional — test placeholder, no real generation). This model has no `format`/size mapping (`format_field` is empty).

**Limits** — `video_url` max file size 100 MB. Accepted input formats: mp4, mov, webm, m4v, gif. Cost is metered at 6 cr per minute of video. Transcription is via ElevenLabs speech-to-text.
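
The ranges and enums in the table can be checked client-side before a job is submitted. The sketch below is illustrative only: `build_captions_request` is a hypothetical helper, and placing `out` in the same flat JSON object as the model parameters is an assumption about the request shape, not something the schema above confirms.

```python
# Ranges and enums are copied from the parameter table; the helper name,
# the flat body shape, and the error type are assumptions for illustration.
COLORS = {"white", "black", "red", "green", "blue", "yellow", "orange",
          "purple", "pink", "brown", "gray", "cyan", "magenta"}
INT_RANGES = {
    "font_size": (20, 150),
    "stroke_width": (0, 10),
    "y_offset": (-200, 200),
    "words_per_subtitle": (1, 12),
}
ENUMS = {
    "font_weight": {"normal", "bold", "black"},
    "font_color": COLORS,
    "highlight_color": COLORS,
    "stroke_color": COLORS,
    "background_color": COLORS | {"none", "transparent"},
    "position": {"top", "center", "bottom"},
}


def build_captions_request(video_url, out, **style):
    """Validate style params and return a body for POST /v1/jobs/captions_auto."""
    if not video_url or not out:
        raise ValueError("video_url and out are required")
    for name, value in style.items():
        if name in INT_RANGES:
            low, high = INT_RANGES[name]
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                raise ValueError(f"{name} must be an integer in {low}..{high}")
        elif name in ENUMS and value not in ENUMS[name]:
            raise ValueError(f"{name} must be one of {sorted(ENUMS[name])}")
        elif name == "background_opacity" and not 0.0 <= value <= 1.0:
            raise ValueError("background_opacity must be in 0.0..1.0")
    # Params without a listed range (language, font_name, mock, ...) pass through.
    return {"video_url": video_url, "out": out, **style}


body = build_captions_request(
    "https://example.com/clip.mp4",  # placeholder URL
    "out/captioned.mp4",
    font_size=120,
    highlight_color="yellow",
    mock=True,  # free placeholder result
)
```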

### Full Video Assembly `video_assemble_full`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 10m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | `video(assemble)` (worker `video.ts` → kind `video_assemble_full`) |

One-call complete assembly: concatenates clips with visual transitions (xfade), mixes audio layers (VO / music / ambient SFX / transition SFX / intro SFX / end SFX), and applies intro fade + ending preset. Replaces `assemble_clips` + `audio_mix` in a single job. Implemented by `VideoAssembleFull` (`video_assemble_full.go`), dispatched by `execVideoAssembleFull`. Pre-validates that VO fits inside the assembled duration (hard error if VO is >0.5s longer). When the VO and video durations diverge by more than 3s, the job result gains a `warnings` array flagging the mismatch.
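
The two VO thresholds described above can be mirrored in a pre-flight check. This is a minimal sketch, not the handler's code: `check_vo_fit` is a hypothetical name, and it compares raw durations without accounting for `vo_offset_sec`, which the real validation may or may not include.

```python
VO_OVERRUN_TOLERANCE_SEC = 0.5   # hard error beyond this overrun
DIVERGENCE_WARNING_SEC = 3.0     # warning beyond this difference


def check_vo_fit(video_duration, vo_duration):
    """Return a list of warnings; raise if the VO overruns the video."""
    overrun = vo_duration - video_duration
    if overrun > VO_OVERRUN_TOLERANCE_SEC:
        raise ValueError(
            f"VO is {overrun:.2f}s longer than the assembled video"
        )
    warnings = []
    if abs(overrun) > DIVERGENCE_WARNING_SEC:
        warnings.append(
            f"VO and video durations diverge by {abs(overrun):.2f}s"
        )
    return warnings
```

Because an overrun above 0.5s is already an error, the warning in this sketch can only fire when the VO is more than 3s shorter than the video.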

**Parameters** (from `input_schema`, cross-checked against `executor.go`/`video_assemble_full.go`):

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `clips` | array&lt;object> | yes | — | Ordered. Each `{path, transition, transition_sfx}`. |
| `clips[].path` | string | yes | — | Clip path. |
| `clips[].transition` | string | no | `cut` | Visual transition INTO this clip. YAML enum: cut, dissolve, fadeblack, fadewhite, wipeleft, wiperight, smoothleft, blur, flash, distance, circlecrop. **Caveat:** the underlying `AssembleClips` only implements `cut`→concat, `dissolve`→xfade fade, `wipe`→wipeleft; every other value falls through to a plain `fade` xfade. So fadeblack/blur/flash/etc. currently render as a crossfade, not their named effect. |
| `clips[].transition_sfx` | string | no | — | SFX path played centered on this cut (`-0.15s` lead, volume 0.7). |
| `out` | string | yes | — | Output video path. |
| `xfade_duration` | number | no | `0.2` | Visual transition duration (s). |
| `intro` | object | no | — | `{fade_in, fade_in_duration, sfx}`. |
| `intro.fade_in` | bool | no | `false` | Hard start unless true. |
| `intro.fade_in_duration` | number | no | `0.3` | |
| `intro.sfx` | string | no | — | Intro whoosh (volume 0.7). |
| `vo` | string | no | — | Voiceover path (0 dB by default). |
| `vo_level` | number | no | `0` | VO volume (dB). |
| `vo_offset_sec` | number | no | `0` (min 0) | Delay before VO starts — align speech with a later clip. Negative is rejected. |
| `music` | string | no | — | Music bed path. |
| `music_level` | number | no | `-24` | Music volume (dB). The handler treats `0` as unset and substitutes −24. |
| `sfx_ambient` | string | no | — | Ambient SFX path. |
| `sfx_level` | number | no | `-18` | SFX volume (dB). The handler treats `0` as unset and substitutes −18. |
| `ending` | object | no | — | `{type, end_sfx, video_fade, music_fade_start, end_sfx_start, black_tail}`. |
| `ending.type` | string | no | `social` | Preset enum: social / cinematic / loop. social: fade 0.3s, music fade −0.5s, end_sfx −0.3s. cinematic: fade 1.0s, music −2.0s, sfx −1.0s, 0.5s black tail. loop: no fades/tail. Per-field overrides win over the preset. |

> **Undocumented input:** the handler also reads a **top-level `ending_type` string** (`executor.go:358`) before merging `ending.type`. Not declared in the YAML; nested `ending.type` overrides it. Prefer the documented nested form.

**Output:** `{ ok, outputs:{video, local_path}, metrics:{num_clips, video_duration, output_duration, ending_type, video_fade, music_fade_start, black_tail, xfade_duration, audio_layers}, warnings[] }`. The `warnings` array is present when the VO/video durations diverge by more than 3s.
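
The preset and override rules for `ending` can be sketched as a small merge. The preset numbers come from the `ending.type` row; the zero values for fields a preset does not mention (for example `black_tail` on `social`, and everything on `loop`) are assumptions, as is the helper name `resolve_ending`.

```python
# Preset values from the `ending.type` row. Zeros stand in for fields the
# docs do not specify for a preset (assumption).
ENDING_PRESETS = {
    "social": {"video_fade": 0.3, "music_fade_start": -0.5,
               "end_sfx_start": -0.3, "black_tail": 0.0},
    "cinematic": {"video_fade": 1.0, "music_fade_start": -2.0,
                  "end_sfx_start": -1.0, "black_tail": 0.5},
    "loop": {"video_fade": 0.0, "music_fade_start": 0.0,
             "end_sfx_start": 0.0, "black_tail": 0.0},
}


def resolve_ending(inputs):
    """Merge preset and per-field overrides for a job's `ending` settings."""
    ending = inputs.get("ending") or {}
    # Nested ending.type wins over the undocumented top-level ending_type.
    ending_type = ending.get("type") or inputs.get("ending_type") or "social"
    if ending_type not in ENDING_PRESETS:
        raise ValueError(f"unknown ending type: {ending_type!r}")
    resolved = dict(ENDING_PRESETS[ending_type])
    for key in list(resolved):
        if ending.get(key) is not None:
            resolved[key] = ending[key]  # per-field override wins over preset
    resolved["type"] = ending_type
    return resolved
```

For example, `resolve_ending({"ending": {"type": "cinematic", "video_fade": 0.5}})` keeps the cinematic music and SFX timings and the 0.5s black tail, but shortens the video fade to 0.5s.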

---

### Assemble Clips `assemble_clips`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 5m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | **none — internal/REST only.** No MCP action maps here; `video(assemble)` routes to `video_assemble_full`. Reachable only via direct `POST /v1/jobs/assemble_clips` or as a building block of `video_assemble_full`. (proxy.ts maps it to `video/assemble` for error-hint purposes only.) |

Concatenate clips in array order. If all transitions are cut/hold/match-cut, uses the concat demuxer with **stream copy** (fast, no re-encode); if any dissolve/wipe is present, re-encodes via the `xfade` filter (libx264, CRF 19). Clips lacking an audio track get a silent track injected first (`ensureAudioTrack`). Implemented by `AssembleClips` (`assemble_clips.go`), dispatched by `execAssembleClips`.

**Parameters** (from `input_schema`, cross-checked against `assemble_clips.go`):

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `clips` | array&lt;object> | yes | — | Ordered `{path, trans_in, duration}`. |
| `clips[].path` | string | yes | — | Clip path. Rejected if it contains `'`, newline, or CR (concat-list injection guard). |
| `clips[].trans_in` | string | no | `cut` | Transition INTO this clip (first clip's is ignored). YAML enum: cut, dissolve, wipe, match-cut, j-cut, l-cut, hold. Handler: cut/hold/match-cut → stream-copy concat; dissolve → xfade fade; wipe → xfade wipeleft; **any other value (incl. j-cut/l-cut) → default `fade` xfade** (plain crossfade, no audio lead/lag). |
| `clips[].duration` | number | no | — | Clip duration override in seconds (0 = full clip). Handler reads `m["duration"]`. |
| `out` | string | yes | — | Output video path. |
| `xfade_duration` | number | no | `0.1` | Dissolve/wipe duration (s); handler clamps ≤0 to 0.1. |

> **Duration caveat (documented in YAML):** each dissolve/wipe shortens total output by `xfade_duration`. Plan VO length against the *assembled* duration, not the raw clip sum.
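
To plan VO length against the assembled duration, the rules above can be approximated as follows. This is a planning sketch under stated assumptions: `plan_assembly` is a hypothetical helper, each clip dict must carry a measured `duration` in seconds (unlike the API, where `0` means the full clip), and in the re-encode path only non-stream-copy transitions are assumed to shorten the output.

```python
STREAM_COPY_TRANSITIONS = {"cut", "hold", "match-cut"}


def plan_assembly(clips, xfade_duration=0.1):
    """Estimate the method and output duration for an assemble_clips job."""
    if xfade_duration <= 0:
        xfade_duration = 0.1  # mirrors the handler's clamp
    # The first clip's trans_in is ignored, so only clips[1:] contribute.
    transitions = [clip.get("trans_in", "cut") for clip in clips[1:]]
    # dissolve, wipe, and any unrecognized value (j-cut, l-cut) become an xfade.
    xfades = sum(1 for t in transitions if t not in STREAM_COPY_TRANSITIONS)
    raw_total = sum(clip["duration"] for clip in clips)
    return {
        "method": "xfade_filter" if xfades else "concat_demuxer",
        "xfades": xfades,
        "estimated_duration_sec": raw_total - xfades * xfade_duration,
    }


plan = plan_assembly(
    [
        {"path": "a.mp4", "duration": 4.0},
        {"path": "b.mp4", "duration": 4.0, "trans_in": "dissolve"},
        {"path": "c.mp4", "duration": 4.0, "trans_in": "cut"},
    ],
    xfade_duration=0.5,
)
# One dissolve at 0.5s: 12.0s of raw footage yields about 11.5s of output.
```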

**Output:** `{ ok, outputs:{video, local_path}, metrics:{num_clips, total_duration_sec, transitions_applied, method:"concat_demuxer"|"xfade_filter", ...} }`.

---

### Video + Audio Mix `video_audio_mix`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 5m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | `video(mix_audio)` (worker `video.ts` → kind `video_audio_mix`). **MCP exposes only `tracks: string[]`**, which the worker expands into `layers`: the FIRST track becomes the VO (`level: 0`, `label: "vo"`), the rest are mixed at `-24 dB` (`label: "track2"…`), all with `start_sec: 0`. Custom per-layer `level`/`start_sec`/`label` and `keep_original_audio` are reachable via direct REST `/v1/jobs/video_audio_mix`. |

Overlay audio layers (VO, music, SFX) onto a video with per-layer dB level and start offset, then `amix` them. Video stream is copied (`-c:v copy`); audio re-encoded AAC 192k; output trimmed to the video length. Implemented by `AudioMix` (`audio_mix.go`), dispatched by `execAudioMix`.
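
The `tracks` to `layers` expansion described in the MCP action row can be sketched like this. It is a Python rendering of the documented behaviour, not the worker's actual `video.ts` code, and `expand_tracks` is a hypothetical name.

```python
def expand_tracks(tracks):
    """Expand an MCP-style list of track paths into REST-style layers."""
    layers = []
    for index, path in enumerate(tracks):
        is_first = index == 0
        layers.append({
            "path": path,
            "level": 0 if is_first else -24,       # first track is the VO
            "start_sec": 0,
            "label": "vo" if is_first else f"track{index + 1}",
        })
    return layers


layers = expand_tracks(["vo.mp3", "music.mp3", "whoosh.wav"])
```

Anything beyond this fixed mapping (custom levels, offsets, labels, or `keep_original_audio`) needs a direct REST call with an explicit `layers` array.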

**Parameters** (from `input_schema`, cross-checked against `audio_mix.go`):

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `video` | string | yes | — | Input video. (MCP `mix_audio` maps `video_url` → `video`.) |
| `out` | string | yes | — | Output video path. |
| `layers` | array&lt;object> | yes | — | Each `{path, level, start_sec, label}`. |
| `layers[].path` | string | yes | — | Audio path. |
| `layers[].level` | number | no | `0` | dB (0 = original, −24 = background). Converted to linear via exact `10^(dB/20)`. |
| `layers[].start_sec` | number | no | `0` | Offset from video start; >0 adds `adelay`. |
| `layers[].label` | string | yes | — | Reporting label. **Semantically special:** `label:"vo"` triggers a hard error if VO is longer than video (+0.5s) and a tight-timing warning within 0.5s; `label:"music"` only warns when it exceeds video. |
| `keep_original_audio` | bool | no | `false` | If true, mixes the video's existing `[0:a]` in too. |

**Output:** `{ ok, outputs:{video, local_path}, metrics:{video_duration_sec, output_duration_sec, layers[], keep_original_audio, warnings[]} }`.
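
As a worked example of the `10^(dB/20)` conversion, and of a REST body that uses per-layer settings the MCP action does not expose, consider the sketch below. The file paths are placeholders and `db_to_linear` is an illustrative helper, not the handler's function.

```python
def db_to_linear(level_db):
    """Convert a dB level to a linear amplitude factor: 10^(dB/20)."""
    return 10 ** (level_db / 20)


# 0 dB leaves the signal unchanged; -20 dB is one tenth of the amplitude;
# the -24 dB background default works out to roughly 0.063.
gains = {db: db_to_linear(db) for db in (0, -20, -24)}

# Body for a direct POST /v1/jobs/video_audio_mix (placeholder paths).
request_body = {
    "video": "work/assembled.mp4",
    "out": "work/mixed.mp4",
    "keep_original_audio": True,
    "layers": [
        {"path": "work/vo.mp3", "level": 0, "start_sec": 1.5, "label": "vo"},
        {"path": "work/bed.mp3", "level": -24, "start_sec": 0, "label": "music"},
    ],
}
```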

---

### Audio Mix `audio_mix`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 5m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | **none — deprecated alias.** Registered in `executor.go` as `"audio_mix": e.execAudioMix` with the comment *"deprecated name, alias for video_audio_mix"*. Identical YAML and identical handler to `video_audio_mix`. Not present in any worker action map; reachable only via direct `POST /v1/jobs/audio_mix`. Prefer **video_audio_mix**. |

Functionally identical to **video_audio_mix** above — same `AudioMix` (`audio_mix.go`) handler, same parameters (`video`, `out`, `layers[]{path,level,start_sec,label}`, `keep_original_audio`), same output. Kept for backward compatibility of the old name only. See video_audio_mix for the full parameter table and the `label:"vo"`/`"music"` validation behaviour.

> **Doc note:** two YAML files (`audio_mix.yaml`, `video_audio_mix.yaml`) document a single implementation. Despite the name, this operates on a **video** input (requires `video` + `layers`), not audio-only mixing — audio-only mixing is the separate `audio_only_mix` model.

---

### Structural Export `structural_export`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 5m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | **none — internal/pipeline only.** No worker action maps here; reachable via direct `POST /v1/jobs/structural_export` or as a final encode step in the pipeline. |

Final platform-specific structural encode — scale + letterbox-pad to target resolution and re-encode (libx264 `-preset slow`, `+faststart`). **No creative/color filters.** Apply after upscale and caption burn-in. Implemented by `StructuralExport` (`structural_export.go`), dispatched by `execStructuralExport`.

**Parameters** (from `input_schema`, cross-checked against `structural_export.go`):

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Input video path. Handler reads `inputs["in"]`. |
| `out` | string | yes | — | Output video path. |
| `platform` | string | yes (handler errors if empty) | YAML default `shorts` | Preset enum. tiktok/reels/shorts → 1080×1920, 30fps, CRF 19, AAC 192k. youtube-long → 1920×1080, 24fps, CRF 18, AAC 192k. ads → 1080×1920, 30fps, CRF 17, AAC 256k. Unknown value → error listing valid platforms. |

**Output:** `{ ok, outputs:{video, local_path}, metrics:{platform, resolution, fps, crf, total_duration_sec} }`.
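
The platform presets can be expressed as a lookup table, which is handy for checking a `platform` value before submitting. The values are copied from the `platform` row; the `Preset` type and `preset_for` helper are illustrative names, not the handler's.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    width: int
    height: int
    fps: int
    crf: int
    audio_kbps: int


VERTICAL = Preset(1080, 1920, 30, 19, 192)
PLATFORM_PRESETS = {
    "tiktok": VERTICAL,
    "reels": VERTICAL,
    "shorts": VERTICAL,
    "youtube-long": Preset(1920, 1080, 24, 18, 192),
    "ads": Preset(1080, 1920, 30, 17, 256),
}


def preset_for(platform):
    """Return the encode preset, or raise listing the valid platforms."""
    if not platform:
        raise ValueError("platform is required")
    try:
        return PLATFORM_PRESETS[platform]
    except KeyError:
        valid = ", ".join(sorted(PLATFORM_PRESETS))
        raise ValueError(
            f"unknown platform {platform!r}; valid: {valid}"
        ) from None
```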

---

### Highlight Rolloff `highlight_rolloff`

| | |
|---|---|
| **Category** | video_process |
| **Mode** | sync |
| **Timeout** | 5m |
| **Cost** | free (`cost_per_unit: 0`) |
| **MCP action** | **none — internal/QA-pipeline only.** No worker action maps here; reachable via direct `POST /v1/jobs/highlight_rolloff` or the QA/fix pipeline. Intended to run only when `overexposure_check` fails. |

Surgical overexposure fix: compresses highlights via a fixed `curves` filter (`all='0/0 0.85/0.85 1/0.92'` — values above 85% rolled off to max 92%), audio stream-copied. After encoding it **automatically re-runs the overexposure check** (3% clipped threshold, 2 fps sampling) and returns the post-fix verdict. This is the only sanctioned creative color operation in the pipeline. Implemented by `HighlightRolloff` (`highlight_rolloff.go`), dispatched by `execHighlightRolloff`.

**Parameters** (from `input_schema`, cross-checked against `highlight_rolloff.go`):

| Param | Type | Req | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Input video path. Handler reads `inputs["in"]`. |
| `out` | string | yes | — | Output video path. |

No tunable parameters — the curve and the post-check thresholds are hardcoded.

**Output:** `{ ok, outputs:{video, local_path}, metrics:{filter, total_duration_sec, post_check, post_verdict} }`. Per the YAML guidance, if the source still exceeds 3% clipping after rolloff the source clips are bad and the pipeline should block to Visual Prompting — this routing is pipeline policy, the handler itself only surfaces `post_verdict`.
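
For reproducing the rolloff locally, an ffmpeg invocation along these lines should apply the same curve with the audio stream copied. It is a sketch only: the handler's video encoder settings are not documented here, so this leaves them at ffmpeg's defaults, and `rolloff_args` is a hypothetical helper.

```python
# Curve points from the description above: identity up to 85%, then the
# top of the range is compressed so that 100% maps to 92%.
ROLLOFF_CURVE = "0/0 0.85/0.85 1/0.92"


def rolloff_args(src, dst):
    """Build an ffmpeg argv list that applies the highlight rolloff curve."""
    return [
        "ffmpeg",
        "-i", src,
        "-vf", f"curves=all='{ROLLOFF_CURVE}'",
        "-c:a", "copy",   # audio is stream-copied, as in the handler
        dst,
    ]


args = rolloff_args("in.mp4", "out.mp4")
# Run with subprocess.run(args, check=True) on a machine with ffmpeg installed.
```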

### Sync Lipsync v3 `lipsync_v3`

sync-3 is Sync.so's most powerful lipsync model. It re-syncs mouth movement in a talking-head video to a supplied audio track.

**Call it via** — `video` tool, `action: "lipsync"` (MCP) · raw: `POST /v1/jobs/lipsync_v3`

| | |
|---|---|
| **Cost** | 1600 cr per minute of output |
| **Mode / timeout** | webhook / 15m (from our YAML) |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `video_url` | string | ✓ | — | — | URL of the input video (face visible) |
| `audio_url` | string | ✓ | — | — | URL of the input audio |
| `sync_mode` | string (enum) | | `cut_off` (model); our `video(lipsync)` sends `loop` unless you pass one | `cut_off`, `loop`, `bounce`, `silence`, `remap` | How to handle audio/video duration mismatch. `cut_off` trims to the shorter input (drops the tail of longer audio); `loop`/`bounce` repeat the video (never drops speech); `silence` pads with silence; `remap` speed-adjusts |
| `options` | object | | — | nested `Sync3GenerationOptions` | Additional Sync.so generation options (advanced). Fields: `sync_mode` (overrides top-level), `model_mode` (`lips`/`face`/`head`/`lipsync`/`emotion`/`talking_head`), `prompt` (emotion: `happy`/`sad`/`angry`/`disgusted`/`surprised`/`neutral`), `temperature` (0–1, ignored by sync-3), `active_speaker_detection` (object, for multi-person videos), `occlusion_detection_enabled` (bool, ignored by sync-3) |

Our wrapper params (not part of the model schema): `out` (required — workdir-relative output path) and `mock` (optional — test placeholder). No `format` mapping applies (our `format_field` is empty; sync-3 has no size/resolution field).

**Limits**:
- Accepted video formats: `mp4`, `mov`, `webm`, `m4v`, `gif`
- Accepted audio formats: `mp3`, `ogg`, `wav`, `m4a`, `aac`
- Billing is per minute of output video at 1600 cr/min (no published hard cap on duration/resolution/file size).
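
A request body and a rough cost estimate can be sketched as below. The helper names are hypothetical, the `sync_mode` default of `loop` mirrors what `video(lipsync)` sends (the raw model default is `cut_off`), and the pro-rata estimate is an assumption because the rounding rule for partial minutes is not stated here.

```python
SYNC_MODES = ("cut_off", "loop", "bounce", "silence", "remap")
CREDITS_PER_MINUTE = 1600


def build_lipsync_request(video_url, audio_url, out, sync_mode="loop"):
    """Return a body for POST /v1/jobs/lipsync_v3."""
    if sync_mode not in SYNC_MODES:
        raise ValueError(f"sync_mode must be one of {SYNC_MODES}")
    return {
        "video_url": video_url,
        "audio_url": audio_url,
        "out": out,
        "sync_mode": sync_mode,
    }


def estimate_credits(output_seconds):
    """Pro-rata estimate; actual billing may round differently."""
    return output_seconds / 60 * CREDITS_PER_MINUTE


body = build_lipsync_request(
    "https://example.com/talking-head.mp4",  # placeholder URLs
    "https://example.com/voice.mp3",
    "out/lipsynced.mp4",
)
cost = estimate_credits(30)  # 30 seconds of output at 1600 cr/min
```

With `loop` the output follows the audio length, so the audio duration is usually the better basis for the estimate than the source video duration.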