# Audio models

This page covers the speech, sound-effect, and music generation models (with each model's input schema), plus local audio processing, which runs on our ffmpeg implementation and is free.

> Generations are charged in credits (see [Credits & plans](/guide/billing)). Every generation model also accepts `mock: true` for a free placeholder result.

### ElevenLabs TTS v3 `elevenlabs_tts_v3`

Expressive text-to-speech with inline audio-tag emotional control and 70+ language support, powered by ElevenLabs' Eleven v3 model.

**Call it via** — `audio(action: "speak")` (MCP `audio` tool) · raw: `POST /v1/jobs/elevenlabs_tts_v3`

| | |
|---|---|
| **Cost** | 20 cr per 1,000 characters |
| **Mode / timeout** | sync / 60s |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `text` | string | ✓ | — | — | Text to convert to speech. Supports inline audio tags like `[laughs]`, `[whispers]`, `[excited]`. |
| `voice` | string |  | `Rachel` | e.g. Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill (or a voice ID) | Voice name or ID. |
| `stability` | float |  | `0.5` | 0–1 | Voice stability. Lower = more expressive variation; higher = more consistent delivery. |
| `similarity_boost` | float |  | `0.75` | 0–1 | How closely the output matches the reference voice. |
| `speed` | float |  | `1` | — | Playback speed multiplier. |
| `language_code` | string |  | — | ISO 639-1 (e.g. en, ru, es, fr, de, ja, ko, zh) | Forces a specific output language. |
| `apply_text_normalization` | enum |  | `auto` | `auto`, `on`, `off` | Controls spelling-out of numbers, abbreviations, etc. |
| `seed` | int |  | — | — | Random seed for reproducibility. |
| `timestamps` | bool |  | `false` | — | When true, returns per-word timestamps in the response. |
| `output_format` | enum |  | `mp3_44100_128` | mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 | Output codec, sample rate, and bitrate. |

Our wrapper params (not part of the model schema): `out` (required — workdir-relative output path, `.mp3`) and `mock` (optional — test placeholder, no real generation). This model does not use the `format`→size mapping (`format_field` is empty).
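As a rough illustration of how the model schema and the wrapper params combine, the sketch below assembles a request body for `POST /v1/jobs/elevenlabs_tts_v3`. It assumes the endpoint accepts a flat JSON object of parameters, which this page does not confirm. `build_tts_v3_body` and the `vo/line1.mp3` path are hypothetical names used only for this example.

```python
import json


def build_tts_v3_body(text: str, out: str, **options) -> str:
    """Assemble a JSON body from model params plus the `out`/`mock` wrapper params.

    Assumption: the job endpoint takes a flat JSON object; the real
    request envelope may differ.
    """
    if not text:
        raise ValueError("text is required")
    if not out:
        raise ValueError("out is required")
    # Both sliders are documented as 0-1.
    for key in ("stability", "similarity_boost"):
        if key in options and not 0 <= options[key] <= 1:
            raise ValueError(f"{key} must be within 0-1")
    if options.get("apply_text_normalization", "auto") not in ("auto", "on", "off"):
        raise ValueError("apply_text_normalization must be auto, on, or off")
    return json.dumps({"text": text, "out": out, **options})


# `mock=True` asks for the free placeholder result instead of a real generation.
body = build_tts_v3_body(
    "[whispers] Don't wake the cat.",
    "vo/line1.mp3",
    stability=0.3,
    mock=True,
)
print(body)
```

Validating the documented ranges on the client side is a design choice for the sketch; the service presumably performs its own validation as well.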

**Limits** — Pricing is 20 cr per 1,000 characters (a 500-char paragraph = 10 cr; a 10,000-char story = 200 cr). Supported output formats: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), µ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps). 70+ languages supported. No hard maximum character count is published.
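The pricing arithmetic above can be sketched as a small helper. It assumes billing is strictly pro rata by character count with no rounding or minimum charge, and that every character of `text` (including audio tags) counts; neither detail is stated on this page. `tts_v3_cost` is a hypothetical name.

```python
def tts_v3_cost(text: str, rate_per_1000: float = 20.0) -> float:
    """Estimate credits for one generation at 20 cr per 1,000 characters.

    Assumption: pro rata billing with no rounding or minimum charge.
    """
    return len(text) * rate_per_1000 / 1000


print(tts_v3_cost("a" * 500))     # 10.0
print(tts_v3_cost("a" * 10_000))  # 200.0
```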

### ElevenLabs TTS (direct) `elevenlabs_tts_direct`

Converts text into speech using a chosen ElevenLabs `voice_id` (cloned, linked, or library voice) and returns an audio file.

**Call it via** — `audio(speak, actor_id=…)` (routes a configured actor's voice through this model; plain `audio(speak)` without `actor_id` uses `elevenlabs_tts_v3` instead). Also used internally by `video(scene)` for per-line narration. · raw: `POST /v1/jobs/elevenlabs_tts_direct`

| | |
|---|---|
| **Cost** | 20 cr per call |
| **Mode / timeout** | sync / 60s |

**Parameters** — the model's input schema (`voice_id` is a path parameter; the rest are request-body fields):

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `voice_id` | string | ✓ | — | — | Path param. ID of the voice to use (from Get Voices). |
| `text` | string | ✓ | — | — | The text that will be converted into speech. |
| `model_id` | string |  | `eleven_multilingual_v2` | any TTS-capable model id | Model identifier; must support text-to-speech. |
| `language_code` | string \| null |  | null | ISO 639-1 | Enforces a language for the model and text normalization. |
| `voice_settings` | object \| null |  | null | see sub-properties | Per-request overrides of the voice's stored settings. |
| `voice_settings.stability` | number |  | 0.5 | 0.0–1.0 | How stable the voice is / randomness between generations. |
| `voice_settings.similarity_boost` | number |  | 0.75 | 0.0–1.0 | How closely the AI adheres to the original voice. |
| `voice_settings.style` | number |  | 0 | 0.0–1.0 | Style exaggeration of the voice. |
| `voice_settings.use_speaker_boost` | boolean |  | true | true/false | Boosts similarity to the original speaker. |
| `voice_settings.speed` | number |  | 1.0 | ~0.7–1.2 | Playback speed; values below 1 slow speech down, values above 1 speed it up. |
| `seed` | integer \| null |  | null | 0–4294967295 | Best-effort deterministic sampling. |
| `previous_text` | string \| null |  | null | — | Text preceding this request, for continuity. |
| `next_text` | string \| null |  | null | — | Text following this request, for continuity. |
| `previous_request_ids` | string[] \| null |  | null | max 3 | Request ids of prior samples, for continuity. |
| `next_request_ids` | string[] \| null |  | null | max 3 | Request ids of later samples, for continuity. |
| `pronunciation_dictionary_locators` | object[] \| null |  | null | max 3 | Pronunciation dictionary locators (id, version_id). |
| `apply_text_normalization` | enum |  | `auto` | `auto`, `on`, `off` | Controls number/date spell-out normalization. |
| `apply_language_text_normalization` | boolean |  | false | true/false | Language-specific normalization (Japanese only; raises latency). |
| `output_format` | enum (query) |  | `mp3_44100_128` | `mp3_22050_32`, `mp3_44100_32/64/96/128/192`, `pcm_8000/16000/22050/24000/44100`, `ulaw_8000`, `alaw_8000`, `opus_48000_*`, etc. (28 values) | Query param. `codec_samplerate_bitrate`; mp3_192 needs Creator+, pcm/wav 44.1kHz needs Pro+. |
| `enable_logging` | boolean (query) |  | true | true/false | Query param. false = zero-retention mode (enterprise only). |

Our wrapper params (not part of the model schema): `out` (required — output audio filename, mp3) and `mock` (optional — test placeholder). This model has no `format`→size mapping (`format_field` is empty in our YAML).
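The `previous_text` and `next_text` fields are easiest to see when a script is split into several lines. The sketch below builds one parameter set per line and fills in the neighbouring text for continuity. The function name, the `line_NN.mp3` naming scheme, and the `voice_abc` ID are hypothetical; how the `voice_id` path parameter is actually passed through our job endpoint is not specified here, so it is simply kept as a key.

```python
def stitch_requests(lines: list[str], voice_id: str) -> list[dict]:
    """Build per-line parameter sets with neighbouring text for continuity."""
    requests = []
    for i, line in enumerate(lines):
        params = {
            "voice_id": voice_id,
            "text": line,
            "out": f"line_{i:02d}.mp3",  # hypothetical naming scheme
        }
        if i > 0:
            params["previous_text"] = lines[i - 1]
        if i + 1 < len(lines):
            params["next_text"] = lines[i + 1]
        requests.append(params)
    return requests


script = ["It was late.", "Nobody answered the door.", "Then the lights went out."]
for params in stitch_requests(script, "voice_abc"):
    print(params)
```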

**Limits** — model limits: `seed` 0–4294967295; up to 3 `pronunciation_dictionary_locators`, 3 `previous_request_ids`, 3 `next_request_ids` per request; output formats limited to the 28 `output_format` enum values (mp3 192kbps requires Creator tier or above; PCM/WAV at 44.1kHz requires Pro tier or above). No hard maximum text length is published for this endpoint, so no character cap is asserted here (our YAML's "keep under 5000 characters" is guidance, not a confirmed limit).
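A client could check the limits listed above before submitting a job. This is a minimal sketch, assuming parameters are held in a plain dict; `check_tts_direct` is a hypothetical helper and not part of any SDK. It deliberately applies no text-length cap, matching the note above.

```python
MAX_SEED = 4294967295
CAPPED_LISTS = (
    "previous_request_ids",
    "next_request_ids",
    "pronunciation_dictionary_locators",
)


def check_tts_direct(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in ("voice_id", "text"):
        if not params.get(key):
            problems.append(f"{key} is required")
    seed = params.get("seed")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        problems.append(f"seed must be within 0-{MAX_SEED}")
    for key in CAPPED_LISTS:
        value = params.get(key)
        if value is not None and len(value) > 3:
            problems.append(f"{key} accepts at most 3 entries")
    return problems


print(check_tts_direct({"voice_id": "v1", "text": "Hello", "seed": 7}))
print(check_tts_direct({"text": "Hello", "previous_request_ids": ["a", "b", "c", "d"]}))
```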

### ElevenLabs Sound Effects `elevenlabs_sfx`

Generate sound effects (foley, ambience, UI, impacts) from a text description using ElevenLabs' Sound Effects V2 model.

**Call it via** — `audio(sfx)` (the `audio` MCP tool with `action: "sfx"`; pass your description in `prompt`, which the worker maps to the model's `text` field) · raw: `POST /v1/jobs/elevenlabs_sfx`

| | |
|---|---|
| **Cost** | Billed per second of audio |
| **Mode / timeout** | sync / 60s |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `text` | string | ✓ | — | max 450 characters | The text describing the sound effect to generate. |
| `duration_seconds` | number | | none (model decides) | `0.5`–`22` (nullable) | Duration in seconds. If omitted/null, optimal duration is determined from the prompt. |
| `prompt_influence` | number | | `0.3` | `0`–`1` | How closely to follow the prompt. Higher values mean less variation. |
| `output_format` | string (enum) | | `mp3_44100_128` | `mp3_22050_32`, `mp3_44100_32`, `mp3_44100_64`, `mp3_44100_96`, `mp3_44100_128`, `mp3_44100_192`, `pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `pcm_44100`, `pcm_48000`, `ulaw_8000`, `alaw_8000`, `opus_48000_32`, `opus_48000_64`, `opus_48000_96`, `opus_48000_128`, `opus_48000_192` | Output audio format, as `codec_sampleRate_bitrate`. |
| `loop` | boolean | | `false` | `true` / `false` | Whether to create a sound effect that loops smoothly. |

Our wrapper params (not part of the model schema): `out` (required — workdir-relative output path, e.g. `.mp3`) and `mock` (optional — test placeholder). No `format` mapping applies to this model (`format_field` is empty).

**Limits** — model limits:
- `text`: max 450 characters.
- `duration_seconds`: 0.5–22 seconds.
- `prompt_influence`: 0–1.
- Output codecs: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), μ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps).
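The limits above, together with the `prompt` to `text` mapping done by the worker, can be sketched as follows. `sfx_inputs` is a hypothetical helper; it only mirrors the documented ranges and is not the worker's actual code.

```python
def sfx_inputs(prompt, duration_seconds=None, prompt_influence=0.3, loop=False):
    """Map the MCP-style `prompt` to the model's `text` field and check limits."""
    if not prompt:
        raise ValueError("prompt is required")
    if len(prompt) > 450:
        raise ValueError("text is limited to 450 characters")
    if duration_seconds is not None and not 0.5 <= duration_seconds <= 22:
        raise ValueError("duration_seconds must be within 0.5-22")
    if not 0 <= prompt_influence <= 1:
        raise ValueError("prompt_influence must be within 0-1")
    inputs = {"text": prompt, "prompt_influence": prompt_influence, "loop": loop}
    # Leave the duration out entirely so the model can pick one from the prompt.
    if duration_seconds is not None:
        inputs["duration_seconds"] = duration_seconds
    return inputs


print(sfx_inputs("heavy wooden door creaking open", duration_seconds=3))
```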

### Minimax Music v2.6 `minimax_music`

MiniMax Music 2.6 creates complete tracks with singing, backing music, and detailed arrangements from a style description and optional lyrics.

**Call it via** — `audio(music)` MCP tool · raw: `POST /v1/jobs/minimax_music`

| | |
|---|---|
| **Cost** | 30 cr per call |
| **Mode / timeout** | webhook / 8m (from our YAML) |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `prompt` | string | ✓ | — | 10–2000 chars | Description of the music style, mood, genre, and scenario. |
| `lyrics` | string | | `""` | max 3500 chars | Song lyrics. Use `\n` to separate lines. Supports structure tags: `[Intro]`, `[Verse]`, `[Pre Chorus]`, `[Chorus]`, `[Post Chorus]`, `[Hook]`, `[Bridge]`, `[Interlude]`, `[Transition]`, `[Build Up]`, `[Break]`, `[Inst]`, `[Solo]`, `[Outro]`. Required when `is_instrumental` is false. |
| `lyrics_optimizer` | boolean | | `false` | true / false | When true and `lyrics` is empty, auto-generates lyrics from the prompt. |
| `is_instrumental` | boolean | | `false` | true / false | When true, generates vocal-free instrumental music. |
| `audio_setting` | object | | — | see below | Audio configuration settings (object). |
| `audio_setting.sample_rate` | integer | | `44100` | 16000, 24000, 32000, 44100 | Sample rate of generated audio (Hz). |
| `audio_setting.bitrate` | integer | | `256000` | 32000, 64000, 128000, 256000 | Bitrate of generated audio (bps). |
| `audio_setting.format` | string | | `mp3` | mp3, wav, pcm | Output audio format. |

Our wrapper params (not part of the model schema): `out` (required — workdir-relative output path, e.g. `.mp3`), `mock` (optional — test placeholder). This model has no `format_field`, so our `format` wrapper is not used here.

**Limits** — model limits: `prompt` 10–2000 characters; `lyrics` max 3500 characters; output formats mp3 / wav / pcm; sample rate up to 44100 Hz; bitrate up to 256000 bps. Lyrics are required when `is_instrumental` is false.
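A pre-flight check for these limits might look like the sketch below. One point is an interpretation rather than a documented rule: the sketch treats `lyrics_optimizer: true` as satisfying the lyrics requirement, since the optimizer is described as generating lyrics when `lyrics` is empty. `music_inputs` is a hypothetical helper.

```python
SAMPLE_RATES = {16000, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "wav", "pcm"}


def music_inputs(prompt, lyrics="", is_instrumental=False,
                 lyrics_optimizer=False, audio_setting=None):
    """Validate MiniMax Music inputs against the documented limits."""
    if not 10 <= len(prompt) <= 2000:
        raise ValueError("prompt must be 10-2000 characters")
    if len(lyrics) > 3500:
        raise ValueError("lyrics must be at most 3500 characters")
    # Assumption: the optimizer counts as supplying lyrics.
    if not lyrics and not is_instrumental and not lyrics_optimizer:
        raise ValueError("lyrics are required when is_instrumental is false")
    setting = {"sample_rate": 44100, "bitrate": 256000, "format": "mp3"}
    setting.update(audio_setting or {})
    if setting["sample_rate"] not in SAMPLE_RATES:
        raise ValueError("unsupported sample_rate")
    if setting["bitrate"] not in BITRATES:
        raise ValueError("unsupported bitrate")
    if setting["format"] not in FORMATS:
        raise ValueError("unsupported format")
    return {
        "prompt": prompt,
        "lyrics": lyrics,
        "is_instrumental": is_instrumental,
        "lyrics_optimizer": lyrics_optimizer,
        "audio_setting": setting,
    }


print(music_inputs("warm lo-fi hip hop for a rainy evening", is_instrumental=True))
```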

### Audio Concat `audio_concat`

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | `execAudioConcat` → `AudioConcat` (`internal/ffmpeg/audio_concat.go`) |
| MCP route | `audio(action: "concat")` — maps the tool's `tracks[]` arg to the model's `files` field |

**Description:** Concatenate multiple audio files in order. Accepts a mix of input formats — every input is decoded and re-encoded to the target output format, then joined with ffmpeg's concat demuxer (`-c copy`, no second re-encode).

**Parameters** (from YAML `input_schema`, cross-checked against handler):

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| `files` | array of string | yes | — | Ordered list of audio paths (any mix of mp3/wav/aac/flac/ogg). Handler errors if empty; non-string entries rejected. |
| `out` | string | yes | — | Output audio path. |
| `silence_between` | number | no | 0 | Seconds of silence inserted between files (not after the last). Implemented via generated `anullsrc` mono 44.1 kHz segments. |
| `output_format` | string | no | inferred from `out` ext, else mp3 | enum: mp3, aac, wav, flac, ogg. Read by handler ✓. |
| `sample_rate` | integer | no | source rate | Target Hz; applied via `-ar`. Read by handler ✓. |

**Behaviour notes:**
- **Single-file fast path:** with one file and `silence_between <= 0`, if input/output extensions match and no `sample_rate` is given, it byte-copies the file (acts as a pass-through). Otherwise it delegates to `AudioConvert` — i.e. a single file makes this a format converter.
- Codec mapping (via `outputCodecArgs`): wav→pcm_s16le, flac→flac, ogg→libvorbis 192k, aac→aac 192k, default→libmp3lame 192k.
- Concat-list injection is guarded: a file path containing a quote or newline is rejected.
- Returns `outputs.audio` / `outputs.local_path` plus metrics (`num_files`, `total_duration_sec`, `silence_between`).
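The notes above can be mirrored in a short planning sketch: where silence segments go, how the output format is resolved, and which paths are refused. It is a Python approximation and not the Go handler. In particular, treating both `'` and `"` as the rejected quote characters is an assumption, and all function names are hypothetical.

```python
import os

FORMATS = ("mp3", "aac", "wav", "flac", "ogg")


def check_path(path: str) -> str:
    """Refuse paths that could break out of a quoted concat-list entry."""
    if "'" in path or '"' in path or "\n" in path:
        raise ValueError(f"unsafe path for concat list: {path!r}")
    return path


def plan_segments(files, silence_between=0.0):
    """Interleave silence between files, but never after the last one."""
    if not files:
        raise ValueError("files must not be empty")
    plan = []
    for i, path in enumerate(files):
        plan.append(("file", check_path(path)))
        if silence_between > 0 and i < len(files) - 1:
            plan.append(("silence", silence_between))
    return plan


def resolve_format(out: str, output_format=None) -> str:
    """Explicit `output_format` wins, then the `out` extension, else mp3."""
    if output_format:
        return output_format
    ext = os.path.splitext(out)[1].lstrip(".").lower()
    return ext if ext in FORMATS else "mp3"


print(plan_segments(["intro.mp3", "body.wav", "outro.flac"], silence_between=0.5))
print(resolve_format("final.ogg"))
```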

---

### Audio-Only Mix `audio_only_mix`

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 2m |
| Cost | Free (cost_per_unit: 0) |
| Handler | `execAudioOnlyMix` → `AudioOnlyMix` (`internal/ffmpeg/audio_only_mix.go`) |
| MCP route | `audio(action: "mix")` — passes `tracks[]` (and the optional `music` / `music_level`) through |

**Description:** Mix audio files into a single audio file. Two modes: a **flat mix** of 2+ tracks with ffmpeg's `amix` filter, or — when the optional `music` bed is set — a **music-under-voice** mix where `tracks` are the primary program (1+ allowed) and the bed is auto-fit to their length and ducked under them. Unlike `video_audio_mix` (which overlays audio onto a video), this produces a pure audio file with no video track.

**Parameters:**

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| `tracks` | array of string | yes | — | Audio paths. Flat mix: min 2, all at equal level. With `music`: the primary program (e.g. voiceover), min 1. |
| `music` | string | no | — | Optional background music bed. When set, the bed is auto-fit to the tracks' length (trimmed if longer, looped if shorter) and ducked under them. |
| `music_level` | number | no | `-18` | Music bed level in dB relative to the voice (used only with `music`). |
| `out` | string | yes | — | Output audio path. |

**Behaviour notes (code-only, not exposed as params):**
- Flat mix: all tracks are mixed at **equal levels**; output is normalized (`amix=...:normalize=1`) to prevent clipping; output duration equals the **longest** input.
- Music-under-voice: the bed never runs past the voice and never drowns it (ducked at `music_level` dB).
- Output is forced to **stereo** (`-ac 2`).
- For per-layer volume / timing offsets onto a video, use `video_audio_mix` instead.
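To make the two modes more concrete, the sketch below builds an ffmpeg argument list for a flat mix and works out the two numbers a music-under-voice mix needs: the linear gain for a dB level and how many times a short bed must repeat. The `amix` options and `-ac 2` follow the notes above; everything else, including the function names and the exact command layout, is an assumption and may differ from the handler.

```python
import math


def flat_mix_args(tracks, out):
    """ffmpeg arguments for an equal-level, normalized, stereo flat mix."""
    if len(tracks) < 2:
        raise ValueError("a flat mix needs at least 2 tracks")
    args = ["ffmpeg", "-y"]
    for track in tracks:
        args += ["-i", track]
    # duration=longest keeps the output as long as the longest input.
    mix = f"amix=inputs={len(tracks)}:duration=longest:normalize=1"
    return args + ["-filter_complex", mix, "-ac", "2", out]


def db_to_gain(level_db: float) -> float:
    """Convert a dB level to a linear amplitude factor."""
    return 10 ** (level_db / 20)


def loops_needed(voice_sec: float, music_sec: float) -> int:
    """How many copies of the bed are needed to cover the voice."""
    return max(1, math.ceil(voice_sec / music_sec))


print(flat_mix_args(["vo.mp3", "fx.wav"], "mix.mp3"))
print(round(db_to_gain(-18), 3))   # 0.126
print(loops_needed(90, 40))        # 3
```

A bed at the default `music_level` of -18 dB therefore plays at roughly one eighth of the voice amplitude.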

---

### Audio Trim `audio_trim`

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 1m |
| Cost | Free (cost_per_unit: 0) |
| MCP route | `audio(action: "trim")` — maps the tool's `audio` arg to the model's `in` field |

**Description:** Cut an audio file to a start time and optional duration — e.g. shorten a long music bed before mixing, or drop a lead-in/lead-out. Output timestamps are rebased to 0, so the result is a clean seekable clip.

**Parameters:**

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Input audio path (the MCP `trim` action's `audio` argument). |
| `out` | string | yes | — | Output audio path. |
| `start_sec` | number | no | 0 | Where the kept window starts, in seconds (≥ 0). |
| `duration_sec` | number | no | — | Length of the kept window. Omit (or ≤ 0) to keep everything from `start_sec` to the end. |
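One plausible way to express these parameters as an ffmpeg command is sketched below. The handler's real command is not shown on this page, so the use of `-ss` before `-i` (input seeking, which yields timestamps starting at 0) and `-t` for the duration is an assumption, as is the name `trim_args`.

```python
def trim_args(inp, out, start_sec=0.0, duration_sec=None):
    """ffmpeg arguments for keeping a window of an audio file."""
    if start_sec < 0:
        raise ValueError("start_sec must be >= 0")
    args = ["ffmpeg", "-y", "-ss", str(start_sec), "-i", inp]
    # An omitted or non-positive duration keeps everything to the end.
    if duration_sec is not None and duration_sec > 0:
        args += ["-t", str(duration_sec)]
    return args + [out]


print(trim_args("bed.mp3", "bed_30s.mp3", start_sec=5, duration_sec=30))
print(trim_args("bed.mp3", "bed_tail.mp3", start_sec=5))
```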

---

### Audio Convert `audio_convert`

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | `execAudioConvert` → `AudioConvert` (`internal/ffmpeg/audio_convert.go`) |
| MCP route | **None** — internal-only (REST `POST /v1/jobs/audio_convert` or pipeline step). No `audio(...)` action routes here. |

**Description:** Convert an audio file between formats, change sample rate, and/or adjust bitrate. Input format is auto-detected; output is chosen by the format key (see mismatch below) or inferred from the `out` extension.

**Parameters** (from YAML — see mismatch flag):

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Input audio path. |
| `out` | string | yes | — | Output audio path; format inferred from extension if no format key set. |
| `output_format` | string | no | inferred from `out` ext | enum: mp3, mp3_128, mp3_320, aac, aac_256, wav, wav_48k, flac, ogg, opus. **⚠ See mismatch.** |
| `sample_rate` | integer | no | original | Target Hz (e.g. 44100, 48000); applied via `-ar`. Read by handler ✓. |

**⚠ YAML ↔ handler mismatch (important):** The YAML declares the format selector as **`output_format`**, but `execAudioConvert` reads **`inputs["format"]`** (executor.go:250), not `output_format`. Consequences:
- A caller passing `output_format` exactly as the YAML documents will have it **silently ignored**; the handler falls back to inferring the format from the `out` file extension.
- The extended enum values that have no matching extension — `mp3_128`, `mp3_320`, `aac_256`, `wav_48k`, `opus` — are only reachable by passing the **undocumented** key `format` (e.g. `format: "mp3_320"`). Format/bitrate table (handler `audioCodecs`): mp3=192k, mp3_128=128k, mp3_320=320k, aac=192k, aac_256=256k, wav/wav_48k=pcm_s16le (wav_48k forces `-ar 48000`), flac=lossless, ogg=libvorbis 192k, opus=libopus 128k.
- Recommendation: rename the YAML field to `format`, update the handler to also read `output_format` (as `audio_concat` does), or have the MCP layer or handler alias the two keys.

**Behaviour notes:** Unknown format → error listing valid keys. Returns `outputs.audio` / `outputs.local_path` plus metrics (`input_duration_sec`, `output_duration_sec`, `format`, `codec`).
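The mismatch is easier to see in a small model of the lookup. The sketch below is a Python approximation of the behaviour described above and not the Go handler: it reads only `format`, ignores `output_format`, and falls back to the `out` extension. The bitrates come from the table above; the codec names for mp3, aac, and flac, the set of extensions used for inference, and the function name are assumptions.

```python
import os

# format key -> (codec, bitrate); bitrates as documented, some codec names assumed.
AUDIO_CODECS = {
    "mp3": ("libmp3lame", "192k"),
    "mp3_128": ("libmp3lame", "128k"),
    "mp3_320": ("libmp3lame", "320k"),
    "aac": ("aac", "192k"),
    "aac_256": ("aac", "256k"),
    "wav": ("pcm_s16le", None),
    "wav_48k": ("pcm_s16le", None),  # also forces -ar 48000
    "flac": ("flac", None),
    "ogg": ("libvorbis", "192k"),
    "opus": ("libopus", "128k"),
}
# Assumption: only these extensions map to a format key.
EXTENSION_FORMATS = {"mp3", "aac", "wav", "flac", "ogg"}


def resolve_convert_format(inputs: dict) -> str:
    """Pick the format key the way the handler is described to."""
    fmt = inputs.get("format")  # note: `output_format` is never read
    if not fmt:
        ext = os.path.splitext(inputs["out"])[1].lstrip(".").lower()
        fmt = ext if ext in EXTENSION_FORMATS else ""
    if fmt not in AUDIO_CODECS:
        valid = ", ".join(sorted(AUDIO_CODECS))
        raise ValueError(f"unknown format {fmt!r}; valid keys: {valid}")
    return fmt


# The documented key is ignored, so the extension decides.
print(resolve_convert_format({"out": "a.mp3", "output_format": "mp3_320"}))  # mp3
# The undocumented key is honoured.
print(resolve_convert_format({"out": "a.mp3", "format": "mp3_320"}))         # mp3_320
```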

---

### Audio Tail Fade `tail_fade`

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | `execTailFade` → `TailFade` (`internal/ffmpeg/tail_fade.go`) |
| MCP route | **None** — internal-only (REST `POST /v1/jobs/tail_fade` or pipeline step). No `audio(...)` action routes here. |

**Description:** Add a silence pad and a fade-out at the end of an audio file to prevent an abrupt ending (the "audio cuts off" bug). Intended to run after voiceover generation, before assembly. Purely parameter-driven — no prompt.

**Parameters:**

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| `in` | string | yes | — | Input audio path (workdir-relative). |
| `out` | string | yes | — | Output audio path. |
| `pad_sec` | number | no | 0.8 | Seconds of trailing silence added (ffmpeg `apad=pad_dur`). |
| `fade_sec` | number | no | 0.6 | Fade-out duration (ffmpeg `afade=t=out`). |

**Behaviour notes:**
- Defaults are applied whenever the value is `<= 0`, so passing `0` yields the default (0.8 / 0.6), not a true zero. As a result, this model cannot be used to disable the pad or the fade.
- The fade start point is computed internally as `input_duration + 0.1s` — it is not a parameter.
- Output is encoded with `-q:a 2` (for MP3, VBR at roughly 190 kbps); the output format follows the `out` extension.
- Returns `outputs.audio` / `outputs.local_path` plus metrics (`input_duration_sec`, `output_duration_sec`, `pad_sec`, `fade_sec`, `fade_start_sec`).
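The default handling and the fade start point described above can be sketched as a filter-string builder. `apad=pad_dur` and `afade=t=out` are the filters named in the parameter table; the way they are chained here and the name `tail_fade_filter` are assumptions and not the handler's actual code.

```python
def tail_fade_filter(input_duration_sec, pad_sec=0.0, fade_sec=0.0):
    """Build an ffmpeg audio filter chain for a trailing pad plus fade-out."""
    # Non-positive values fall back to the defaults, so 0 never means "off".
    pad = pad_sec if pad_sec > 0 else 0.8
    fade = fade_sec if fade_sec > 0 else 0.6
    # The start point is derived from the input length and is not configurable.
    fade_start = round(input_duration_sec + 0.1, 3)
    return f"apad=pad_dur={pad},afade=t=out:st={fade_start}:d={fade}"


print(tail_fade_filter(10.0))
print(tail_fade_filter(10.0, pad_sec=1.5, fade_sec=1.0))
```

The resulting string would typically be passed to ffmpeg through `-af`.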