Audio models
Speech, sound effects, and music (model input schemas), plus local audio processing (our ffmpeg implementation, free).
Generations are charged in credits (see Credits & plans). Every generation model also accepts
mock: true for a free placeholder result.
ElevenLabs TTS v3 elevenlabs_tts_v3
Expressive text-to-speech with inline audio-tag emotional control and 70+ language support, powered by ElevenLabs' Eleven v3 model.
Call it via — audio(action: "speak") (MCP audio tool) · raw: POST /v1/jobs/elevenlabs_tts_v3
| Cost | 20 cr per 1,000 characters |
| Mode / timeout | sync / 60s |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| text | string | ✓ | — | — | Text to convert to speech. Supports inline audio tags like [laughs], [whispers], [excited]. |
| voice | string | | Rachel | e.g. Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill (or a voice ID) | Voice name or ID. |
| stability | float | | 0.5 | 0–1 | Voice stability. Lower = more expressive variation; higher = more consistent delivery. |
| similarity_boost | float | | 0.75 | 0–1 | How closely the output matches the reference voice. |
| speed | float | | 1 | — | Playback speed multiplier. |
| language_code | string | | — | ISO 639-1 (e.g. en, ru, es, fr, de, ja, ko, zh) | Forces a specific output language. |
| apply_text_normalization | enum | | auto | auto, on, off | Controls spelling-out of numbers, abbreviations, etc. |
| seed | int | | — | — | Random seed for reproducibility. |
| timestamps | bool | | false | — | When true, returns per-word timestamps in the response. |
| output_format | enum | | mp3_44100_128 | mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 | Output codec, sample rate, and bitrate. |
Our wrapper params (not part of the model schema): out (required — workdir-relative output path, .mp3) and mock (optional — test placeholder, no real generation). This model does not use the format→size mapping (format_field is empty).
Limits — Pricing is 20 cr per 1,000 characters (a 500-char paragraph = 10 cr; a 10,000-char story = 200 cr). Supported output formats: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), µ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps). 70+ languages supported. No hard maximum character count is published.
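As a rough illustration of the pricing rule and the wrapper params above, the sketch below estimates the charge for a piece of text and assembles a request body for POST /v1/jobs/elevenlabs_tts_v3. The helper names (estimate_credits, build_speak_body) are hypothetical, the flat JSON layout is an assumption, and the estimate is purely linear because server-side rounding is not documented here.

```python
import json

CREDITS_PER_1000_CHARS = 20  # from the Cost row for elevenlabs_tts_v3


def estimate_credits(text: str) -> float:
    """Linear estimate of the charge; any server-side rounding is not documented."""
    return len(text) * CREDITS_PER_1000_CHARS / 1000


def build_speak_body(text: str, out: str, voice: str = "Rachel", mock: bool = False) -> dict:
    """Assemble a request body. A flat JSON layout is an assumption."""
    if not text:
        raise ValueError("text is required")
    body = {"text": text, "voice": voice, "out": out}
    if mock:
        body["mock"] = True  # free placeholder result, no real generation
    return body


body = build_speak_body("[whispers] The build passed.", out="audio/line01.mp3", mock=True)
print(json.dumps(body))
print(estimate_credits("x" * 500))  # 10.0, matching the 500-character example
```

Setting mock to true while wiring up a pipeline avoids spending credits until the text and voice are settled.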
ElevenLabs TTS (direct) elevenlabs_tts_direct
Converts text into speech using a chosen ElevenLabs voice_id (cloned, linked, or library voice) and returns an audio file.
Call it via — audio(speak, actor_id=…) (routes a configured actor's voice through this model; plain audio(speak) without actor_id uses elevenlabs_tts_v3 instead). Also used internally by video(scene) for per-line narration. · raw: POST /v1/jobs/elevenlabs_tts_direct
| Cost | 20 cr per call |
| Mode / timeout | sync / 60s |
Parameters — the model's input schema (voice_id is a path parameter; the rest are request-body fields):
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| voice_id | string | ✓ | — | — | Path param. ID of the voice to use (from Get Voices). |
| text | string | ✓ | — | — | The text that will be converted into speech. |
| model_id | string | | eleven_multilingual_v2 | any TTS-capable model id | Model identifier; must support text-to-speech. |
| language_code | string \| null | | null | ISO 639-1 | Enforces a language for the model and text normalization. |
| voice_settings | object \| null | | null | see sub-properties | Per-request overrides of the voice's stored settings. |
| voice_settings.stability | number | | 0.5 | 0.0–1.0 | How stable the voice is / randomness between generations. |
| voice_settings.similarity_boost | number | | 0.75 | 0.0–1.0 | How closely the AI adheres to the original voice. |
| voice_settings.style | number | | 0 | 0.0–1.0 | Style exaggeration of the voice. |
| voice_settings.use_speaker_boost | boolean | | true | true/false | Boosts similarity to the original speaker. |
| voice_settings.speed | number | | 1.0 | ~0.7–1.2 | Playback speed; <1 slows, >1 speeds up. |
| seed | integer \| null | | null | 0–4294967295 | Best-effort deterministic sampling. |
| previous_text | string \| null | | null | — | Text preceding this request, for continuity. |
| next_text | string \| null | | null | — | Text following this request, for continuity. |
| previous_request_ids | string[] \| null | | null | max 3 | Request ids of prior samples, for continuity. |
| next_request_ids | string[] \| null | | null | max 3 | Request ids of later samples, for continuity. |
| pronunciation_dictionary_locators | object[] \| null | | null | max 3 | Pronunciation dictionary locators (id, version_id). |
| apply_text_normalization | enum | | auto | auto, on, off | Controls number/date spell-out normalization. |
| apply_language_text_normalization | boolean | | false | true/false | Language-specific normalization (Japanese only; raises latency). |
| output_format | enum (query) | | mp3_44100_128 | mp3_22050_32, mp3_44100_32/64/96/128/192, pcm_8000/16000/22050/24000/44100, ulaw_8000, alaw_8000, opus_48000_*, etc. (28 values) | Query param. codec_samplerate_bitrate; mp3_192 needs Creator+, pcm/wav 44.1kHz needs Pro+. |
| enable_logging | boolean (query) | | true | true/false | Query param. false = zero-retention mode (enterprise only). |
Our wrapper params (not part of the model schema): out (required — output audio filename, mp3) and mock (optional — test placeholder). This model has no format→size mapping (format_field is empty in our YAML).
Limits — model limits: seed 0–4294967295; up to 3 pronunciation_dictionary_locators, 3 previous_request_ids, 3 next_request_ids per request; output formats limited to the 28 output_format enum values (mp3 192kbps requires Creator tier or above; PCM/WAV at 44.1kHz requires Pro tier or above). No hard maximum text length is published for this endpoint, so no character cap is asserted here (our YAML's "keep under 5000 characters" is guidance, not a confirmed limit).
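The numeric limits above are easy to check before a job is submitted. The following is a minimal client-side sketch, not part of the service: check_direct_limits is a hypothetical helper, and it only covers the seed range and the three list-length caps stated in this section.

```python
MAX_SEED = 4294967295   # upper bound of the documented seed range
MAX_LIST_ITEMS = 3      # cap for the three list-valued fields

LIMITED_LISTS = (
    "previous_request_ids",
    "next_request_ids",
    "pronunciation_dictionary_locators",
)


def check_direct_limits(body: dict) -> list:
    """Return a list of human-readable problems; an empty list means no limit is exceeded."""
    problems = []
    if not body.get("text"):
        problems.append("text is required")
    seed = body.get("seed")
    if seed is not None and not (0 <= seed <= MAX_SEED):
        problems.append("seed must be within 0-4294967295")
    for key in LIMITED_LISTS:
        items = body.get(key) or []  # null and missing both count as empty
        if len(items) > MAX_LIST_ITEMS:
            problems.append(f"{key}: at most {MAX_LIST_ITEMS} entries")
    return problems


print(check_direct_limits({"text": "Hello", "seed": 42}))  # []
```

No text-length check is included, since no hard character cap is confirmed for this endpoint.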
ElevenLabs Sound Effects elevenlabs_sfx
Generate sound effects (foley, ambience, UI, impacts) from a text description using ElevenLabs' Sound Effects V2 model.
Call it via — audio(sfx) (the audio MCP tool with action: "sfx"; pass your description in prompt, which the worker maps to the model's text field) · raw: POST /v1/jobs/elevenlabs_sfx
| Cost | Billed per second of audio |
| Mode / timeout | sync / 60s |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| text | string | ✓ | — | max 450 characters | The text describing the sound effect to generate. |
| duration_seconds | number | | none (model decides) | 0.5–22 (nullable) | Duration in seconds. If omitted/null, optimal duration is determined from the prompt. |
| prompt_influence | number | | 0.3 | 0–1 | How closely to follow the prompt. Higher values mean less variation. |
| output_format | string (enum) | | mp3_44100_128 | mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 | Output audio format, as codec_sampleRate_bitrate. |
| loop | boolean | | false | true / false | Whether to create a sound effect that loops smoothly. |
Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3) and mock (optional — test placeholder). No format mapping applies to this model (format_field is empty).
Limits — model limits:
- text: max 450 characters.
- duration_seconds: 0.5–22 seconds.
- prompt_influence: 0–1.
- Output codecs: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), μ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps).
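A request body that respects these limits could be assembled as in the sketch below. build_sfx_body is a hypothetical helper, and it assumes the raw route takes the model's text field directly (the MCP tool takes prompt instead and maps it to text).

```python
def build_sfx_body(text, out, duration_seconds=None, prompt_influence=0.3, loop=False):
    """Validate the documented SFX limits and return a request body (flat layout assumed)."""
    if not text or len(text) > 450:
        raise ValueError("text must be 1-450 characters")
    if duration_seconds is not None and not (0.5 <= duration_seconds <= 22):
        raise ValueError("duration_seconds must be within 0.5-22")
    if not (0 <= prompt_influence <= 1):
        raise ValueError("prompt_influence must be within 0-1")
    body = {"text": text, "prompt_influence": prompt_influence, "loop": loop, "out": out}
    if duration_seconds is not None:
        # Leaving the key out lets the model pick the duration from the prompt.
        body["duration_seconds"] = duration_seconds
    return body


print(build_sfx_body("Heavy wooden door creaking open", out="sfx/door.mp3", duration_seconds=3))
```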
Minimax Music v2.6 minimax_music
MiniMax Music 2.6 creates complete tracks with singing, backing music, and detailed arrangements from a style description and optional lyrics.
Call it via — audio(music) MCP tool · raw: POST /v1/jobs/minimax_music
| Cost | 30 cr per call |
| Mode / timeout | webhook / 8m (from our YAML) |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | ✓ | — | 10–2000 chars | Description of the music style, mood, genre, and scenario. |
| lyrics | string | | "" | max 3500 chars | Song lyrics. Use \n to separate lines. Supports structure tags: [Intro], [Verse], [Pre Chorus], [Chorus], [Post Chorus], [Hook], [Bridge], [Interlude], [Transition], [Build Up], [Break], [Inst], [Solo], [Outro]. Required when is_instrumental is false. |
| lyrics_optimizer | boolean | | false | true / false | When true and lyrics is empty, auto-generates lyrics from the prompt. |
| is_instrumental | boolean | | false | true / false | When true, generates vocal-free instrumental music. |
| audio_setting | object | | — | see below | Audio configuration settings (object). |
| audio_setting.sample_rate | integer | | 44100 | 16000, 24000, 32000, 44100 | Sample rate of generated audio (Hz). |
| audio_setting.bitrate | integer | | 256000 | 32000, 64000, 128000, 256000 | Bitrate of generated audio (bps). |
| audio_setting.format | string | | mp3 | mp3, wav, pcm | Output audio format. |
Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3), mock (optional — test placeholder). This model has no format_field, so our format wrapper is not used here.
Limits — model limits: prompt 10–2000 characters; lyrics max 3500 characters; output formats mp3 / wav / pcm; sample rate up to 44100 Hz; bitrate up to 256000 bps. Lyrics are required when is_instrumental is false.
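The sketch below shows one way to enforce these limits on the client before calling POST /v1/jobs/minimax_music. build_music_body is a hypothetical helper. Treating lyrics_optimizer as a substitute for hand-written lyrics is an assumption drawn from reading the lyrics and lyrics_optimizer rows together; the service may be stricter.

```python
def build_music_body(prompt, out, lyrics="", is_instrumental=False, lyrics_optimizer=False):
    """Validate prompt/lyrics limits and return a request body (flat layout assumed)."""
    if not (10 <= len(prompt) <= 2000):
        raise ValueError("prompt must be 10-2000 characters")
    if len(lyrics) > 3500:
        raise ValueError("lyrics must be at most 3500 characters")
    # Assumption: lyrics_optimizer can stand in for missing lyrics on a vocal track.
    if not is_instrumental and not lyrics and not lyrics_optimizer:
        raise ValueError("lyrics are required when is_instrumental is false")
    return {
        "prompt": prompt,
        "lyrics": lyrics,
        "is_instrumental": is_instrumental,
        "lyrics_optimizer": lyrics_optimizer,
        # Documented defaults, written out explicitly for clarity.
        "audio_setting": {"sample_rate": 44100, "bitrate": 256000, "format": "mp3"},
        "out": out,
    }


body = build_music_body("warm lo-fi piano loop, rainy evening", out="music/bed.mp3", is_instrumental=True)
print(body["audio_setting"])
```

Because this model runs in webhook mode with an 8-minute timeout, validating locally avoids waiting on a job that was never going to be accepted.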
Audio Concat audio_concat
| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioConcat → AudioConcat (internal/ffmpeg/audio_concat.go) |
| MCP route | audio(action: "concat") — maps the tool's tracks[] arg to the model's files field |
Description: Concatenate multiple audio files in order. Accepts a mix of input formats — every input is decoded and re-encoded to the target output format, then joined with ffmpeg's concat demuxer (-c copy, no second re-encode).
Parameters (from YAML input_schema, cross-checked against handler):
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| files | array of string | yes | — | Ordered list of audio paths (any mix of mp3/wav/aac/flac/ogg). Handler errors if empty; non-string entries rejected. |
| out | string | yes | — | Output audio path. |
| silence_between | number | no | 0 | Seconds of silence inserted between files (not after the last). Implemented via generated anullsrc mono 44.1 kHz segments. |
| output_format | string | no | inferred from out ext, else mp3 | enum: mp3, aac, wav, flac, ogg. Read by handler ✓. |
| sample_rate | integer | no | source rate | Target Hz; applied via -ar. Read by handler ✓. |
Behaviour notes:
- Single-file fast path: with one file and silence_between <= 0, if input/output extensions match and no sample_rate is given, it byte-copies the file (acts as a pass-through). Otherwise it delegates to AudioConvert, i.e. a single file makes this a format converter.
- Codec mapping (via outputCodecArgs): wav→pcm_s16le, flac→flac, ogg→libvorbis 192k, aac→aac 192k, default→libmp3lame 192k.
- Concat-list injection is guarded: a file path containing a quote or newline is rejected.
- Returns outputs.audio / outputs.local_path plus metrics (num_files, total_duration_sec, silence_between).
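The decision logic in these notes can be sketched as follows. This is an illustrative Python model of the described behaviour, not the Go handler itself: plan_concat and its return shape are hypothetical, and rejecting both single and double quotes is an assumption, since the notes only say "a quote".

```python
import os

# Codec mapping as described for outputCodecArgs.
CODEC_ARGS = {
    "wav": ["-c:a", "pcm_s16le"],
    "flac": ["-c:a", "flac"],
    "ogg": ["-c:a", "libvorbis", "-b:a", "192k"],
    "aac": ["-c:a", "aac", "-b:a", "192k"],
}
DEFAULT_CODEC_ARGS = ["-c:a", "libmp3lame", "-b:a", "192k"]
KNOWN_FORMATS = {"mp3", "aac", "wav", "flac", "ogg"}


def ext_of(path):
    return os.path.splitext(path)[1].lstrip(".").lower()


def plan_concat(files, out, silence_between=0, output_format=None, sample_rate=None):
    """Return which path the request would take: copy, convert, or concat."""
    if not files:
        raise ValueError("files must not be empty")
    for path in files:
        if any(ch in path for ch in ("'", '"', "\n")):
            raise ValueError("path contains a quote or newline")
    fmt = output_format or ext_of(out)
    if fmt not in KNOWN_FORMATS:
        fmt = "mp3"  # documented fallback
    args = list(CODEC_ARGS.get(fmt, DEFAULT_CODEC_ARGS))
    if sample_rate is not None:
        args += ["-ar", str(sample_rate)]
    if len(files) == 1 and silence_between <= 0:
        if ext_of(files[0]) == ext_of(out) and sample_rate is None:
            return {"mode": "copy", "format": fmt, "codec_args": []}
        return {"mode": "convert", "format": fmt, "codec_args": args}
    return {"mode": "concat", "format": fmt, "codec_args": args}


print(plan_concat(["intro.wav", "body.mp3"], "episode.flac"))
```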
Audio-Only Mix audio_only_mix
| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 2m |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioOnlyMix → AudioOnlyMix (internal/ffmpeg/audio_only_mix.go) |
| MCP route | audio(action: "mix") — passes tracks[] (and the optional music / music_level) through |
Description: Mix audio files into a single audio file. Two modes: a flat mix of 2+ tracks with ffmpeg's amix filter, or — when the optional music bed is set — a music-under-voice mix where tracks are the primary program (1+ allowed) and the bed is auto-fit to their length and ducked under them. Unlike video_audio_mix (which overlays audio onto a video), this produces a pure audio file with no video track.
Parameters:
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| tracks | array of string | yes | — | Audio paths. Flat mix: min 2, all at equal level. With music: the primary program (e.g. voiceover), min 1. |
| music | string | no | — | Optional background music bed. When set, the bed is auto-fit to the tracks' length (trimmed if longer, looped if shorter) and ducked under them. |
| music_level | number | no | -18 | Music bed level in dB relative to the voice (used only with music). |
| out | string | yes | — | Output audio path. |
Behaviour notes (code-only, not exposed as params):
- Flat mix: all tracks are mixed at equal levels; output is normalized (amix=...:normalize=1) to prevent clipping; output duration equals the longest input.
- Music-under-voice: the bed never runs past the voice and never drowns it (ducked at music_level dB).
- Output is forced to stereo (-ac 2).
- For per-layer volume / timing offsets onto a video, use video_audio_mix instead.
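For readers who want to reproduce the flat mix locally, the sketch below builds an ffmpeg argument list matching the notes above (amix with normalize=1, longest-input duration, stereo output). The handler's exact command line is not shown in this section, so flat_mix_args is a hypothetical approximation; the music-under-voice mode is not covered.

```python
def flat_mix_args(tracks, out):
    """Build an ffmpeg argv for an equal-level mix of two or more tracks."""
    if len(tracks) < 2:
        raise ValueError("flat mix needs at least 2 tracks")
    args = ["ffmpeg", "-y"]
    for track in tracks:
        args += ["-i", track]
    # duration=longest keeps the output as long as the longest input;
    # normalize=1 scales the inputs so the sum does not clip.
    mix = f"amix=inputs={len(tracks)}:duration=longest:normalize=1"
    args += ["-filter_complex", mix, "-ac", "2", out]
    return args


print(" ".join(flat_mix_args(["voice.mp3", "rain.wav"], "mix.mp3")))
```

The list form can be passed straight to subprocess.run without shell quoting.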
Audio Trim audio_trim
| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 1m |
| Cost | Free (cost_per_unit: 0) |
| MCP route | audio(action: "trim") — maps the tool's audio arg to the model's in field |
Description: Cut an audio file to a start time and optional duration — e.g. shorten a long music bed before mixing, or drop a lead-in/lead-out. Output timestamps are rebased to 0, so the result is a clean seekable clip.
Parameters:
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | — | Input audio path (the MCP trim action's audio argument). |
| out | string | yes | — | Output audio path. |
| start_sec | number | no | 0 | Where the kept window starts, in seconds (≥ 0). |
| duration_sec | number | no | — | Length of the kept window. Omit (or ≤ 0) to keep everything from start_sec to the end. |
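The parameter semantics above map naturally onto ffmpeg's atrim and asetpts filters. How the handler actually trims is not documented here, so the following is only one plausible sketch: trim_filter is a hypothetical helper that builds a filter string which keeps the window and rebases timestamps to 0.

```python
def trim_filter(start_sec=0.0, duration_sec=None):
    """Build an ffmpeg audio filter for the kept window, with timestamps rebased to 0."""
    if start_sec < 0:
        raise ValueError("start_sec must be >= 0")
    parts = [f"start={start_sec:g}"]
    # Omitted or <= 0 means: keep everything from start_sec to the end.
    if duration_sec is not None and duration_sec > 0:
        parts.append(f"duration={duration_sec:g}")
    return "atrim=" + ":".join(parts) + ",asetpts=PTS-STARTPTS"


print(trim_filter(2.5, 10))  # atrim=start=2.5:duration=10,asetpts=PTS-STARTPTS
print(trim_filter(5, 0))     # atrim=start=5,asetpts=PTS-STARTPTS
```

The resulting string would be passed to ffmpeg with -af, for example to shorten a music bed before handing it to the mix action.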
Audio Convert audio_convert
| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioConvert → AudioConvert (internal/ffmpeg/audio_convert.go) |
| MCP route | None — internal-only (REST POST /v1/jobs/audio_convert or pipeline step). No audio(...) action routes here. |
Description: Convert an audio file between formats, change sample rate, and/or adjust bitrate. Input format is auto-detected; output is chosen by the format key (see mismatch below) or inferred from the out extension.
Parameters (from YAML — see mismatch flag):
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | — | Input audio path. |
| out | string | yes | — | Output audio path; format inferred from extension if no format key set. |
| output_format | string | no | inferred from out ext | enum: mp3, mp3_128, mp3_320, aac, aac_256, wav, wav_48k, flac, ogg, opus. ⚠ See mismatch. |
| sample_rate | integer | no | original | Target Hz (e.g. 44100, 48000); applied via -ar. Read by handler ✓. |
⚠ YAML ↔ handler mismatch (important): The YAML declares the format selector as output_format, but execAudioConvert reads inputs["format"] (executor.go:250), not output_format. Consequences:
- A caller passing output_format exactly as the YAML documents will have it silently ignored; the handler falls back to inferring the format from the out file extension.
- The extended enum values that have no matching extension (mp3_128, mp3_320, aac_256, wav_48k, opus) are only reachable by passing the undocumented key format (e.g. format: "mp3_320"). Format/bitrate table (handler audioCodecs): mp3=192k, mp3_128=128k, mp3_320=320k, aac=192k, aac_256=256k, wav/wav_48k=pcm_s16le (wav_48k forces -ar 48000), flac=lossless, ogg=libvorbis 192k, opus=libopus 128k.
- Recommendation: either rename the YAML field to format, or update the handler to also read output_format (as audio_concat does), or have the MCP/handler alias the two keys.
Behaviour notes: Unknown format → error listing valid keys. Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, format, codec).
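Until the key mismatch is resolved, a caller can work around it by sending the selector under format. The sketch below is a hypothetical request-body builder for POST /v1/jobs/audio_convert; the flat JSON layout is an assumption, and the table simply restates the audioCodecs values listed above.

```python
# Format keys and what they select, restated from the handler's audioCodecs table.
AUDIO_CODECS = {
    "mp3": "mp3 192k",
    "mp3_128": "mp3 128k",
    "mp3_320": "mp3 320k",
    "aac": "aac 192k",
    "aac_256": "aac 256k",
    "wav": "pcm_s16le",
    "wav_48k": "pcm_s16le, forces -ar 48000",
    "flac": "lossless",
    "ogg": "libvorbis 192k",
    "opus": "libopus 128k",
}


def build_convert_body(src, out, fmt=None, sample_rate=None):
    """Return a body that uses the key the handler actually reads."""
    body = {"in": src, "out": out}
    if fmt is not None:
        if fmt not in AUDIO_CODECS:
            raise ValueError(f"unknown format {fmt!r}; valid keys: {sorted(AUDIO_CODECS)}")
        body["format"] = fmt  # "output_format" would be silently ignored
    if sample_rate is not None:
        body["sample_rate"] = sample_rate
    return body


print(build_convert_body("voice.wav", "voice.mp3", fmt="mp3_320"))
```

When fmt is omitted, the body relies on the extension of out, which is the handler's documented fallback.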
Audio Tail Fade tail_fade
| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execTailFade → TailFade (internal/ffmpeg/tail_fade.go) |
| MCP route | None — internal-only (REST POST /v1/jobs/tail_fade or pipeline step). No audio(...) action routes here. |
Description: Add a silence pad and a fade-out at the end of an audio file to prevent an abrupt ending (the "audio cuts off" bug). Intended to run after voiceover generation, before assembly. Purely parameter-driven — no prompt.
Parameters:
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | — | Input audio path (workdir-relative). |
| out | string | yes | — | Output audio path. |
| pad_sec | number | no | 0.8 | Seconds of trailing silence added (ffmpeg apad=pad_dur). |
| fade_sec | number | no | 0.6 | Fade-out duration (ffmpeg afade=t=out). |
Behaviour notes:
- Defaults are applied when the value is <= 0, so passing 0 yields the default (0.8 / 0.6), not a true zero. Padding and fade therefore cannot be disabled by passing 0.
- The fade start point is computed internally as input_duration + 0.1s; it is not a parameter.
- Output encoded with -q:a 2 (VBR ~190 kbps mp3-class quality, format from out ext).
- Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, pad_sec, fade_sec, fade_start_sec).
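The default fallback and the fixed fade start can be illustrated with a small sketch. tail_fade_filter is a hypothetical helper; the way the two filters are chained into one string is an assumption, since only the individual filter names (apad=pad_dur, afade=t=out) are given above.

```python
DEFAULT_PAD_SEC = 0.8
DEFAULT_FADE_SEC = 0.6


def tail_fade_filter(input_duration_sec, pad_sec=0.0, fade_sec=0.0):
    """Build a filter string using the documented defaults and fade start."""
    # Values <= 0 fall back to the defaults, so 0 does not disable anything.
    pad = pad_sec if pad_sec > 0 else DEFAULT_PAD_SEC
    fade = fade_sec if fade_sec > 0 else DEFAULT_FADE_SEC
    fade_start = input_duration_sec + 0.1  # computed internally, not a parameter
    return f"apad=pad_dur={pad:g},afade=t=out:st={fade_start:g}:d={fade:g}"


print(tail_fade_filter(10))          # apad=pad_dur=0.8,afade=t=out:st=10.1:d=0.6
print(tail_fade_filter(10, 1.5, 1))  # apad=pad_dur=1.5,afade=t=out:st=10.1:d=1
```

Note that both calls start the fade at the same point: only the input duration moves it.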