Audio models

This page covers the speech, sound-effect, and music generation models (with each model's input schema), plus local audio processing, which runs on our ffmpeg implementation and is free.

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

ElevenLabs TTS v3 `elevenlabs_tts_v3`

Expressive text-to-speech with inline audio-tag emotional control and 70+ language support, powered by ElevenLabs' Eleven v3 model.

Call it via — audio(action: "speak") (MCP audio tool) · raw: POST /v1/jobs/elevenlabs_tts_v3


Cost	20 cr per 1,000 characters
Mode / timeout	sync / 60s

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`text`	string	✓	—	—	Text to convert to speech. Supports inline audio tags like `[laughs]`, `[whispers]`, `[excited]`.
`voice`	string		`Rachel`	e.g. Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill (or a voice ID)	Voice name or ID.
`stability`	float		`0.5`	0–1	Voice stability. Lower = more expressive variation; higher = more consistent delivery.
`similarity_boost`	float		`0.75`	0–1	How closely the output matches the reference voice.
`speed`	float		`1`	—	Playback speed multiplier.
`language_code`	string		—	ISO 639-1 (e.g. en, ru, es, fr, de, ja, ko, zh)	Forces a specific output language.
`apply_text_normalization`	enum		`auto`	`auto`, `on`, `off`	Controls spelling-out of numbers, abbreviations, etc.
`seed`	int		—	—	Random seed for reproducibility.
`timestamps`	bool		`false`	—	When true, returns per-word timestamps in the response.
`output_format`	enum		`mp3_44100_128`	mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192	Output codec, sample rate, and bitrate.

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, .mp3) and mock (optional — test placeholder, no real generation). This model does not use the format→size mapping (format_field is empty).

Limits — Pricing is 20 cr per 1,000 characters (a 500-char paragraph = 10 cr; a 10,000-char story = 200 cr). Supported output formats: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), µ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps). 70+ languages supported. No hard maximum character count is published.
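The pricing arithmetic above can be sketched in Python. This is a rough estimate only: it assumes billing is strictly pro-rata on the character count of text (the two worked figures above are consistent with that) and applies no rounding, because the rounding rule is not documented here. The function name is hypothetical.

```python
CREDITS_PER_1000_CHARS = 20  # from the Cost row for elevenlabs_tts_v3


def estimate_tts_v3_credits(text: str) -> float:
    """Pro-rata credit estimate; assumes no minimum charge and no rounding."""
    return len(text) * CREDITS_PER_1000_CHARS / 1000


print(estimate_tts_v3_credits("x" * 500))     # 10.0
print(estimate_tts_v3_credits("x" * 10_000))  # 200.0
```

Whether inline audio tags such as `[laughs]` count toward the billed characters is not stated, so this sketch counts them.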

ElevenLabs TTS (direct) `elevenlabs_tts_direct`

Converts text into speech using a chosen ElevenLabs voice_id (cloned, linked, or library voice) and returns an audio file.

Call it via — audio(speak, actor_id=…) (routes a configured actor's voice through this model; plain audio(speak) without actor_id uses elevenlabs_tts_v3 instead). Also used internally by video(scene) for per-line narration. · raw: POST /v1/jobs/elevenlabs_tts_direct


Cost	20 cr per call
Mode / timeout	sync / 60s

Parameters — the model's input schema (voice_id is a path parameter; the rest are request-body fields):

Param	Type	Required	Default	Allowed / range	Description
`voice_id`	string	✓	—	—	Path param. ID of the voice to use (from Get Voices).
`text`	string	✓	—	—	The text that will be converted into speech.
`model_id`	string		`eleven_multilingual_v2`	any TTS-capable model id	Model identifier; must support text-to-speech.
`language_code`	string \| null		null	ISO 639-1	Enforces a language for the model and text normalization.
`voice_settings`	object \| null		null	see sub-properties	Per-request overrides of the voice's stored settings.
`voice_settings.stability`	number		0.5	0.0–1.0	How stable the voice is / randomness between generations.
`voice_settings.similarity_boost`	number		0.75	0.0–1.0	How closely the AI adheres to the original voice.
`voice_settings.style`	number		0	0.0–1.0	Style exaggeration of the voice.
`voice_settings.use_speaker_boost`	boolean		true	true/false	Boosts similarity to the original speaker.
`voice_settings.speed`	number		1.0	~0.7–1.2	Playback speed; <1 slows, >1 speeds up.
`seed`	integer \| null		null	0–4294967295	Best-effort deterministic sampling.
`previous_text`	string \| null		null	—	Text preceding this request, for continuity.
`next_text`	string \| null		null	—	Text following this request, for continuity.
`previous_request_ids`	string[] \| null		null	max 3	Request ids of prior samples, for continuity.
`next_request_ids`	string[] \| null		null	max 3	Request ids of later samples, for continuity.
`pronunciation_dictionary_locators`	object[] \| null		null	max 3	Pronunciation dictionary locators (id, version_id).
`apply_text_normalization`	enum		`auto`	`auto`, `on`, `off`	Controls number/date spell-out normalization.
`apply_language_text_normalization`	boolean		false	true/false	Language-specific normalization (Japanese only; raises latency).
`output_format`	enum (query)		`mp3_44100_128`	`mp3_22050_32`, `mp3_44100_32/64/96/128/192`, `pcm_8000/16000/22050/24000/44100`, `ulaw_8000`, `alaw_8000`, `opus_48000_*`, etc. (28 values)	Query param. `codec_samplerate_bitrate`; mp3_192 needs Creator+, pcm/wav 44.1kHz needs Pro+.
`enable_logging`	boolean (query)		true	true/false	Query param. false = zero-retention mode (enterprise only).

Our wrapper params (not part of the model schema): out (required — output audio filename, mp3) and mock (optional — test placeholder). This model has no format→size mapping (format_field is empty in our YAML).

Limits — model limits: seed 0–4294967295; up to 3 pronunciation_dictionary_locators, 3 previous_request_ids, 3 next_request_ids per request; output formats limited to the 28 output_format enum values (mp3 192kbps requires Creator tier or above; PCM/WAV at 44.1kHz requires Pro tier or above). No hard maximum text length is published for this endpoint, so no character cap is asserted here (our YAML's "keep under 5000 characters" is guidance, not a confirmed limit).
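As an illustration of the schema above, the following Python sketch separates the path, query, and body parameters and checks the published limits before a request is built. It describes the model's own schema only; how our `/v1/jobs/elevenlabs_tts_direct` wrapper forwards these fields is not covered here, and the function name and the placeholder voice id are hypothetical.

```python
QUERY_KEYS = {"output_format", "enable_logging"}  # marked "(query)" in the table
MAX_THREE = (
    "previous_request_ids",
    "next_request_ids",
    "pronunciation_dictionary_locators",
)


def split_direct_params(params: dict) -> tuple:
    """Return (voice_id, query, body) after checking the documented limits."""
    params = dict(params)  # do not mutate the caller's dict
    voice_id = params.pop("voice_id", None)
    if not voice_id:
        raise ValueError("voice_id is required (path parameter)")
    if not params.get("text"):
        raise ValueError("text is required")
    seed = params.get("seed")
    if seed is not None and not 0 <= seed <= 4294967295:
        raise ValueError("seed must be in 0-4294967295")
    for key in MAX_THREE:
        items = params.get(key)
        if items is not None and len(items) > 3:
            raise ValueError(f"{key} accepts at most 3 entries")
    query = {k: v for k, v in params.items() if k in QUERY_KEYS}
    body = {k: v for k, v in params.items() if k not in QUERY_KEYS}
    return voice_id, query, body


voice_id, query, body = split_direct_params({
    "voice_id": "VOICE_ID_PLACEHOLDER",  # hypothetical id
    "text": "Hello there.",
    "seed": 42,
    "output_format": "mp3_44100_128",
})
print(query)  # {'output_format': 'mp3_44100_128'}
print(body)   # {'text': 'Hello there.', 'seed': 42}
```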

ElevenLabs Sound Effects `elevenlabs_sfx`

Generate sound effects (foley, ambience, UI, impacts) from a text description using ElevenLabs' Sound Effects V2 model.

Call it via — audio(sfx) (the audio MCP tool with action: "sfx"; pass your description in prompt, which the worker maps to the model's text field) · raw: POST /v1/jobs/elevenlabs_sfx


Cost	Billed per second of audio
Mode / timeout	sync / 60s

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`text`	string	✓	—	max 450 characters	The text describing the sound effect to generate.
`duration_seconds`	number		none (model decides)	`0.5`–`22` (nullable)	Duration in seconds. If omitted/null, optimal duration is determined from the prompt.
`prompt_influence`	number		`0.3`	`0`–`1`	How closely to follow the prompt. Higher values mean less variation.
`output_format`	string (enum)		`mp3_44100_128`	`mp3_22050_32`, `mp3_44100_32`, `mp3_44100_64`, `mp3_44100_96`, `mp3_44100_128`, `mp3_44100_192`, `pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `pcm_44100`, `pcm_48000`, `ulaw_8000`, `alaw_8000`, `opus_48000_32`, `opus_48000_64`, `opus_48000_96`, `opus_48000_128`, `opus_48000_192`	Output audio format, as `codec_sampleRate_bitrate`.
`loop`	boolean		`false`	`true` / `false`	Whether to create a sound effect that loops smoothly.

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3) and mock (optional — test placeholder). No format mapping applies to this model (format_field is empty).

Limits — model limits:

text: max 450 characters.
duration_seconds: 0.5–22 seconds.
prompt_influence: 0–1.
Output codecs: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), μ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps).
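The limits listed above lend themselves to a client-side pre-check. The Python sketch below is one possible shape for it, not the worker's actual validation; the function name is hypothetical, and the ranges are treated as inclusive on the assumption that the published bounds are themselves allowed values.

```python
def check_sfx_input(text, duration_seconds=None, prompt_influence=0.3):
    """Return a list of problems; an empty list means the input is within limits."""
    problems = []
    if not text:
        problems.append("text is required")
    elif len(text) > 450:
        problems.append("text exceeds 450 characters")
    # None means "let the model pick the duration from the prompt".
    if duration_seconds is not None and not 0.5 <= duration_seconds <= 22:
        problems.append("duration_seconds must be between 0.5 and 22")
    if not 0 <= prompt_influence <= 1:
        problems.append("prompt_influence must be between 0 and 1")
    return problems


print(check_sfx_input("heavy wooden door creaking open", duration_seconds=3))  # []
print(check_sfx_input("rain", duration_seconds=30))
```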

Minimax Music v2.6 `minimax_music`

MiniMax Music 2.6 creates complete tracks with singing, backing music, and detailed arrangements from a style description and optional lyrics.

Call it via — audio(music) MCP tool · raw: POST /v1/jobs/minimax_music


Cost	30 cr per call
Mode / timeout	webhook / 8m (from our YAML)

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`prompt`	string	✓	—	10–2000 chars	Description of the music style, mood, genre, and scenario.
`lyrics`	string		`""`	max 3500 chars	Song lyrics. Use `\n` to separate lines. Supports structure tags: `[Intro]`, `[Verse]`, `[Pre Chorus]`, `[Chorus]`, `[Post Chorus]`, `[Hook]`, `[Bridge]`, `[Interlude]`, `[Transition]`, `[Build Up]`, `[Break]`, `[Inst]`, `[Solo]`, `[Outro]`. Required when `is_instrumental` is false.
`lyrics_optimizer`	boolean		`false`	true / false	When true and `lyrics` is empty, auto-generates lyrics from the prompt.
`is_instrumental`	boolean		`false`	true / false	When true, generates vocal-free instrumental music.
`audio_setting`	object		—	see below	Audio configuration settings (object).
`audio_setting.sample_rate`	integer		`44100`	16000, 24000, 32000, 44100	Sample rate of generated audio (Hz).
`audio_setting.bitrate`	integer		`256000`	32000, 64000, 128000, 256000	Bitrate of generated audio (bps).
`audio_setting.format`	string		`mp3`	mp3, wav, pcm	Output audio format.

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3), mock (optional — test placeholder). This model has no format_field, so our format wrapper is not used here.

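Limits — model limits: prompt 10–2000 characters; lyrics max 3500 characters; output formats mp3 / wav / pcm; sample rate up to 44100 Hz; bitrate up to 256000 bps. Lyrics are required when is_instrumental is false; the lyrics_optimizer description implies one exception, where lyrics_optimizer is true and lyrics is empty, in which case lyrics are generated from the prompt.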
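A client-side pre-check for these limits might look like the Python sketch below. It is not the worker's validation. One interpretation is an assumption on our part: empty lyrics are accepted when `lyrics_optimizer` is true, because the parameter table says lyrics are then auto-generated from the prompt. The function name is hypothetical.

```python
def check_music_input(prompt, lyrics="", is_instrumental=False, lyrics_optimizer=False):
    """Return a list of problems; an empty list means the input is within limits."""
    problems = []
    if not 10 <= len(prompt) <= 2000:
        problems.append("prompt must be 10-2000 characters")
    if len(lyrics) > 3500:
        problems.append("lyrics exceed 3500 characters")
    # Assumption: lyrics_optimizer=true satisfies the lyrics requirement.
    if not is_instrumental and not lyrics and not lyrics_optimizer:
        problems.append("lyrics are required unless is_instrumental or lyrics_optimizer is true")
    return problems


print(check_music_input("warm lo-fi hip hop, mellow keys", is_instrumental=True))  # []
print(check_music_input("upbeat synth pop about summer"))
```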

Audio Concat `audio_concat`

Field	Value
Category	audio_process
Mode	sync
Timeout	30s
Cost	Free (cost_per_unit: 0)
Handler	`execAudioConcat` → `AudioConcat` (`internal/ffmpeg/audio_concat.go`)
MCP route	`audio(action: "concat")` — maps the tool's `tracks[]` arg to the model's `files` field

Description: Concatenate multiple audio files in order. Accepts a mix of input formats — every input is decoded and re-encoded to the target output format, then joined with ffmpeg's concat demuxer (-c copy, no second re-encode).

Parameters (from YAML input_schema, cross-checked against handler):

Param	Type	Required	Default	Notes
`files`	array of string	yes	—	Ordered list of audio paths (any mix of mp3/wav/aac/flac/ogg). Handler errors if empty; non-string entries rejected.
`out`	string	yes	—	Output audio path.
`silence_between`	number	no	0	Seconds of silence inserted between files (not after the last). Implemented via generated `anullsrc` mono 44.1 kHz segments.
`output_format`	string	no	inferred from `out` ext, else mp3	enum: mp3, aac, wav, flac, ogg. Read by handler ✓.
`sample_rate`	integer	no	source rate	Target Hz; applied via `-ar`. Read by handler ✓.

Behaviour notes:

Single-file fast path: with one file and silence_between <= 0, if input/output extensions match and no sample_rate is given, it byte-copies the file (acts as a pass-through). Otherwise it delegates to AudioConvert — i.e. a single file makes this a format converter.
Codec mapping (via outputCodecArgs): wav→pcm_s16le, flac→flac, ogg→libvorbis 192k, aac→aac 192k, default→libmp3lame 192k.
Concat-list injection is guarded: a file path containing a quote or newline is rejected.
Returns outputs.audio / outputs.local_path plus metrics (num_files, total_duration_sec, silence_between).
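To make the concat-list guard and the silence placement concrete, here is a minimal Python sketch of building the list file that ffmpeg's concat demuxer reads. It is not the code in `audio_concat.go`: the function name and `silence_path` are hypothetical, the sketch takes already re-encoded paths as given, and it assumes "a quote" covers both single and double quotes.

```python
from typing import List, Optional


def build_concat_list(files: List[str], silence_path: Optional[str] = None) -> str:
    """Return concat-demuxer list text, with silence between files but not after the last."""
    if not files:
        raise ValueError("files must not be empty")
    lines = []
    for index, path in enumerate(files):
        if not isinstance(path, str):
            raise ValueError("every entry in files must be a string")
        # Guard against breaking out of the quoted path in the list file.
        if "'" in path or '"' in path or "\n" in path:
            raise ValueError(f"unsafe path: {path!r}")
        lines.append(f"file '{path}'")
        if silence_path and index < len(files) - 1:
            lines.append(f"file '{silence_path}'")
    return "\n".join(lines) + "\n"


print(build_concat_list(["intro.mp3", "body.mp3"], silence_path="silence.mp3"))
```

A list like this would then be passed to ffmpeg with `-f concat -i list.txt -c copy`, matching the join step in the description.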

Audio-Only Mix `audio_only_mix`

Field	Value
Category	audio_process
Mode	sync
Timeout	2m
Cost	Free (cost_per_unit: 0)
Handler	`execAudioOnlyMix` → `AudioOnlyMix` (`internal/ffmpeg/audio_only_mix.go`)
MCP route	`audio(action: "mix")` — passes `tracks[]` (and the optional `music` / `music_level`) through

Description: Mix audio files into a single audio file. Two modes: a flat mix of 2+ tracks with ffmpeg's amix filter, or — when the optional music bed is set — a music-under-voice mix where tracks are the primary program (1+ allowed) and the bed is auto-fit to their length and ducked under them. Unlike video_audio_mix (which overlays audio onto a video), this produces a pure audio file with no video track.

Parameters:

Param	Type	Required	Default	Notes
`tracks`	array of string	yes	—	Audio paths. Flat mix: min 2, all at equal level. With `music`: the primary program (e.g. voiceover), min 1.
`music`	string	no	—	Optional background music bed. When set, the bed is auto-fit to the tracks' length (trimmed if longer, looped if shorter) and ducked under them.
`music_level`	number	no	`-18`	Music bed level in dB relative to the voice (used only with `music`).
`out`	string	yes	—	Output audio path.

Behaviour notes (code-only, not exposed as params):

Flat mix: all tracks are mixed at equal levels; output is normalized (amix=...:normalize=1) to prevent clipping; output duration equals the longest input.
Music-under-voice: the bed never runs past the voice and never drowns it (ducked at music_level dB).
Output is forced to stereo (-ac 2).
For per-layer volume / timing offsets onto a video, use video_audio_mix instead.
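The two modes and the meaning of `music_level` can be illustrated with a short Python sketch. The mode check mirrors the track minimums in the parameter table. The dB conversion is the standard amplitude formula, shown only to give a feel for the default of -18 dB; how the handler actually applies the level is not documented here, and both function names are hypothetical.

```python
def pick_mix_mode(tracks, music=None):
    """Flat mix needs 2+ tracks; music-under-voice needs 1+ tracks plus a bed."""
    minimum = 1 if music else 2
    if len(tracks) < minimum:
        raise ValueError(f"need at least {minimum} track(s) for this mode")
    return "music_under_voice" if music else "flat"


def db_to_gain(level_db):
    """Convert a level in dB to a linear amplitude multiplier."""
    return 10 ** (level_db / 20)


print(pick_mix_mode(["voice.mp3"], music="bed.mp3"))  # music_under_voice
print(pick_mix_mode(["a.mp3", "b.mp3"]))              # flat
print(round(db_to_gain(-18), 3))                      # 0.126
```

In other words, a bed at -18 dB sits at roughly one eighth of the voice's amplitude.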

Audio Trim `audio_trim`

Field	Value
Category	audio_process
Mode	sync
Timeout	1m
Cost	Free (cost_per_unit: 0)
MCP route	`audio(action: "trim")` — maps the tool's `audio` arg to the model's `in` field

Description: Cut an audio file to a start time and optional duration — e.g. shorten a long music bed before mixing, or drop a lead-in/lead-out. Output timestamps are rebased to 0, so the result is a clean seekable clip.

Parameters:

Param	Type	Required	Default	Notes
`in`	string	yes	—	Input audio path (the MCP `trim` action's `audio` argument).
`out`	string	yes	—	Output audio path.
`start_sec`	number	no	0	Where the kept window starts, in seconds (≥ 0).
`duration_sec`	number	no	—	Length of the kept window. Omit (or ≤ 0) to keep everything from `start_sec` to the end.
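The window rules in the table can be sketched as a small Python helper. This is an illustration, not the handler's code: the function name is hypothetical, and clamping a window that runs past the end of the file is an assumption about behaviour the page does not describe.

```python
def kept_window(total_sec, start_sec=0.0, duration_sec=None):
    """Return (start, end) of the kept window, in seconds of the input file."""
    if start_sec < 0:
        raise ValueError("start_sec must be >= 0")
    start = min(start_sec, total_sec)
    if duration_sec is None or duration_sec <= 0:
        end = total_sec  # omitted or <= 0: keep everything to the end
    else:
        end = min(start + duration_sec, total_sec)  # assumed clamp at end of file
    return start, end


print(kept_window(120, start_sec=30, duration_sec=45))  # (30, 75)
print(kept_window(120, start_sec=30))                   # (30, 120)
```

Because output timestamps are rebased to 0, the resulting clip runs from 0 to `end - start`.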

Audio Convert `audio_convert`

Field	Value
Category	audio_process
Mode	sync
Timeout	30s
Cost	Free (cost_per_unit: 0)
Handler	`execAudioConvert` → `AudioConvert` (`internal/ffmpeg/audio_convert.go`)
MCP route	None — internal-only (REST `POST /v1/jobs/audio_convert` or pipeline step). No `audio(...)` action routes here.

Description: Convert an audio file between formats, change sample rate, and/or adjust bitrate. Input format is auto-detected; output is chosen by the format key (see mismatch below) or inferred from the out extension.

Parameters (from YAML — see mismatch flag):

Param	Type	Required	Default	Notes
`in`	string	yes	—	Input audio path.
`out`	string	yes	—	Output audio path; format inferred from extension if no format key set.
`output_format`	string	no	inferred from `out` ext	enum: mp3, mp3_128, mp3_320, aac, aac_256, wav, wav_48k, flac, ogg, opus. ⚠ See mismatch.
`sample_rate`	integer	no	original	Target Hz (e.g. 44100, 48000); applied via `-ar`. Read by handler ✓.

⚠ YAML ↔ handler mismatch (important): The YAML declares the format selector as output_format, but execAudioConvert reads inputs["format"] (executor.go:250), not output_format. Consequences:

A caller passing output_format exactly as the YAML documents will have it silently ignored; the handler falls back to inferring the format from the out file extension.
The extended enum values that have no matching extension — mp3_128, mp3_320, aac_256, wav_48k, opus — are only reachable by passing the undocumented key format (e.g. format: "mp3_320"). Format/bitrate table (handler audioCodecs): mp3=192k, mp3_128=128k, mp3_320=320k, aac=192k, aac_256=256k, wav/wav_48k=pcm_s16le (wav_48k forces -ar 48000), flac=lossless, ogg=libvorbis 192k, opus=libopus 128k.
Recommendation: either rename the YAML field to format, or update the handler to also read output_format (as audio_concat does), or have the MCP/handler alias the two keys.

Behaviour notes: Unknown format → error listing valid keys. Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, format, codec).
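The mismatch described above is easier to see in code. The Python sketch below models the lookup order as this page describes it (read `format`, otherwise infer from the `out` extension) and, behind a flag, the aliasing fix from the recommendation. It is a model of the behaviour, not the Go handler; the function name and the `accept_alias` flag are hypothetical.

```python
import os

# Format keys with their codec/bitrate notes, copied from the handler table above.
AUDIO_CODECS = {
    "mp3": "192k", "mp3_128": "128k", "mp3_320": "320k",
    "aac": "192k", "aac_256": "256k",
    "wav": "pcm_s16le", "wav_48k": "pcm_s16le (-ar 48000)",
    "flac": "lossless", "ogg": "libvorbis 192k", "opus": "libopus 128k",
}


def resolve_format(inputs, accept_alias=False):
    """Pick the format key the way the page describes the handler doing it."""
    key = inputs.get("format")  # the key the handler actually reads
    if not key and accept_alias:
        key = inputs.get("output_format")  # proposed fix: honour the YAML name too
    if not key:
        key = os.path.splitext(inputs["out"])[1].lstrip(".").lower()
    if key not in AUDIO_CODECS:
        raise ValueError(f"unknown format {key!r}; valid keys: {sorted(AUDIO_CODECS)}")
    return key


job = {"in": "take.wav", "out": "take.mp3", "output_format": "mp3_320"}
print(resolve_format(job))                     # mp3  (output_format is ignored)
print(resolve_format(job, accept_alias=True))  # mp3_320
print(resolve_format({"in": "take.wav", "out": "take.mp3", "format": "mp3_320"}))  # mp3_320
```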

Audio Tail Fade `tail_fade`

Field	Value
Category	audio_process
Mode	sync
Timeout	30s
Cost	Free (cost_per_unit: 0)
Handler	`execTailFade` → `TailFade` (`internal/ffmpeg/tail_fade.go`)
MCP route	None — internal-only (REST `POST /v1/jobs/tail_fade` or pipeline step). No `audio(...)` action routes here.

Description: Add a silence pad and a fade-out at the end of an audio file to prevent an abrupt ending (the "audio cuts off" bug). Intended to run after voiceover generation, before assembly. Purely parameter-driven — no prompt.

Parameters:

Param	Type	Required	Default	Notes
`in`	string	yes	—	Input audio path (workdir-relative).
`out`	string	yes	—	Output audio path.
`pad_sec`	number	no	0.8	Seconds of trailing silence added (ffmpeg `apad=pad_dur`).
`fade_sec`	number	no	0.6	Fade-out duration (ffmpeg `afade=t=out`).

Behaviour notes:

Defaults are applied when the value is <= 0, so passing 0 yields the default (0.8 / 0.6), not a true zero. This model therefore cannot be run with padding or fade disabled.
The fade start point is computed internally as input_duration + 0.1s — it is not a parameter.
Output encoded with -q:a 2 (VBR ~190 kbps mp3-class quality, format from out ext).
Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, pad_sec, fade_sec, fade_start_sec).
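The default handling and the fade start point described above can be sketched in Python. The filter names and options (`apad=pad_dur`, `afade=t=out` with `st` and `d`) are real ffmpeg options, but the way they are chained into one filter string here is an assumption, as is the function name; the actual construction lives in `tail_fade.go`.

```python
DEFAULT_PAD_SEC = 0.8
DEFAULT_FADE_SEC = 0.6


def tail_fade_plan(input_duration_sec, pad_sec=0.0, fade_sec=0.0):
    """Apply the <= 0 default rule and compute the fade start as documented."""
    pad = pad_sec if pad_sec > 0 else DEFAULT_PAD_SEC      # 0 falls back to the default
    fade = fade_sec if fade_sec > 0 else DEFAULT_FADE_SEC  # 0 falls back to the default
    fade_start = input_duration_sec + 0.1                  # not a parameter
    audio_filter = f"apad=pad_dur={pad:.3f},afade=t=out:st={fade_start:.3f}:d={fade:.3f}"
    return {
        "pad_sec": pad,
        "fade_sec": fade,
        "fade_start_sec": fade_start,
        "filter": audio_filter,
    }


plan = tail_fade_plan(10.0, pad_sec=0, fade_sec=0)
print(plan["filter"])  # apad=pad_dur=0.800,afade=t=out:st=10.100:d=0.600
```

With the defaults, the fade runs entirely inside the padded silence: it starts 0.1 s after the original audio ends and lasts 0.6 s of the 0.8 s pad.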
