
Audio models

Speech, sound effects, and music (model input schemas), plus local audio processing (our ffmpeg implementation, free).

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

ElevenLabs TTS v3 elevenlabs_tts_v3

Expressive text-to-speech with inline audio-tag emotional control and 70+ language support, powered by ElevenLabs' Eleven v3 model.

Call it via audio(action: "speak") (MCP audio tool) · raw: POST /v1/jobs/elevenlabs_tts_v3

Cost: 20 cr per 1,000 characters
Mode / timeout: sync / 60s

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| text | string | | | | Text to convert to speech. Supports inline audio tags like [laughs], [whispers], [excited]. |
| voice | string | | Rachel | e.g. Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill (or a voice ID) | Voice name or ID. |
| stability | float | | 0.5 | 0–1 | Voice stability. Lower = more expressive variation; higher = more consistent delivery. |
| similarity_boost | float | | 0.75 | 0–1 | How closely the output matches the reference voice. |
| speed | float | | 1 | | Playback speed multiplier. |
| language_code | string | | | ISO 639-1 (e.g. en, ru, es, fr, de, ja, ko, zh) | Forces a specific output language. |
| apply_text_normalization | enum | | auto | auto, on, off | Controls spelling-out of numbers, abbreviations, etc. |
| seed | int | | | | Random seed for reproducibility. |
| timestamps | bool | | false | | When true, returns per-word timestamps in the response. |
| output_format | enum | | mp3_44100_128 | mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 | Output codec, sample rate, and bitrate. |

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, .mp3) and mock (optional — test placeholder, no real generation). This model does not use the format→size mapping (format_field is empty).

Limits — Pricing is 20 cr per 1,000 characters (a 500-char paragraph = 10 cr; a 10,000-char story = 200 cr). Supported output formats: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), µ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps). 70+ languages supported. No hard maximum character count is published.
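The pricing rule and wrapper params above can be sketched in Python. This is a minimal illustration, not the service's client: the helper names `estimate_tts_v3_credits` and `build_tts_v3_inputs` are hypothetical, the flat dict body shape is an assumption, and since this page does not say how fractional credits are rounded, the estimate is left pro-rated.

```python
def estimate_tts_v3_credits(text: str) -> float:
    """Pro-rated estimate at 20 cr per 1,000 characters (rounding rule is an assumption)."""
    return 20 * len(text) / 1000


def build_tts_v3_inputs(text, out, voice="Rachel", stability=0.5,
                        similarity_boost=0.75, mock=False):
    """Build an inputs dict for elevenlabs_tts_v3 (hypothetical helper; body shape assumed)."""
    for name, value in (("stability", stability), ("similarity_boost", similarity_boost)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be within 0-1, got {value}")
    inputs = {
        "text": text,
        "voice": voice,
        "stability": stability,
        "similarity_boost": similarity_boost,
        "out": out,  # wrapper param: workdir-relative output path
    }
    if mock:
        inputs["mock"] = True  # free placeholder result, no real generation
    return inputs


print(estimate_tts_v3_credits("x" * 500))     # 10.0
print(estimate_tts_v3_credits("x" * 10_000))  # 200.0
```

Setting `mock=True` while wiring up a pipeline avoids spending credits until the text and voice are final.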

ElevenLabs TTS (direct) elevenlabs_tts_direct

Converts text into speech using a chosen ElevenLabs voice_id (cloned, linked, or library voice) and returns an audio file.

Call it via audio(speak, actor_id=…) (routes a configured actor's voice through this model; plain audio(speak) without actor_id uses elevenlabs_tts_v3 instead). Also used internally by video(scene) for per-line narration · raw: POST /v1/jobs/elevenlabs_tts_direct

Cost: 20 cr per call
Mode / timeout: sync / 60s

Parameters — the model's input schema (voice_id is a path parameter; the rest are request-body fields):

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| voice_id | string | | | | Path param. ID of the voice to use (from Get Voices). |
| text | string | | | | The text that will be converted into speech. |
| model_id | string | | eleven_multilingual_v2 | any TTS-capable model id | Model identifier; must support text-to-speech. |
| language_code | string or null | | null | ISO 639-1 | Enforces a language for the model and text normalization. |
| voice_settings | object or null | | null | see sub-properties | Per-request overrides of the voice's stored settings. |
| voice_settings.stability | number | | 0.5 | 0.0–1.0 | How stable the voice is / randomness between generations. |
| voice_settings.similarity_boost | number | | 0.75 | 0.0–1.0 | How closely the AI adheres to the original voice. |
| voice_settings.style | number | | 0 | 0.0–1.0 | Style exaggeration of the voice. |
| voice_settings.use_speaker_boost | boolean | | true | true/false | Boosts similarity to the original speaker. |
| voice_settings.speed | number | | 1.0 | ~0.7–1.2 | Playback speed; <1 slows, >1 speeds up. |
| seed | integer or null | | null | 0–4294967295 | Best-effort deterministic sampling. |
| previous_text | string or null | | null | | Text preceding this request, for continuity. |
| next_text | string or null | | null | | Text following this request, for continuity. |
| previous_request_ids | string[] or null | | null | max 3 | Request ids of prior samples, for continuity. |
| next_request_ids | string[] or null | | null | max 3 | Request ids of later samples, for continuity. |
| pronunciation_dictionary_locators | object[] or null | | null | max 3 | Pronunciation dictionary locators (id, version_id). |
| apply_text_normalization | enum | | auto | auto, on, off | Controls number/date spell-out normalization. |
| apply_language_text_normalization | boolean | | false | true/false | Language-specific normalization (Japanese only; raises latency). |
| output_format | enum (query) | | mp3_44100_128 | mp3_22050_32, mp3_44100_32/64/96/128/192, pcm_8000/16000/22050/24000/44100, ulaw_8000, alaw_8000, opus_48000_*, etc. (28 values) | Query param. codec_samplerate_bitrate; mp3_192 needs Creator+, pcm/wav 44.1kHz needs Pro+. |
| enable_logging | boolean (query) | | true | true/false | Query param. false = zero-retention mode (enterprise only). |

Our wrapper params (not part of the model schema): out (required — output audio filename, mp3) and mock (optional — test placeholder). This model has no format→size mapping (format_field is empty in our YAML).

Limits — model limits: seed 0–4294967295; up to 3 pronunciation_dictionary_locators, 3 previous_request_ids, 3 next_request_ids per request; output formats limited to the 28 output_format enum values (mp3 192kbps requires Creator tier or above; PCM/WAV at 44.1kHz requires Pro tier or above). No hard maximum text length is published for this endpoint, so no character cap is asserted here (our YAML's "keep under 5000 characters" is guidance, not a confirmed limit).
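The documented limits can be checked client-side before a job is submitted. The sketch below is a hypothetical pre-flight helper (`check_direct_limits` is not part of any API), and it assumes the request body is a plain dict keyed by the parameter names in the table.

```python
MAX_SEED = 4294967295  # documented upper bound for seed


def check_direct_limits(body: dict) -> list:
    """Return a list of limit violations for an elevenlabs_tts_direct body (hypothetical helper)."""
    problems = []

    seed = body.get("seed")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        problems.append("seed must be within 0-4294967295")

    # Each of these list fields is documented as "max 3".
    for key in ("pronunciation_dictionary_locators",
                "previous_request_ids", "next_request_ids"):
        items = body.get(key) or []
        if len(items) > 3:
            problems.append(f"{key} accepts at most 3 entries")

    # The 0.0-1.0 sub-properties of voice_settings.
    settings = body.get("voice_settings") or {}
    for key in ("stability", "similarity_boost", "style"):
        value = settings.get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"voice_settings.{key} must be within 0.0-1.0")

    return problems


print(check_direct_limits({"text": "Hello", "seed": 42}))  # []
```

No text-length check is included because no character cap is confirmed for this endpoint.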

ElevenLabs Sound Effects elevenlabs_sfx

Generate sound effects (foley, ambience, UI, impacts) from a text description using ElevenLabs' Sound Effects V2 model.

Call it via audio(sfx) (the audio MCP tool with action: "sfx"; pass your description in prompt, which the worker maps to the model's text field) · raw: POST /v1/jobs/elevenlabs_sfx

Cost: Billed per second of audio
Mode / timeout: sync / 60s

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| text | string | | | max 450 characters | The text describing the sound effect to generate. |
| duration_seconds | number | | none (model decides) | 0.5–22 (nullable) | Duration in seconds. If omitted/null, optimal duration is determined from the prompt. |
| prompt_influence | number | | 0.3 | 0–1 | How closely to follow the prompt. Higher values mean less variation. |
| output_format | string (enum) | | mp3_44100_128 | mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 | Output audio format, as codec_sampleRate_bitrate. |
| loop | boolean | | false | true / false | Whether to create a sound effect that loops smoothly. |

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3) and mock (optional — test placeholder). No format mapping applies to this model (format_field is empty).

Limits — model limits:

  • text: max 450 characters.
  • duration_seconds: 0.5–22 seconds.
  • prompt_influence: 0–1.
  • Output codecs: MP3 (22.05/44.1 kHz, 32–192 kbps), PCM (8–48 kHz), μ-law/A-law 8 kHz, Opus 48 kHz (32–192 kbps).
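These limits translate directly into a small input builder. The following is a sketch under assumptions: `build_sfx_inputs` is a hypothetical name, and the flat dict is assumed to be the body for the raw job route, where the description goes in the model's text field.

```python
def build_sfx_inputs(description, out, duration_seconds=None,
                     prompt_influence=0.3, loop=False):
    """Validate against the documented limits and build an elevenlabs_sfx inputs dict."""
    if len(description) > 450:
        raise ValueError("text is limited to 450 characters")
    if duration_seconds is not None and not 0.5 <= duration_seconds <= 22:
        raise ValueError("duration_seconds must be within 0.5-22")
    if not 0 <= prompt_influence <= 1:
        raise ValueError("prompt_influence must be within 0-1")

    inputs = {
        "text": description,
        "prompt_influence": prompt_influence,
        "loop": loop,
        "out": out,  # wrapper param: workdir-relative output path
    }
    # Leaving duration_seconds out lets the model pick a duration from the prompt.
    if duration_seconds is not None:
        inputs["duration_seconds"] = duration_seconds
    return inputs


print(build_sfx_inputs("heavy wooden door creaking open", "sfx/door.mp3", duration_seconds=3))
```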

Minimax Music v2.6 minimax_music

MiniMax Music 2.6 creates complete tracks with singing, backing music, and detailed arrangements from a style description and optional lyrics.

Call it via the audio(music) MCP tool · raw: POST /v1/jobs/minimax_music

Cost: 30 cr per call
Mode / timeout: webhook / 8m (from our YAML)

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | | | 10–2000 chars | Description of the music style, mood, genre, and scenario. |
| lyrics | string | | "" | max 3500 chars | Song lyrics. Use \n to separate lines. Supports structure tags: [Intro], [Verse], [Pre Chorus], [Chorus], [Post Chorus], [Hook], [Bridge], [Interlude], [Transition], [Build Up], [Break], [Inst], [Solo], [Outro]. Required when is_instrumental is false. |
| lyrics_optimizer | boolean | | false | true / false | When true and lyrics is empty, auto-generates lyrics from the prompt. |
| is_instrumental | boolean | | false | true / false | When true, generates vocal-free instrumental music. |
| audio_setting | object | | | see below | Audio configuration settings (object). |
| audio_setting.sample_rate | integer | | 44100 | 16000, 24000, 32000, 44100 | Sample rate of generated audio (Hz). |
| audio_setting.bitrate | integer | | 256000 | 32000, 64000, 128000, 256000 | Bitrate of generated audio (bps). |
| audio_setting.format | string | | mp3 | mp3, wav, pcm | Output audio format. |

Our wrapper params (not part of the model schema): out (required — workdir-relative output path, e.g. .mp3), mock (optional — test placeholder). This model has no format_field, so our format wrapper is not used here.

Limits — model limits: prompt 10–2000 characters; lyrics max 3500 characters; output formats mp3 / wav / pcm; sample rate up to 44100 Hz; bitrate up to 256000 bps. Lyrics are required when is_instrumental is false.
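The prompt and lyrics rules can be expressed as a short pre-flight check. This is a sketch with a hypothetical helper name (`check_music_inputs`). It also assumes that lyrics_optimizer: true lifts the lyrics requirement, which is inferred from the parameter description rather than confirmed.

```python
def check_music_inputs(prompt, lyrics="", is_instrumental=False, lyrics_optimizer=False):
    """Return a list of violations of the documented minimax_music limits."""
    problems = []
    if not 10 <= len(prompt) <= 2000:
        problems.append("prompt must be 10-2000 characters")
    if len(lyrics) > 3500:
        problems.append("lyrics must be at most 3500 characters")
    # Assumption: the optimizer writes lyrics from the prompt, so empty lyrics are fine then.
    needs_lyrics = not is_instrumental and not lyrics_optimizer
    if needs_lyrics and not lyrics.strip():
        problems.append("lyrics are required when is_instrumental is false")
    return problems


print(check_music_inputs("Warm lo-fi hip hop for late-night studying"))
# ['lyrics are required when is_instrumental is false']
print(check_music_inputs("Warm lo-fi hip hop for late-night studying", is_instrumental=True))
# []
```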

Audio Concat audio_concat

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioConcat / AudioConcat (internal/ffmpeg/audio_concat.go) |
| MCP route | audio(action: "concat"), which maps the tool's tracks[] arg to the model's files field |

Description: Concatenate multiple audio files in order. Accepts a mix of input formats — every input is decoded and re-encoded to the target output format, then joined with ffmpeg's concat demuxer (-c copy, no second re-encode).

Parameters (from YAML input_schema, cross-checked against handler):

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| files | array of string | yes | | Ordered list of audio paths (any mix of mp3/wav/aac/flac/ogg). Handler errors if empty; non-string entries rejected. |
| out | string | yes | | Output audio path. |
| silence_between | number | no | 0 | Seconds of silence inserted between files (not after the last). Implemented via generated anullsrc mono 44.1 kHz segments. |
| output_format | string | no | inferred from out ext, else mp3 | enum: mp3, aac, wav, flac, ogg. Read by handler ✓. |
| sample_rate | integer | no | source rate | Target Hz; applied via -ar. Read by handler ✓. |

Behaviour notes:

  • Single-file fast path: with one file and silence_between <= 0, if input/output extensions match and no sample_rate is given, it byte-copies the file (acts as a pass-through). Otherwise it delegates to AudioConvert — i.e. a single file makes this a format converter.
  • Codec mapping (via outputCodecArgs): wav→pcm_s16le, flac→flac, ogg→libvorbis 192k, aac→aac 192k, default→libmp3lame 192k.
  • Concat-list injection is guarded: a file path containing a quote or newline is rejected.
  • Returns outputs.audio / outputs.local_path plus metrics (num_files, total_duration_sec, silence_between).
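The branching described in these notes can be modelled as a small decision function. This is an illustrative Python sketch, not the Go handler: `plan_concat` and `resolve_output_format` are hypothetical names, and rejecting both single and double quotes is an assumption, since the note only says "a quote or newline".

```python
import os

KNOWN_FORMATS = ("mp3", "aac", "wav", "flac", "ogg")


def _ext(path):
    """Lower-cased file extension without the dot."""
    return os.path.splitext(path)[1].lstrip(".").lower()


def resolve_output_format(out, output_format=None):
    """Explicit output_format wins; otherwise infer from the out extension, else mp3."""
    if output_format:
        return output_format
    return _ext(out) if _ext(out) in KNOWN_FORMATS else "mp3"


def plan_concat(files, out, silence_between=0, sample_rate=None):
    """Return 'copy', 'convert', or 'concat' following the behaviour notes."""
    if not files:
        raise ValueError("files must not be empty")
    for path in files:
        if not isinstance(path, str):
            raise ValueError("files entries must be strings")
        if any(ch in path for ch in "'\"\n"):  # assumption: both quote styles are rejected
            raise ValueError("file paths may not contain quotes or newlines")
    if len(files) == 1 and silence_between <= 0:
        if _ext(files[0]) == _ext(out) and sample_rate is None:
            return "copy"      # byte-for-byte pass-through
        return "convert"       # single file: behaves as a format converter
    return "concat"            # re-encode each input, then join


print(plan_concat(["intro.mp3"], "final.mp3"))                 # copy
print(plan_concat(["intro.wav"], "final.mp3"))                 # convert
print(plan_concat(["intro.mp3", "outro.wav"], "final.flac"))   # concat
print(resolve_output_format("final.xyz"))                      # mp3
```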

Audio-Only Mix audio_only_mix

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 2m |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioOnlyMix / AudioOnlyMix (internal/ffmpeg/audio_only_mix.go) |
| MCP route | audio(action: "mix"), which passes tracks[] (and the optional music / music_level) through |

Description: Mix audio files into a single audio file. Two modes: a flat mix of 2+ tracks with ffmpeg's amix filter, or — when the optional music bed is set — a music-under-voice mix where tracks are the primary program (1+ allowed) and the bed is auto-fit to their length and ducked under them. Unlike video_audio_mix (which overlays audio onto a video), this produces a pure audio file with no video track.

Parameters:

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| tracks | array of string | yes | | Audio paths. Flat mix: min 2, all at equal level. With music: the primary program (e.g. voiceover), min 1. |
| music | string | no | | Optional background music bed. When set, the bed is auto-fit to the tracks' length (trimmed if longer, looped if shorter) and ducked under them. |
| music_level | number | no | -18 | Music bed level in dB relative to the voice (used only with music). |
| out | string | yes | | Output audio path. |

Behaviour notes (code-only, not exposed as params):

  • Flat mix: all tracks are mixed at equal levels; output is normalized (amix=...:normalize=1) to prevent clipping; output duration equals the longest input.
  • Music-under-voice: the bed never runs past the voice and never drowns it (ducked at music_level dB).
  • Output is forced to stereo (-ac 2).
  • For per-layer volume / timing offsets onto a video, use video_audio_mix instead.
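The two modes differ mainly in the minimum track count and in whether music is set. A minimal sketch of an input builder follows; `build_mix_inputs` is a hypothetical helper, and the flat dict shape is assumed.

```python
def build_mix_inputs(tracks, out, music=None, music_level=-18):
    """Build audio_only_mix inputs, enforcing the documented minimum track counts."""
    minimum = 1 if music else 2  # music-under-voice needs 1+ tracks, flat mix needs 2+
    if len(tracks) < minimum:
        raise ValueError(f"need at least {minimum} track(s) for this mode")
    inputs = {"tracks": list(tracks), "out": out}
    if music:
        inputs["music"] = music
        inputs["music_level"] = music_level  # dB relative to the voice
    return inputs


# Flat mix: two or more tracks at equal level.
print(build_mix_inputs(["vo.mp3", "ambience.mp3"], "mix.mp3"))

# Music-under-voice: one voice track with a ducked bed.
print(build_mix_inputs(["vo.mp3"], "mix.mp3", music="bed.mp3"))
```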

Audio Trim audio_trim

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 1m |
| Cost | Free (cost_per_unit: 0) |
| MCP route | audio(action: "trim"), which maps the tool's audio arg to the model's in field |

Description: Cut an audio file to a start time and optional duration — e.g. shorten a long music bed before mixing, or drop a lead-in/lead-out. Output timestamps are rebased to 0, so the result is a clean seekable clip.

Parameters:

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | | Input audio path (the MCP trim action's audio argument). |
| out | string | yes | | Output audio path. |
| start_sec | number | no | 0 | Where the kept window starts, in seconds (≥ 0). |
| duration_sec | number | no | | Length of the kept window. Omit (or ≤ 0) to keep everything from start_sec to the end. |
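How start_sec and duration_sec select the kept window can be shown with a small calculation. This is a sketch, not the handler's code: `kept_window` is a hypothetical name, and clamping the window to the end of the file is an assumption about out-of-range values.

```python
def kept_window(total_sec, start_sec=0, duration_sec=None):
    """Return (source_start, source_end, output_length) for a trim request."""
    if start_sec < 0:
        raise ValueError("start_sec must be >= 0")
    start = min(start_sec, total_sec)          # assumption: clamp to the file length
    if duration_sec is None or duration_sec <= 0:
        end = total_sec                        # keep everything from start_sec to the end
    else:
        end = min(start + duration_sec, total_sec)
    # Output timestamps are rebased to 0, so the clip runs from 0 to end - start.
    return (start, end, end - start)


print(kept_window(60, start_sec=10, duration_sec=20))  # (10, 30, 20)
print(kept_window(60, start_sec=45))                   # (45, 60, 15)
```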

Audio Convert audio_convert

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execAudioConvert / AudioConvert (internal/ffmpeg/audio_convert.go) |
| MCP route | None; internal-only (REST POST /v1/jobs/audio_convert or pipeline step). No audio(...) action routes here. |

Description: Convert an audio file between formats, change sample rate, and/or adjust bitrate. Input format is auto-detected; output is chosen by the format key (see mismatch below) or inferred from the out extension.

Parameters (from YAML — see mismatch flag):

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | | Input audio path. |
| out | string | yes | | Output audio path; format inferred from extension if no format key set. |
| output_format | string | no | inferred from out ext | enum: mp3, mp3_128, mp3_320, aac, aac_256, wav, wav_48k, flac, ogg, opus. ⚠ See mismatch. |
| sample_rate | integer | no | original | Target Hz (e.g. 44100, 48000); applied via -ar. Read by handler ✓. |

⚠ YAML ↔ handler mismatch (important): The YAML declares the format selector as output_format, but execAudioConvert reads inputs["format"] (executor.go:250), not output_format. Consequences:

  • A caller passing output_format exactly as the YAML documents will have it silently ignored; the handler falls back to inferring the format from the out file extension.
  • The extended enum values that have no matching extension — mp3_128, mp3_320, aac_256, wav_48k, opus — are only reachable by passing the undocumented key format (e.g. format: "mp3_320"). Format/bitrate table (handler audioCodecs): mp3=192k, mp3_128=128k, mp3_320=320k, aac=192k, aac_256=256k, wav/wav_48k=pcm_s16le (wav_48k forces -ar 48000), flac=lossless, ogg=libvorbis 192k, opus=libopus 128k.
  • Recommendation: rename the YAML field to format, update the handler to also read output_format (as audio_concat does), or have the MCP/handler alias the two keys.

Behaviour notes: Unknown format → error listing valid keys. Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, format, codec).
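Until the mismatch is resolved, a caller can work around it by sending the key the handler actually reads. The sketch below is a hypothetical client-side helper (`build_convert_inputs`). Its format table simply restates the handler values listed above, and the flat dict body shape is an assumption.

```python
# Format keys and their settings, as listed in the mismatch notes above.
FORMAT_SETTINGS = {
    "mp3": "192k", "mp3_128": "128k", "mp3_320": "320k",
    "aac": "192k", "aac_256": "256k",
    "wav": "pcm_s16le", "wav_48k": "pcm_s16le, forces -ar 48000",
    "flac": "lossless", "ogg": "libvorbis 192k", "opus": "libopus 128k",
}


def build_convert_inputs(in_path, out, fmt=None, sample_rate=None):
    """Build audio_convert inputs using the handler's `format` key, not `output_format`."""
    inputs = {"in": in_path, "out": out}
    if fmt is not None:
        if fmt not in FORMAT_SETTINGS:
            raise ValueError(f"unknown format {fmt!r}; valid keys: {sorted(FORMAT_SETTINGS)}")
        inputs["format"] = fmt  # the key the handler reads
    if sample_rate is not None:
        inputs["sample_rate"] = sample_rate
    return inputs


print(build_convert_inputs("vo.wav", "vo.mp3", fmt="mp3_320"))
# {'in': 'vo.wav', 'out': 'vo.mp3', 'format': 'mp3_320'}
```

Omitting fmt leaves the choice to the out extension, which matches the fallback behaviour described in the mismatch notes.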


Audio Tail Fade tail_fade

| Field | Value |
|---|---|
| Category | audio_process |
| Mode | sync |
| Timeout | 30s |
| Cost | Free (cost_per_unit: 0) |
| Handler | execTailFade / TailFade (internal/ffmpeg/tail_fade.go) |
| MCP route | None; internal-only (REST POST /v1/jobs/tail_fade or pipeline step). No audio(...) action routes here. |

Description: Add a silence pad and a fade-out at the end of an audio file to prevent an abrupt ending (the "audio cuts off" bug). Intended to run after voiceover generation, before assembly. Purely parameter-driven — no prompt.

Parameters:

| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| in | string | yes | | Input audio path (workdir-relative). |
| out | string | yes | | Output audio path. |
| pad_sec | number | no | 0.8 | Seconds of trailing silence added (ffmpeg apad=pad_dur). |
| fade_sec | number | no | 0.6 | Fade-out duration (ffmpeg afade=t=out). |

Behaviour notes:

  • Defaults are applied when the value is <= 0, so passing 0 yields the default (0.8 / 0.6), not a true zero. Passing 0 therefore cannot disable the pad or the fade.
  • The fade start point is computed internally as input_duration + 0.1s — it is not a parameter.
  • Output encoded with -q:a 2 (VBR ~190 kbps mp3-class quality, format from out ext).
  • Returns outputs.audio / outputs.local_path plus metrics (input_duration_sec, output_duration_sec, pad_sec, fade_sec, fade_start_sec).
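The defaulting rule and the fade start point can be reproduced in a few lines. This is an illustrative sketch, not the Go handler: `effective_tail_fade` is a hypothetical name, and it models only the values stated in the notes above.

```python
DEFAULT_PAD_SEC = 0.8
DEFAULT_FADE_SEC = 0.6


def effective_tail_fade(input_duration_sec, pad_sec=None, fade_sec=None):
    """Resolve the values tail_fade would use, following the behaviour notes."""
    # Missing or <= 0 falls back to the default, so 0 cannot switch a step off.
    pad = pad_sec if pad_sec is not None and pad_sec > 0 else DEFAULT_PAD_SEC
    fade = fade_sec if fade_sec is not None and fade_sec > 0 else DEFAULT_FADE_SEC
    return {
        "pad_sec": pad,
        "fade_sec": fade,
        "fade_start_sec": input_duration_sec + 0.1,  # computed internally, not a parameter
    }


print(effective_tail_fade(12.0, pad_sec=0, fade_sec=0))
```

Here pad_sec=0 and fade_sec=0 resolve to 0.8 and 0.6, which is why this model cannot be used to skip the pad or the fade.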
