Actor models

Temporarily disabled

Actor models — LoRA training, actor-consistent generation, and voice cloning — are temporarily disabled while we rework them. The actor tool and the actor-dependent actions (image(actor_sheet), video(scene), actor_id routing) are not currently available. This page is kept for reference.

Persistent characters: LoRA training, actor-consistent generation, and voice cloning.

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.

Actor LoRA Training actor_lora_train

Fine-tunes a Flux LoRA on a ZIP of 4–30 reference images, producing a downloadable LoRA .safetensors URL for a persistent actor identity.

Call it via: MCP tool actor, action create (the create path submits training to actor_lora_train after registering the actor) · raw: POST /v1/jobs/actor_lora_train

Cost: 500 cr per call
Mode / timeout: webhook / 20m

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
| --- | --- | --- | --- | --- | --- |
| images_data_url | string | | | publicly-accessible URL or base64 data URI | URL to a ZIP archive of training images. Use at least 4 (more is better). The archive may also contain per-image caption .txt files and *_mask.jpg mask files sharing the image's name. |
| trigger_word | string | | (none) | | Trigger word used in captions. If omitted, no trigger word is used; if no captions are supplied, the trigger word is used in place of captions. |
| create_masks | boolean | | true | true / false | If true, segmentation masks weight the training loss (a face mask is used for people when possible). |
| steps | integer | | (unspecified) | | Number of training steps for the LoRA. |
| is_style | boolean | | false | true / false | If true, trains a style LoRA: deactivates segmentation and auto-captioning and uses the trigger word to specify the style. |
| is_input_format_already_preprocessed | boolean | | false | true / false | If false, expects raw input (image + matching caption file by name). Set true if the data is already in the preprocessed format. |
| data_archive_format | string | | (inferred from URL) | e.g. zip | Archive format. If unspecified, inferred from the URL. |

Our wrapper params (not part of the model schema): out (required — output filename for the resulting LoRA file) and mock (optional — returns a test placeholder instead of running real training). No format/size mapping applies to this model (it has no size field; our YAML format_field is empty).

Limits — the model states a practical minimum of ~4 images (more recommended); our wrapper documents a 4–30 reference-image range. No max resolution, file-size, or hard image-count cap is published for this endpoint, so none is asserted here.
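As a rough illustration of the input described above, the sketch below packs images and optional captions into a ZIP in memory, checks the 4–30 image range our wrapper documents, and assembles the documented parameters. The helper names (build_training_zip, training_params), the caption layout (a .txt file sharing the image's base name), and the flat parameter dict are assumptions for illustration; uploading the archive and the exact request body of POST /v1/jobs/actor_lora_train are not shown.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Optional

MIN_IMAGES, MAX_IMAGES = 4, 30  # range documented by our wrapper


def build_training_zip(images: dict, captions: Optional[dict] = None) -> bytes:
    """Pack training images (and optional captions) into a ZIP archive.

    `images` maps file names such as "a.jpg" to raw bytes. `captions` maps the
    same file names to caption text; each caption is stored as "<stem>.txt".
    Hypothetical helper, not part of the API.
    """
    if not MIN_IMAGES <= len(images) <= MAX_IMAGES:
        raise ValueError(f"expected {MIN_IMAGES}-{MAX_IMAGES} images, got {len(images)}")
    captions = captions or {}
    unknown = set(captions) - set(images)
    if unknown:
        raise ValueError(f"captions without a matching image: {sorted(unknown)}")

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in images.items():
            zf.writestr(name, data)
            if name in captions:
                # Caption file shares the image's base name.
                zf.writestr(f"{PurePosixPath(name).stem}.txt", captions[name])
    return buf.getvalue()


def training_params(zip_url: str, out: str,
                    trigger_word: Optional[str] = None,
                    mock: bool = False) -> dict:
    """Assemble documented parameters; the enclosing request body is assumed."""
    params = {"images_data_url": zip_url, "out": out}
    if trigger_word is not None:
        params["trigger_word"] = trigger_word
    if mock:
        params["mock"] = True  # free placeholder result instead of real training
    return params
```

The archive still has to be hosted at a publicly accessible URL (or encoded as a base64 data URI) before its location is passed as images_data_url.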

Flux LoRA Inference actor_lora_inference

FLUX.1 [dev] text-to-image generation with one or more custom LoRA adaptations — used internally to render an actor with its trained likeness LoRA.

Call it via: image(create, actor_id=…, prompt=…) (also used internally by image(actor_sheet), image(animate) first-frame, actor(batch), and video(scene); the Worker injects the actor's LoRA path + scale and prepends the trigger_word) · raw: POST /v1/jobs/actor_lora_inference

Cost: billed per megapixel; ≈7 cr per image at the ~1 MP presets
Mode / timeout: sync / 60s (from our YAML)

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
| --- | --- | --- | --- | --- | --- |
| prompt | string | | | | The prompt to generate an image from. |
| image_size | ImageSize or enum | | portrait_16_9 (our default; the model's own is landscape_4_3) | square_hd, square, portrait_4_3, portrait_16_9, landscape_4_3, landscape_16_9, or an object {width, height} for custom size | The size of the generated image. |
| num_inference_steps | integer | | 28 | | The number of inference steps to perform. |
| seed | integer | | | | Same seed + same prompt + same model version yields the same image. |
| loras | list<LoraWeight> | | | each item: {path (string, required), scale (float, default 1)} | The LoRAs to use; any number may be supplied and are merged. |
| guidance_scale | float | | 3.5 | | CFG scale: how closely the model sticks to the prompt. |
| sync_mode | boolean | | | | If true, media is returned as a data URI and not stored in request history. |
| num_images | integer | | 1 | | Number of images to generate (always 1 for streaming output). |
| enable_safety_checker | boolean | | true | | Enables the safety checker. |
| output_format | enum | | jpeg | jpeg, png | The format of the generated image. |
| acceleration | enum | | none | none, regular | Acceleration level; regular balances speed and quality. |

Our wrapper params (not part of the model schema): out (required; workdir-relative output filename), mock (optional; test placeholder). The image/video MCP tools accept the same friendly format names on the actor path as on the plain path and normalize them to this model's image_size enum before submitting: portrait/vertical/shorts/reels/9:16 → portrait_16_9 (default), landscape/horizontal/wide/16:9 → landscape_16_9, square/1:1 → square_hd, 3:4 → portrait_4_3, 4:3 → landscape_4_3. Matching is case-insensitive (the value is lowercased before lookup). Raw image_size enum values pass through unchanged; an unrecognised value falls back to the default rather than erroring.
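The normalization described above can be sketched roughly as follows. This is an illustration, not the tools' actual code: the names normalize_image_size, FRIENDLY_FORMATS, and IMAGE_SIZE_ENUM are hypothetical, and two details are assumptions the page does not settle. First, the friendly lookup runs before the raw-enum check, so "square" becomes square_hd even though square is also a raw enum value. Second, raw enum values are lowercased like everything else.

```python
from typing import Optional

DEFAULT_IMAGE_SIZE = "portrait_16_9"

# Raw enum values accepted by the model.
IMAGE_SIZE_ENUM = {
    "square_hd", "square", "portrait_4_3",
    "portrait_16_9", "landscape_4_3", "landscape_16_9",
}

# Friendly format names and the enum value each one maps to.
FRIENDLY_FORMATS = {
    "portrait": "portrait_16_9", "vertical": "portrait_16_9",
    "shorts": "portrait_16_9", "reels": "portrait_16_9", "9:16": "portrait_16_9",
    "landscape": "landscape_16_9", "horizontal": "landscape_16_9",
    "wide": "landscape_16_9", "16:9": "landscape_16_9",
    "square": "square_hd", "1:1": "square_hd",
    "3:4": "portrait_4_3",
    "4:3": "landscape_4_3",
}


def normalize_image_size(value: Optional[str]) -> str:
    """Map a friendly format name or raw enum value to an image_size enum."""
    if value is None:
        return DEFAULT_IMAGE_SIZE
    key = value.lower()  # matching is case-insensitive
    if key in FRIENDLY_FORMATS:  # assumption: friendly names win over raw enum
        return FRIENDLY_FORMATS[key]
    if key in IMAGE_SIZE_ENUM:  # raw enum values pass through unchanged
        return key
    return DEFAULT_IMAGE_SIZE  # unrecognised values fall back, no error
```

For example, normalize_image_size("Reels") and normalize_image_size("banana") both yield portrait_16_9, the first through the mapping and the second through the fallback.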

Limits: the image_size enum is fixed to the six named values above; custom sizes are passed as a {width, height} object (model default 512×512). num_images defaults to 1 and is forced to 1 for streaming output. No max-resolution / file-size / character limit is published for the model beyond these.
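The "Call it via" note for this model says the Worker injects the actor's LoRA path and scale and prepends the trigger_word. A minimal sketch of what that step could look like is shown here. The function name inject_actor_lora is hypothetical, and joining the trigger word to the prompt with a single space is an assumption; the page does not specify the separator or the Worker's actual implementation.

```python
from typing import Optional


def inject_actor_lora(model_input: dict, lora_path: str,
                      scale: float = 1.0,
                      trigger_word: Optional[str] = None) -> dict:
    """Return a copy of model_input with the actor's LoRA and trigger word added.

    The LoRA entry follows the documented item shape {path, scale}. Any LoRAs
    already present are kept, since the model merges all supplied LoRAs.
    """
    result = dict(model_input)
    existing = list(model_input.get("loras", []))
    result["loras"] = existing + [{"path": lora_path, "scale": scale}]
    if trigger_word:
        # Assumption: trigger word and prompt are separated by one space.
        result["prompt"] = f"{trigger_word} {model_input['prompt']}"
    return result
```

Returning a copy leaves the caller's input untouched, which keeps the user's original prompt available for logging or retries.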

Actor Voice Clone (IVC) actor_voice_clone

Instant Voice Cloning from one or more audio samples; returns an ElevenLabs voice_id that can be stored on an actor and reused for text-to-speech.

Call it via: actor tool, create or update action with voice_sample_url set (the worker submits the clone job automatically and stores the returned voice_id on the actor) · raw: POST /v1/jobs/actor_voice_clone

Cost: 1 cr per call
Mode / timeout: sync / 120s

Parameters — the model's input schema (POST /v1/voices/add, multipart form):

| Param | Type | Required | Default | Allowed / range | Description |
| --- | --- | --- | --- | --- | --- |
| name | string | | | | The name that identifies this voice (shown in the voice dropdown). |
| files | file[] (multipart) | | | audio recordings | A list of audio recordings intended for voice cloning. |
| remove_background_noise | boolean | | false | true / false | If set, removes background noise from samples via the audio-isolation model. If samples have no background noise, this can reduce quality. |
| description | string or null | | null | | A description of the voice. |
| labels | map<string,string>, string, or null | | null | keys: language, accent, gender, age | Labels for the voice (free-form metadata). |

Our wrapper params (not part of the model schema): out (required — output filename for the job result), mock (optional — test placeholder, skips real generation). Our YAML exposes the model's files field as sample_urls (an array of public audio URLs); the Go adapter downloads each URL and submits it as a multipart files entry. This model has no format→size mapping (format_field is empty).

Limits: the request is a multipart form accepting multiple audio files. The endpoint reference publishes no hard maximum file count, per-file size, or list of supported codecs, so none is asserted here; documented guidance is to provide clean samples totaling ~1–3 minutes. labels keys are restricted to language, accent, gender, or age.
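To make the wrapper's shape concrete, the sketch below assembles parameters for a voice-clone job using the documented names (name, sample_urls, out, labels, remove_background_noise, mock) and rejects label keys outside the four listed above. The helper name voice_clone_params and the flat dict layout are assumptions for illustration; the actual request body of POST /v1/jobs/actor_voice_clone is not specified on this page.

```python
from typing import Optional

ALLOWED_LABEL_KEYS = {"language", "accent", "gender", "age"}


def voice_clone_params(name: str, sample_urls: list, out: str,
                       labels: Optional[dict] = None,
                       remove_background_noise: bool = False,
                       mock: bool = False) -> dict:
    """Assemble voice-clone parameters; hypothetical helper, not part of the API."""
    if not sample_urls:
        raise ValueError("at least one audio sample URL is needed")
    bad_keys = set(labels or {}) - ALLOWED_LABEL_KEYS
    if bad_keys:
        raise ValueError(f"unsupported label keys: {sorted(bad_keys)}")

    params = {
        "name": name,
        "sample_urls": list(sample_urls),  # wrapper's name for the model's files
        "out": out,
        "remove_background_noise": remove_background_noise,
    }
    if labels:
        params["labels"] = dict(labels)
    if mock:
        params["mock"] = True  # test placeholder, skips real generation
    return params
```

Leaving remove_background_noise at false is the safer default for clean recordings, since the parameter description warns that noise removal can reduce quality when there is no background noise to remove.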
