Actor models
Temporarily disabled
Actor models — LoRA training, actor-consistent generation, and voice cloning — are temporarily disabled while we rework them. The actor tool and the actor-dependent actions (image(actor_sheet), video(scene), actor_id routing) are not currently available. This page is kept for reference.
Persistent characters: LoRA training, actor-consistent generation, and voice cloning.
Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.
Actor LoRA Training actor_lora_train
Fine-tunes a Flux LoRA on a ZIP of 4–30 reference images, producing a downloadable LoRA .safetensors URL for a persistent actor identity.
Call it via — MCP tool actor action create (the create path submits training to actor_lora_train after registering the actor) · raw: POST /v1/jobs/actor_lora_train
| Cost | 500 cr per call |
| Mode / timeout | webhook / 20m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| images_data_url | string | ✓ | — | publicly-accessible URL or base64 data URI | URL to a ZIP archive of training images. Use at least 4 (more is better). The archive may also contain per-image caption .txt files and *_mask.jpg mask files sharing the image's name. |
| trigger_word | string | — | — (none) | — | Trigger word used in captions. If omitted, no trigger word is used; if no captions are supplied, the trigger word is used in place of captions. |
| create_masks | boolean | — | true | true / false | If true, segmentation masks weight the training loss (a face mask is used for people when possible). |
| steps | integer | — | — (unspecified) | — | Number of training steps for the LoRA. |
| is_style | boolean | — | false | true / false | If true, trains a style LoRA: deactivates segmentation and auto-captioning and uses the trigger word to specify the style. |
| is_input_format_already_preprocessed | boolean | — | false | true / false | If false, expects raw input (image + matching caption file by name). Set true if the data is already in the preprocessed format. |
| data_archive_format | string | — | — (inferred from URL) | e.g. zip | Archive format. If unspecified, inferred from the URL. |
Our wrapper params (not part of the model schema): out (required — output filename for the resulting LoRA file) and mock (optional — returns a test placeholder instead of running real training). No format/size mapping applies to this model (it has no size field; our YAML format_field is empty).
Limits — the model states a practical minimum of ~4 images (more recommended); our wrapper documents a 4–30 reference-image range. No max resolution, file-size, or hard image-count cap is published for this endpoint, so none is asserted here.
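As a rough illustration of the parameters above, the following sketch assembles a request body for the raw POST /v1/jobs/actor_lora_train route. It assumes the job accepts a flat JSON object that mixes the model's inputs with the wrapper params out and mock; that body shape, the helper name build_lora_train_job, and the example URL are hypothetical. The sketch only builds the payload and sends nothing, which also suits the period while the endpoint is disabled.

```python
import json


def build_lora_train_job(images_data_url, out, trigger_word=None,
                         steps=None, is_style=False, mock=False):
    """Build a body for POST /v1/jobs/actor_lora_train.

    Assumption: a flat JSON object holding both model inputs and the
    wrapper params (out, mock). The real body shape is not documented here.
    """
    if not images_data_url:
        raise ValueError("images_data_url is required")
    if not out:
        raise ValueError("out is required (output filename for the LoRA file)")

    body = {"images_data_url": images_data_url, "out": out}
    # Optional fields are left out entirely so the model defaults apply.
    if trigger_word is not None:
        body["trigger_word"] = trigger_word
    if steps is not None:
        body["steps"] = steps
    if is_style:
        body["is_style"] = True
    if mock:
        body["mock"] = True  # free placeholder result, no real training
    return body


job = build_lora_train_job(
    "https://example.com/actor_refs.zip",  # hypothetical ZIP of reference images
    out="actor.safetensors",
    trigger_word="mychar",
    mock=True,
)
print(json.dumps(job, indent=2))
```

Omitting unset optional fields, rather than sending nulls, keeps the request close to the schema's own defaults (for example create_masks stays true unless overridden).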
Flux LoRA Inference actor_lora_inference
FLUX.1 [dev] text-to-image generation with one or more custom LoRA adaptations — used internally to render an actor with its trained likeness LoRA.
Call it via — image(create, actor_id=…, prompt=…) (also used internally by image(actor_sheet), image(animate) first-frame, actor(batch), and video(scene); the Worker injects the actor's LoRA path + scale and prepends the trigger_word) · raw: POST /v1/jobs/actor_lora_inference
| Cost | Billed per megapixel — ≈7 cr per image at the ~1 MP presets |
| Mode / timeout | sync / 60s (from our YAML) |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | ✓ | — | — | The prompt to generate an image from. |
| image_size | ImageSize \| enum | — | portrait_16_9 (our default; the model's own is landscape_4_3) | square_hd, square, portrait_4_3, portrait_16_9, landscape_4_3, landscape_16_9, or an object {width, height} for custom size | The size of the generated image. |
| num_inference_steps | integer | — | 28 | — | The number of inference steps to perform. |
| seed | integer | — | — | — | Same seed + same prompt + same model version yields the same image. |
| loras | list<LoraWeight> | — | — | each item: {path (string, required), scale (float, default 1)} | The LoRAs to use; any number may be supplied and are merged. |
| guidance_scale | float | — | 3.5 | — | CFG scale: how closely the model sticks to the prompt. |
| sync_mode | boolean | — | — | — | If true, media is returned as a data URI and not stored in request history. |
| num_images | integer | — | 1 | — | Number of images to generate (always 1 for streaming output). |
| enable_safety_checker | boolean | — | true | — | Enables the safety checker. |
| output_format | enum | — | jpeg | jpeg, png | The format of the generated image. |
| acceleration | enum | — | none | none, regular | Acceleration level; regular balances speed and quality. |
Our wrapper params (not part of the model schema): out (required — workdir-relative output filename), mock (optional — test placeholder). The image/video MCP tools accept the same friendly format names on the actor path as on the plain path and normalize them to this model's image_size enum before submitting: portrait/vertical/shorts/reels/9:16 → portrait_16_9 (default), landscape/horizontal/wide/16:9 → landscape_16_9, square/1:1 → square_hd, 3:4 → portrait_4_3, 4:3 → landscape_4_3. Matching is case-insensitive (the value is lowercased before lookup). (Raw image_size enum values pass through unchanged; an unrecognised value falls back to the default rather than erroring.)
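The format normalization described above can be sketched as a small lookup. This is an illustration, not the actual tool code: the function and constant names are hypothetical, and it handles string values only (custom {width, height} objects are out of scope). One point the text leaves open is that square is both a friendly name (mapped to square_hd) and a raw enum value; this sketch checks friendly names first, which is an assumption.

```python
DEFAULT_IMAGE_SIZE = "portrait_16_9"

# Raw enum values of the model's image_size field; these pass through unchanged.
RAW_IMAGE_SIZES = {
    "square_hd", "square", "portrait_4_3",
    "portrait_16_9", "landscape_4_3", "landscape_16_9",
}

# Friendly format names accepted by the image/video tools.
FORMAT_ALIASES = {
    "portrait": "portrait_16_9", "vertical": "portrait_16_9",
    "shorts": "portrait_16_9", "reels": "portrait_16_9",
    "9:16": "portrait_16_9",
    "landscape": "landscape_16_9", "horizontal": "landscape_16_9",
    "wide": "landscape_16_9", "16:9": "landscape_16_9",
    "square": "square_hd", "1:1": "square_hd",
    "3:4": "portrait_4_3",
    "4:3": "landscape_4_3",
}


def normalize_image_size(value):
    """Map a friendly format name or raw enum value to an image_size enum."""
    if value is None:
        return DEFAULT_IMAGE_SIZE
    key = str(value).lower()  # matching is case-insensitive
    if key in FORMAT_ALIASES:  # assumption: friendly names win over raw values
        return FORMAT_ALIASES[key]
    if key in RAW_IMAGE_SIZES:
        return key
    # Unrecognised values fall back to the default rather than erroring.
    return DEFAULT_IMAGE_SIZE


for name in ["Reels", "16:9", "landscape_4_3", "banana"]:
    print(name, "->", normalize_image_size(name))
```

Falling back to the default on unknown input means a typo silently yields a portrait image, so callers that care about aspect ratio may want to validate the name before submitting.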
Limits — image_size enum is fixed to the six named values above; custom sizes are passed as a {width, height} object (model default 512×512). num_images defaults to 1 and is forced to 1 for streaming output. No max-resolution / file-size / character limit is published for the model beyond these.
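To make the actor path concrete, here is a minimal sketch of how an input for this model might be assembled from a stored actor: the actor's LoRA is injected as a loras entry and the trigger word is prepended to the prompt. The actor record fields (lora_url, lora_scale, trigger_word), the helper name, and the use of a single space between trigger word and prompt are all assumptions made for illustration; the real Worker logic is not shown on this page.

```python
def build_actor_inference_input(prompt, actor, image_size="portrait_16_9"):
    """Build a model input that renders `prompt` with the actor's LoRA.

    `actor` is a hypothetical record such as:
      {"trigger_word": "mychar", "lora_url": "https://...", "lora_scale": 0.9}
    """
    trigger = actor.get("trigger_word")
    # Assumption: the trigger word is joined to the prompt with one space.
    full_prompt = f"{trigger} {prompt}" if trigger else prompt

    return {
        "prompt": full_prompt,
        "image_size": image_size,
        # Each LoraWeight item needs a path; scale defaults to 1 in the schema.
        "loras": [{
            "path": actor["lora_url"],
            "scale": actor.get("lora_scale", 1.0),
        }],
        "num_images": 1,
    }


example_actor = {
    "trigger_word": "mychar",
    "lora_url": "https://example.com/actor.safetensors",  # hypothetical URL
    "lora_scale": 0.9,
}
print(build_actor_inference_input("walking through a night market", example_actor))
```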
Actor Voice Clone (IVC) actor_voice_clone
Instant Voice Cloning from one or more audio samples; returns an ElevenLabs voice_id that can be stored on an actor and reused for text-to-speech.
Call it via — actor tool, create or update action with voice_sample_url set (the worker submits the clone job automatically and stores the returned voice_id on the actor) · raw: POST /v1/jobs/actor_voice_clone
| Cost | 1 cr per call |
| Mode / timeout | sync / 120s |
Parameters — the model's input schema (POST /v1/voices/add, multipart form):
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| name | string | ✓ | — | — | The name that identifies this voice (shown in the voice dropdown). |
| files | file[] (multipart) | ✓ | — | audio recordings | A list of audio recordings intended for voice cloning. |
| remove_background_noise | boolean | — | false | true / false | If set, removes background noise from samples via the audio-isolation model. If samples have no background noise, this can reduce quality. |
| description | string \| null | — | null | — | A description of the voice. |
| labels | map<string,string> \| string \| null | — | null | keys: language, accent, gender, age | Labels for the voice (free-form metadata). |
Our wrapper params (not part of the model schema): out (required — output filename for the job result), mock (optional — test placeholder, skips real generation). Our YAML exposes the model's files field as sample_urls (an array of public audio URLs); the Go adapter downloads each URL and submits it as a multipart files entry. This model has no format→size mapping (format_field is empty).
Limits — The request is a multipart form accepting multiple audio files. The endpoint reference publishes no hard maximum file count, per-file size, or list of supported codecs, so none is asserted here; the documented guidance is to provide clean samples totaling ~1–3 minutes. labels keys are restricted to language, accent, gender, or age.
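A request body for the raw POST /v1/jobs/actor_voice_clone route might be assembled as in the sketch below, which also checks labels against the four allowed keys before anything is submitted. It assumes a flat JSON body in which the other model fields keep their names alongside sample_urls, out, and mock; that shape, the helper name, and the example URL are hypothetical. Nothing is sent over the network.

```python
ALLOWED_LABEL_KEYS = {"language", "accent", "gender", "age"}


def build_voice_clone_job(name, sample_urls, out, labels=None,
                          remove_background_noise=False, mock=False):
    """Build a body for POST /v1/jobs/actor_voice_clone.

    Assumption: a flat JSON object; sample_urls is the wrapper's name for
    the model's multipart `files` field (the adapter downloads each URL).
    """
    if not name:
        raise ValueError("name is required")
    if not sample_urls:
        raise ValueError("sample_urls needs at least one audio URL")
    if not out:
        raise ValueError("out is required")

    body = {"name": name, "sample_urls": list(sample_urls), "out": out}

    if labels:
        unknown = set(labels) - ALLOWED_LABEL_KEYS
        if unknown:
            raise ValueError(f"unsupported label keys: {sorted(unknown)}")
        body["labels"] = dict(labels)
    if remove_background_noise:
        # Can reduce quality when the samples are already clean.
        body["remove_background_noise"] = True
    if mock:
        body["mock"] = True
    return body


clone_job = build_voice_clone_job(
    "Narrator",
    ["https://example.com/sample1.mp3"],  # hypothetical public audio URL
    out="voice.json",
    labels={"language": "en", "accent": "british"},
    mock=True,
)
print(clone_job)
```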