Video models
Video generation, image-to-video, editing, swap, and upscaling — model input schemas.
Generations are charged in credits (see Credits & plans). Every generation model also accepts
mock: true for a free placeholder result.
Seedance 2.0 Reference-to-Video seedance_r2v
ByteDance's reference-to-video model that generates a clip from a text prompt plus up to 9 reference images, 3 videos, and 3 audio clips for identity, motion, and voice consistency. Output up to native 4K.
Call it via — video tool, action: "create" (text→video; optional reference_images, video_urls, audio_urls) · raw: POST /v1/jobs/seedance_r2v
| Cost | 303 cr per call (5 s at the default 720p). Scales with resolution: 480p ≈ 135 cr, 1080p 681 cr, 4K 1555 cr per 5 s |
| Mode / timeout | webhook / 15m |
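To make the resolution scaling concrete, the per-5-second prices in the cost row can be expressed as multipliers of the 720p default. This is a minimal sketch that uses only the figures listed above; the names `CREDITS_PER_5S` and `resolution_multiplier` are illustrative, and because the cost row does not say how price scales with duration, the sketch does not model it.

```python
# Credits per 5 s clip, copied from the cost row above.
CREDITS_PER_5S = {"480p": 135, "720p": 303, "1080p": 681, "4k": 1555}


def resolution_multiplier(resolution: str) -> float:
    """Cost of `resolution` relative to the 720p default (illustrative helper)."""
    return CREDITS_PER_5S[resolution] / CREDITS_PER_5S["720p"]


for name, credits in CREDITS_PER_5S.items():
    print(f"{name}: {credits} cr per 5 s, {resolution_multiplier(name):.2f}x of 720p")
```

The computed ratios line up with the approximate multipliers quoted for the resolution parameter.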
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | ✓ | — | — | Text prompt used to generate the video. Refer to references as @Image1, @Video1, @Audio1. |
| image_urls | list<string> | — | — | up to 9; JPEG/PNG/WebP; ≤30 MB each | Reference images. Refer to them as @Image1, @Image2… Total files across all modalities ≤ 12. |
| video_urls | list<string> | — | — | up to 3; MP4/MOV; combined 2–15 s; total <50 MB; each ~480p (640×640) to ~720p (834×1112) | Reference videos. Refer to them as @Video1, @Video2… |
| audio_urls | list<string> | — | — | up to 3; MP3/WAV; combined ≤15 s; ≤15 MB each | Reference audio. Refer to them as @Audio1… If audio is provided, at least one reference image or video is required. |
| resolution | enum | — | 720p | 480p, 720p, 1080p, 4k | 480p for cheap drafts (~0.45× credits), 720p default, 1080p for final delivery (2.25×), 4k for hero shots (~5.1×). |
| duration | enum | — | auto | auto, 4–15 | Duration in seconds, or auto to let the model decide. |
| aspect_ratio | enum | — | auto | auto, 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 | Aspect ratio of the generated video. When omitted, our wrapper applies its vertical preset (9:16); pass auto explicitly to follow the reference images' geometry. |
| generate_audio | boolean | — | true | — | Generate synchronized audio (SFX, ambient, lip-synced speech). Cost is the same either way. |
| bitrate_mode | enum | — | standard | standard, high | Output bitrate mode; high requests a higher-quality, larger-file encode. |
| end_user_id | string | — | — | — | Unique ID of the end user. |
Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder), and format (optional — size preset shorts/reels/horizontal, mapped by our format_field/format_mapping to the model's aspect_ratio: shorts/reels→9:16, horizontal→16:9, default 9:16).
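The format mapping described above can be sketched as a small lookup. This is an illustration, not the wrapper's actual code: the helper name `format_to_aspect_ratio` is hypothetical, and raising `ValueError` for an unknown preset is an assumption (the wrapper's real behaviour for unknown values is not documented here).

```python
from typing import Optional

# Preset -> aspect_ratio pairs as listed in the wrapper notes above.
FORMAT_MAPPING = {"shorts": "9:16", "reels": "9:16", "horizontal": "16:9"}
DEFAULT_ASPECT_RATIO = "9:16"  # applied when no format is given


def format_to_aspect_ratio(fmt: Optional[str] = None) -> str:
    """Hypothetical helper: resolve a wrapper `format` preset to aspect_ratio."""
    if fmt is None:
        return DEFAULT_ASPECT_RATIO
    try:
        return FORMAT_MAPPING[fmt]
    except KeyError:
        # Assumption: unknown presets are rejected; the real wrapper may differ.
        raise ValueError(f"unknown format preset: {fmt!r}") from None
```

For example, `format_to_aspect_ratio("horizontal")` yields `"16:9"`, while calling it with no argument falls back to the vertical default.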
Limits — prompt: text only. image_urls: max 9 images, JPEG/PNG/WebP, ≤30 MB each. video_urls: max 3 videos, MP4/MOV, combined 2–15 s, total <50 MB, each between ~480p (640×640) and ~720p (834×1112). audio_urls: max 3 files, MP3/WAV, combined ≤15 s, ≤15 MB each; requires at least one image or video reference. Total reference files across all modalities ≤ 12. Output resolution up to native 4K; duration 4–15 s (or auto). No seed input — every render is a new take.
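Because the per-modality caps (9 + 3 + 3) add up to more than the overall cap of 12, it can help to check reference counts before submitting. The following is a minimal client-side sketch of the count rules only; the function name is hypothetical, and file size, format, and duration limits are not checked.

```python
from typing import List, Sequence

# Per-modality and overall caps from the Limits paragraph above.
MAX_IMAGES, MAX_VIDEOS, MAX_AUDIO, MAX_TOTAL = 9, 3, 3, 12


def seedance_reference_problems(
    image_urls: Sequence[str] = (),
    video_urls: Sequence[str] = (),
    audio_urls: Sequence[str] = (),
) -> List[str]:
    """Return human-readable violations; an empty list means the counts pass."""
    problems = []
    if len(image_urls) > MAX_IMAGES:
        problems.append(f"too many images: {len(image_urls)} > {MAX_IMAGES}")
    if len(video_urls) > MAX_VIDEOS:
        problems.append(f"too many videos: {len(video_urls)} > {MAX_VIDEOS}")
    if len(audio_urls) > MAX_AUDIO:
        problems.append(f"too many audio clips: {len(audio_urls)} > {MAX_AUDIO}")
    total = len(image_urls) + len(video_urls) + len(audio_urls)
    if total > MAX_TOTAL:
        problems.append(f"too many reference files overall: {total} > {MAX_TOTAL}")
    if audio_urls and not (image_urls or video_urls):
        problems.append("audio requires at least one image or video reference")
    return problems
```

Nine images, three videos, and one audio clip pass every per-modality cap but still fail the overall limit, which is the case this check is meant to catch.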
Kling v3 Standard Image-to-Video kling_v3_std_i2v
Image-to-video at standard quality with cinematic visuals, fluid motion, native audio generation, and custom element support — use for quick drafts and iterations before pro renders.
Call it via — image tool, action: "animate", tier: "standard" (the default animate tier) · raw: POST /v1/jobs/kling_v3_std_i2v
The image(animate) tool exposes the multi-shot timeline directly: pass multi_prompt (an array of {prompt, duration} shots) and optional shot_type instead of a single prompt. The tool validates Kling's caps before submitting — at most 6 shots and a combined duration ≤ 15 s (each shot 1–15 s, default 5) — and rejects prompt + multi_prompt together.
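The caps listed above can be mirrored client-side before a job is submitted. This is a minimal sketch of those rules, not the tool's own validator; the function name `validate_multi_prompt` is hypothetical, and shot durations are treated as strings because the input schema declares them that way.

```python
from typing import Dict, List, Optional

MAX_SHOTS = 6            # at most 6 shots
MAX_TOTAL_SECONDS = 15   # combined duration cap
DEFAULT_SHOT_SECONDS = "5"


def validate_multi_prompt(
    multi_prompt: List[Dict[str, str]], prompt: Optional[str] = None
) -> int:
    """Return the combined duration in seconds, or raise ValueError."""
    if prompt is not None:
        raise ValueError("prompt and multi_prompt cannot be passed together")
    if not 1 <= len(multi_prompt) <= MAX_SHOTS:
        raise ValueError(f"expected 1-{MAX_SHOTS} shots, got {len(multi_prompt)}")
    total = 0
    for index, shot in enumerate(multi_prompt, start=1):
        if not shot.get("prompt"):
            raise ValueError(f"shot {index} is missing a prompt")
        seconds = int(shot.get("duration", DEFAULT_SHOT_SECONDS))
        if not 1 <= seconds <= 15:
            raise ValueError(f"shot {index} duration must be 1-15 s")
        total += seconds
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"combined duration {total} s exceeds {MAX_TOTAL_SECONDS} s")
    return total
```

Note that four shots left at the default 5 s already exceed the 15 s cap, so explicit per-shot durations are usually needed for longer timelines.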
| Cost | 84 cr per call |
| Mode / timeout | webhook / 15m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| start_image_url | string | ✓ | — | — | URL of the image used as the starting frame of the video. |
| prompt | string | — | — | maxLength 2500 | Text prompt for video generation. Either prompt or multi_prompt must be provided, but not both. |
| multi_prompt | array<object> | — | — | items: { prompt: string (required), duration: string default "5", enum "1"–"15" } | List of prompts for multi-shot generation; divides the video into multiple shots. |
| duration | string | — | "5" | "3","4","5","6","7","8","9","10","11","12","13","14","15" | Duration of the generated video in seconds. |
| generate_audio | boolean | — | true | — | Generate native audio for the video. Supports Chinese/English; other languages auto-translated to English. |
| end_image_url | string | — | — | — | URL of the image used as the end frame of the video. |
| elements | array<object> | — | — | items: { frontal_image_url, reference_image_urls (1–3, ≥1 required), video_url, voice_id } | Characters/objects to inject. Each entry is either an image set (frontal + reference images) or a video. Reference in prompt as @Element1, @Element2, etc. Only one element may carry a video. |
| shot_type | string | — | "customize" | customize, intelligent | Multi-shot generation type; intelligent lets the model auto-determine shot structure. |
| negative_prompt | string | — | "blur, distort, and low quality" | maxLength 2500 | What to steer away from. |
| cfg_scale | number | — | 0.5 | 0–1 | Classifier-Free Guidance scale: how strictly the model follows the prompt. |
Our wrapper params (not part of the model input schema): out (required — output filename) and mock (optional — test placeholder). format is accepted by our image MCP tool but is NOT forwarded to this model (the model has no size/aspect field; YAML format_field is empty), so it has no effect here.
Limits (model limits):
- Prompt / negative_prompt: max 2500 characters each.
- Duration: 3–15 s (top-level); multi-shot element duration 1–15 s.
- start_image_url / end_image_url / element images: max file size 10 MB, min 300×300 px, aspect ratio 0.40–2.50; accepted formats jpg, jpeg, png, webp, gif, avif.
- Element video_url: max 200 MB, 720–2160 px per side, 3–10.05 s, 24–60 FPS; accepted formats mp4, mov, webm, m4v, gif.
- Element reference_image_urls: 1–3 images, at least one required.
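The image limits above lend themselves to a quick pre-flight check. The sketch below is illustrative only: the helper name is hypothetical, "10 MB" is read as 10 × 1024 × 1024 bytes, the aspect ratio is taken as width divided by height, and the format is guessed from the file extension. None of those three conventions is confirmed by the limits list.

```python
from typing import List

IMAGE_FORMATS = {"jpg", "jpeg", "png", "webp", "gif", "avif"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MIN_SIDE_PX = 300
MIN_ASPECT, MAX_ASPECT = 0.40, 2.50  # assumption: width / height


def kling_image_problems(
    filename: str, size_bytes: int, width: int, height: int
) -> List[str]:
    """Return the limits an image violates; an empty list means it passes."""
    problems = []
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension not in IMAGE_FORMATS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file larger than 10 MB")
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        problems.append("smaller than 300x300 px")
    if not MIN_ASPECT <= width / height <= MAX_ASPECT:
        problems.append("aspect ratio outside 0.40-2.50")
    return problems
```

A 1080×1920 portrait frame has a ratio of about 0.56 and passes, whereas a 200×1000 strip fails both the minimum-size and aspect-ratio checks.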
Kling v3 Pro Image-to-Video kling_v3_pro_i2v
Top-tier image-to-video with cinematic visuals, fluid motion, native audio generation, and custom element (character/object) injection.
Call it via — MCP tool image, action animate with tier: "pro" (routes animate_pro → kling_v3_pro_i2v) · raw: POST /v1/jobs/kling_v3_pro_i2v
The image(animate) tool exposes the multi-shot timeline directly: pass multi_prompt (an array of {prompt, duration} shots) and optional shot_type instead of a single prompt. The tool validates Kling's caps before submitting — at most 6 shots and a combined duration ≤ 15 s (each shot 1–15 s, default 5) — and rejects prompt + multi_prompt together. Billed per second (no per-shot surcharge).
| Cost | 112 cr per call |
| Mode / timeout | webhook / 15m (from our YAML) |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| start_image_url | string | ✓ | — | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | URL of the start frame image. Aspect ratio of the output is inferred from this image. |
| prompt | string | — | — | maxLength 2500 | Text prompt. Either prompt or multi_prompt must be provided, but not both. |
| multi_prompt | KlingV3MultiPromptElement[] | — | — | array of {prompt (req), duration} | Multi-shot prompt list; divides the video into shots. Overrides prompt. Each shot duration enum "1"–"15", default "5". |
| duration | string (enum) | — | "5" | "3","4","5","6","7","8","9","10","11","12","13","14","15" | Total video length in seconds. |
| generate_audio | boolean | — | true | — | Generate native audio (Chinese/English native; other languages auto-translated to English). |
| end_image_url | string \| null | — | — | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | Optional end frame image URL (start-to-end interpolation). |
| elements | KlingV3ComboElementInput[] \| null | — | — | array | Reference characters/objects to inject. Each item is an image set (frontal_image_url + reference_image_urls) or a video (video_url), with optional voice_id. Reference in prompt as @Element1, @Element2. |
| shot_type | string (enum) | — | "customize" | customize, intelligent | Multi-shot generation type; intelligent lets the model auto-plan shot structure. |
| negative_prompt | string | — | "blur, distort, and low quality" | maxLength 2500 | Things to avoid. |
| cfg_scale | number | — | 0.5 | 0–1 | Classifier-free guidance scale; higher = stricter prompt adherence. |
elements[] sub-fields: frontal_image_url (string, main view), reference_image_urls (string[], 1–3 images from different angles, at least one required when using image elements), video_url (string, max one video element per request), voice_id (string; voice binding supported only for video elements, not image elements).
Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). We do not map a format field — there is no model size/aspect_ratio parameter; aspect ratio is inferred from start_image_url (format_field: "").
Limits — model limits:
- Video duration: 3–15 seconds (single-prompt duration); per-shot multi_prompt duration 1–15 s; shot durations sum to total length.
- prompt / negative_prompt: max 2500 characters each.
- start_image_url / end_image_url / element images: max 10 MB; min 300×300 px; aspect ratio 0.40–2.50; formats jpg, jpeg, png, webp, gif, avif.
- Element reference_image_urls: 1–3 images.
- Element video_url: max 200 MB; 720–2160 px; 3.0–10.05 s; 24–60 fps; formats mp4, mov, webm, m4v, gif; max one video element per request.
- Audio: native Chinese and English; other languages auto-translated to English.
- Cost: ≈22 cr/s (audio off, the catalog default), ≈34 cr/s (audio on).
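The per-second rates in the last bullet give a quick way to estimate a render before submitting it. This is a rough sketch using the rounded rates quoted above, so results can differ slightly from the billed amount (5 s with audio off comes to 110 cr here, against the listed 112 cr per call); the function name is hypothetical.

```python
# Approximate per-second rates from the Limits list above.
RATE_AUDIO_OFF = 22  # cr/s
RATE_AUDIO_ON = 34   # cr/s


def estimate_kling_pro_credits(seconds: int, generate_audio: bool) -> int:
    """Rough credit estimate for a clip of `seconds` length (3-15 s)."""
    if not 3 <= seconds <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    rate = RATE_AUDIO_ON if generate_audio else RATE_AUDIO_OFF
    return seconds * rate


print(estimate_kling_pro_credits(5, generate_audio=False))
print(estimate_kling_pro_credits(10, generate_audio=True))
```

Since the model schema defaults generate_audio to true, it may be worth passing the flag explicitly when estimating, so the estimate matches what is actually requested.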
Kling O3 Video Edit kling_o3_video_edit
Video-to-video editing with Kling O3 — restyle footage, replace characters/objects, or insert elements into a source video using reference images and structured element definitions.
Call it via — video tool, action edit_ref (video(edit_ref) — requires video_url, prompt, reference_images) · raw: POST /v1/jobs/kling_o3_video_edit
| Cost | 126 cr per call |
| Mode / timeout | webhook / 15m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | ✓ | — | maxLength 2500 | Text prompt for the edit. Reference the source video as @Video1, elements as @Element1–@ElementN, and reference images as @Image1–@ImageN. |
| video_url | string | ✓ | — | .mp4/.mov only; 720–2160px; 3.0–10.05s; 24–60 FPS; ≤200MB | Reference (source) video URL to edit. |
| image_urls | string[] \| null | — | null | each image ≤10MB, ≥300×300px, aspect 0.40–2.50 | Reference images for style/appearance, cited in prompt as @Image1, @Image2, … Max 4 total (elements + reference images) when using video. |
| keep_audio | boolean | — | true | true / false | Keep the original audio from the source video. |
| elements | object[] \| null | — | null | array of { frontal_image_url: string, reference_image_urls: string[] (1–3) } | Elements (characters/objects) to inject, cited in prompt as @Element1, @Element2. Each element needs a frontal image and 1–3 reference images (per-image limits same as image_urls). |
| shot_type | string | — | customize | const customize | Multi-shot generation type (only customize is accepted). |
Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder). This model has no format mapping (no model size field). Our video(edit_ref) action collects reference photos under reference_images and maps them to the model's image_urls field; the optional elements argument passes through to the model's elements input (cite as @Element1).
Limits — prompt ≤2500 chars · source video .mp4/.mov, 3.0–10.05s, 720–2160px, 24–60 FPS, ≤200MB · reference/element images ≤10MB each, min 300×300px, aspect ratio 0.40–2.50 · max 4 total (elements + reference images) when using video.
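A small pre-flight check can catch the two easiest mistakes with this model: exceeding the combined reference budget, and citing an @Image or @Element index that was never supplied. The sketch below is illustrative; the function name is hypothetical, and it assumes the "max 4 total" limit counts each element once plus each reference image once, which the limits line does not spell out.

```python
import re
from typing import Any, List, Mapping, Sequence

MAX_PROMPT_CHARS = 2500
MAX_REFERENCES = 4  # assumption: len(elements) + len(image_urls)


def edit_request_problems(
    prompt: str,
    image_urls: Sequence[str] = (),
    elements: Sequence[Mapping[str, Any]] = (),
) -> List[str]:
    """Return rule violations for an edit request; empty list means OK."""
    problems = []
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append("prompt longer than 2500 characters")
    if len(image_urls) + len(elements) > MAX_REFERENCES:
        problems.append("more than 4 elements + reference images")
    for index, element in enumerate(elements, start=1):
        if not element.get("frontal_image_url"):
            problems.append(f"element {index} has no frontal_image_url")
        if not 1 <= len(element.get("reference_image_urls", [])) <= 3:
            problems.append(f"element {index} needs 1-3 reference images")
    # Every @ImageN / @ElementN cited in the prompt must point at a supplied item.
    for kind, count in (("Image", len(image_urls)), ("Element", len(elements))):
        for match in re.finditer(rf"@{kind}(\d+)", prompt):
            if not 1 <= int(match.group(1)) <= count:
                problems.append(f"prompt cites @{kind}{match.group(1)}, only {count} supplied")
    return problems
```

With one reference image and one element, a prompt such as "Replace the man in @Video1 with @Element1, styled like @Image1" passes, while a prompt citing @Image2 is flagged.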
PixVerse Swap pixverse_swap
Generate high-quality video clips by swapping a person, object, or background in source footage using a reference image — keyframe-based, prompt-free.
Call it via — video tool, action swap (routes to pixverse_swap) · raw: POST /v1/jobs/pixverse_swap
| Cost | 30 cr per call |
| Mode / timeout | webhook / 15m (from our YAML) |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | ✓ | — | URL | URL of the external video to swap. |
| image_url | string | ✓ | — | URL | URL of the target image for swapping (the element to swap IN). |
| mode | string | — | person | person, object, background | The swap mode to use. |
| keyframe_id | integer | — | 1 | min 1, max = duration_seconds × 24 | Keyframe ID for face/object mapping. Input video is normalized to 24 FPS, so keyframe 1 = first frame, keyframe 24 = 1s in. |
| resolution | string | — | 720p | 360p, 540p, 720p | Output resolution (1080p not supported). |
| original_sound_switch | boolean | — | true | true / false | Whether to keep the original audio. |
| seed | integer \| null | — | null | any integer | Random seed for generation. |
Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder for testing). This model does not use our format→size mapping (format_field is empty).
Limits:
- Input video formats: MP4, MOV, WebM, M4V, GIF.
- Reference image formats: JPG, JPEG, PNG, WebP, GIF, AVIF.
- Resolution: 360p / 540p / 720p (1080p listed but not supported).
- Cost is per 5-second clip; videos longer than 5s cost double. Best quality on clips under ~10 seconds.
- keyframe_id upper bound is duration_seconds × 24 (24 FPS normalized).
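Since keyframe_id is expressed in 24 FPS frames, a timestamp in the source clip has to be converted before it can be passed. The helper below is a minimal sketch; its name is hypothetical, and the rounding convention (round to the nearest frame, never below 1) is an assumption chosen to reproduce the two examples given in the parameter table: keyframe 1 for the first frame and keyframe 24 at 1 s.

```python
import math

NORMALIZED_FPS = 24  # input video is normalized to 24 FPS


def keyframe_for_time(seconds: float, duration_seconds: float) -> int:
    """Map a timestamp to a keyframe_id within 1..duration_seconds * 24."""
    if not 0 <= seconds <= duration_seconds:
        raise ValueError("timestamp must lie within the clip")
    max_id = math.floor(duration_seconds * NORMALIZED_FPS)
    # Assumption: nearest frame, clamped into the documented range.
    return max(1, min(max_id, round(seconds * NORMALIZED_FPS)))
```

For a 5 s clip this yields keyframe 1 at 0 s, 24 at 1 s, and 120 at the very end, which matches the documented upper bound of duration_seconds × 24.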
Wan 2.7 Video Edit wan_27_video_edit
Video-to-video editing driven by a text instruction (and optional reference image) — restyle, transform scenes, or apply style transfer to existing footage using WAN 2.7.
Call it via — video tool, action: "edit" (restyle existing footage) · raw: POST /v1/jobs/wan_27_video_edit
| Cost | 100 cr per call |
| Mode / timeout | webhook / 15m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | ✓ | — | minLength 1 | Editing instruction or style-transfer description. |
| video_url | string | ✓ | — | MP4/MOV, 2–10s, ≤100 MB | URL of the input video to edit. |
| reference_image_url | string (nullable) | — | null | jpg/jpeg/png/webp/gif/avif | Reference image URL for reference-based editing. |
| resolution | string | — | 1080p | 720p, 1080p | Output video resolution tier. |
| aspect_ratio | string (nullable) | — | null (matches input) | 16:9, 9:16, 1:1, 4:3, 3:4 | Aspect ratio of the generated video; defaults to the input video's. |
| duration | integer | — | 0 | 0, 2–10 | Output duration in seconds. 0 = match input; when set (2–10) truncates from the start. |
| audio_setting | string | — | auto | auto, origin | Audio handling. auto: model decides whether to regenerate audio. origin: preserve original audio. |
| seed | integer (nullable) | — | null | 0–2147483647 | Random seed for reproducibility. |
| enable_safety_checker | boolean | — | true | true / false | Enable content moderation on input and output. |
Wrapper params (our API, not part of the model input schema): out (required — workdir-relative output filename), mock (optional — return a test placeholder, skips the model call). This model defines format_field: "", so there is no format → model-size mapping.
Limits — Source video: MP4/MOV, duration 2–10 s, max file size 100 MB (upload timeout 30 s). Reference image formats: jpg, jpeg, png, webp, gif, avif. Output duration: 0 (match input) or 2–10 s. Output resolution: 720p or 1080p. Seed range: 0–2147483647.
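The enumerated values and ranges above can be checked while assembling the model input. This is a minimal sketch with a hypothetical helper name; it builds only the model's input fields (wrapper params such as out and mock are left out), and it omits optional fields left as None rather than sending explicit nulls, which is an assumption about what the API accepts.

```python
from typing import Any, Dict, Optional

RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
AUDIO_SETTINGS = {"auto", "origin"}
MAX_SEED = 2147483647


def build_wan_edit_input(
    prompt: str,
    video_url: str,
    *,
    resolution: str = "1080p",
    duration: int = 0,
    aspect_ratio: Optional[str] = None,
    audio_setting: str = "auto",
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Validate against the documented ranges and return the input dict."""
    if not prompt:
        raise ValueError("prompt must be at least 1 character")
    if resolution not in RESOLUTIONS:
        raise ValueError("resolution must be 720p or 1080p")
    if duration != 0 and not 2 <= duration <= 10:
        raise ValueError("duration must be 0 (match input) or 2-10 seconds")
    if aspect_ratio is not None and aspect_ratio not in ASPECT_RATIOS:
        raise ValueError("unsupported aspect_ratio")
    if audio_setting not in AUDIO_SETTINGS:
        raise ValueError("audio_setting must be auto or origin")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError("seed out of range")
    params: Dict[str, Any] = {
        "prompt": prompt,
        "video_url": video_url,
        "resolution": resolution,
        "duration": duration,
        "audio_setting": audio_setting,
    }
    if aspect_ratio is not None:
        params["aspect_ratio"] = aspect_ratio
    if seed is not None:
        params["seed"] = seed
    return params
```

Passing a fixed seed together with the same prompt and source video is the documented route to reproducible edits; leaving duration at 0 keeps the input clip's length.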
Topaz Video Upscale topaz_upscale_video
Professional-grade video upscaling and enhancement using Topaz technology — upscale resolution, interpolate frames, and clean up noise/compression artifacts.
Call it via — video tool, action upscale (pass video_url) · raw: POST /v1/jobs/topaz_upscale_video
| Cost | 100 cr per call |
| Mode / timeout | webhook / 15m |
Parameters — the model's input schema:
| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | ✓ | — | — | URL of the video to upscale. |
| model | string | — | Proteus | Proteus, Artemis HQ, Artemis MQ, Artemis LQ, Nyx, Nyx Fast, Nyx XL, Nyx HF, Gaia HQ, Gaia CG, Gaia 2, Starlight Precise 1, Starlight Precise 2, Starlight Precise 2.5, Starlight HQ, Starlight Mini, Starlight Sharp, Starlight Fast 1, Starlight Fast 2 | Enhancement model. Proteus = most videos; Artemis = denoise+sharpen; Nyx = dedicated denoising; Gaia HQ/CG = rendered content; Gaia 2 = animation/motion graphics at 2x; Starlight = generative diffusion-based upscaling. |
| upscale_factor | number | — | 2 | 1–4 | Factor to upscale by (e.g. 2.0 doubles width and height). |
| target_fps | integer | — | — (null) | 16–60 | Target FPS for frame interpolation. If set, interpolation is enabled. |
| compression | number | — | — (null, model-dependent) | 0.0–1.0 | Compression artifact removal level. |
| noise | number | — | — (null, model-dependent) | 0.0–1.0 | Noise reduction level. |
| halo | number | — | — (null, model-dependent) | 0.0–1.0 | Halo reduction level. |
| grain | number | — | — (null, model-dependent) | 0.0–0.1 (step 0.01) | Film grain amount. |
| recover_detail | number | — | — (null) | 0.0–1.0 | Recover original detail; higher preserves more original detail. |
| H264_output | boolean | — | false | true / false | Use H264 codec for output. Default (false) = H265. |
Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). This model has no format mapping (format_field is empty).
Limits — accepted input formats: mp4, mov, webm, m4v, gif. Max upscale_factor 4x; target_fps capped at 60. Pricing scales with duration and resolution: 2 cr/sec up to 720p, 4 cr/sec for 720p–1080p, 16 cr/sec above 1080p; price doubles for 60fps output; Gaia 2 costs half. (No published max duration / resolution / file-size limit.)
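The tiered pricing above can be worked through as a short calculation. This is a rough sketch under stated assumptions: the function name is hypothetical, the resolution tier is judged by output height in pixels after upscaling, a clip exactly at 720p or 1080p falls in the lower tier, and the 60fps surcharge applies only when target_fps is 60. The limits paragraph does not confirm any of these conventions.

```python
from typing import Optional


def estimate_topaz_credits(
    seconds: float,
    output_height: int,
    target_fps: Optional[int] = None,
    model: str = "Proteus",
) -> float:
    """Rough credit estimate from the per-second tiers listed above."""
    if output_height <= 720:       # assumption: tiers keyed on output height
        rate = 2.0
    elif output_height <= 1080:
        rate = 4.0
    else:
        rate = 16.0
    if target_fps == 60:           # price doubles for 60fps output
        rate *= 2
    if model == "Gaia 2":          # Gaia 2 costs half
        rate /= 2
    return seconds * rate


print(estimate_topaz_credits(10, 1080))
print(estimate_topaz_credits(10, 2160, target_fps=60))
```

Under these assumptions a 10 s clip upscaled to 1080p comes to 40 cr, and the same clip taken to 2160p at 60 fps comes to 320 cr, which shows how sharply the top tier and frame interpolation compound.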