# Video models

Video generation, image-to-video, editing, swap, and upscaling — model input schemas.

> Generations are charged in credits (see [Credits & plans](/guide/billing)). Every generation model also accepts `mock: true` for a free placeholder result.

### Seedance 2.0 Reference-to-Video `seedance_r2v`

ByteDance's reference-to-video model. It generates a clip from a text prompt plus up to 9 reference images, 3 reference videos, and 3 audio clips for identity, motion, and voice consistency, at output resolutions up to native 4K.

**Call it via** — `video` tool, `action: "create"` (text→video; optional `reference_images`, `video_urls`, `audio_urls`) · raw: `POST /v1/jobs/seedance_r2v`

| | |
|---|---|
| **Cost** | 303 cr per call (5 s at the default 720p). Scales with resolution: 480p ≈ 135 cr, 1080p 681 cr, 4K 1555 cr per 5 s |
| **Mode / timeout** | webhook / 15m |
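As a quick arithmetic check, the per-resolution credit figures in the Cost row can be turned into multipliers relative to the 720p default. This is only a sketch: the dictionary copies the numbers from the table, and the function name `resolution_multiplier` is illustrative, not part of the API.

```python
# Credits per 5 s clip, copied from the Cost row above.
CREDITS_PER_5S = {"480p": 135, "720p": 303, "1080p": 681, "4k": 1555}


def resolution_multiplier(resolution: str) -> float:
    """Return the cost of `resolution` relative to the 720p default."""
    return round(CREDITS_PER_5S[resolution] / CREDITS_PER_5S["720p"], 2)


for res in CREDITS_PER_5S:
    print(res, resolution_multiplier(res))
```

The ratios come out close to the multipliers quoted in the `resolution` parameter description.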

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `prompt` | string | ✓ | — | — | Text prompt used to generate the video. Refer to references as @Image1, @Video1, @Audio1. |
| `image_urls` | list&lt;string&gt; | | — | up to 9; JPEG/PNG/WebP; ≤30 MB each | Reference images. Refer to them as @Image1, @Image2… Total files across all modalities ≤ 12. |
| `video_urls` | list&lt;string&gt; | | — | up to 3; MP4/MOV; combined 2–15 s; total &lt;50 MB; each ~480p (640×640) to ~720p (834×1112) | Reference videos. Refer to them as @Video1, @Video2… |
| `audio_urls` | list&lt;string&gt; | | — | up to 3; MP3/WAV; combined ≤15 s; ≤15 MB each | Reference audio. Refer to them as @Audio1… If audio is provided, at least one reference image or video is required. |
| `resolution` | enum | | `720p` | `480p`, `720p`, `1080p`, `4k` | 480p for cheap drafts (~0.45× credits), 720p default, 1080p for final delivery (2.25×), 4k for hero shots (~5.1×). |
| `duration` | enum | | `auto` | `auto`, `4`–`15` | Duration in seconds, or auto to let the model decide. |
| `aspect_ratio` | enum | | `auto` | `auto`, `21:9`, `16:9`, `4:3`, `1:1`, `3:4`, `9:16` | Aspect ratio of the generated video. When omitted, our wrapper applies its vertical preset (`9:16`) — pass `auto` explicitly to follow the reference images' geometry. |
| `generate_audio` | boolean | | `true` | — | Generate synchronized audio (SFX, ambient, lip-synced speech). Cost is the same either way. |
| `bitrate_mode` | enum | | `standard` | `standard`, `high` | Output bitrate mode; `high` requests a higher-quality, larger-file encode. |
| `end_user_id` | string | | — | — | Unique ID of the end user. |

Our wrapper params (not part of the model input schema): `out` (required — workdir-relative output path), `mock` (optional — test placeholder), and `format` (optional — size preset `shorts`/`reels`/`horizontal`, mapped by our `format_field`/`format_mapping` to the model's `aspect_ratio`: shorts/reels→`9:16`, horizontal→`16:9`, default `9:16`).
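The `format` mapping described above can be sketched as a small lookup. This is an illustration, not the wrapper's actual code: the helper name `resolve_aspect_ratio` is hypothetical, and two behaviors are assumptions, namely that an explicit `aspect_ratio` takes precedence over `format` and that unknown preset names are rejected.

```python
# Preset names and targets as listed in the wrapper description.
FORMAT_TO_ASPECT = {"shorts": "9:16", "reels": "9:16", "horizontal": "16:9"}
DEFAULT_ASPECT = "9:16"  # applied when neither argument is given


def resolve_aspect_ratio(fmt=None, aspect_ratio=None):
    """Pick the model's `aspect_ratio` from the wrapper arguments."""
    if aspect_ratio is not None:
        # Assumption: an explicit aspect_ratio (including "auto") wins.
        return aspect_ratio
    if fmt is None:
        return DEFAULT_ASPECT
    if fmt not in FORMAT_TO_ASPECT:
        # Assumption: unknown preset names are an error.
        raise ValueError(f"unknown format preset: {fmt!r}")
    return FORMAT_TO_ASPECT[fmt]


print(resolve_aspect_ratio(fmt="horizontal"))
print(resolve_aspect_ratio(aspect_ratio="auto"))
```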

**Limits**:
- `prompt`: text only.
- `image_urls`: max 9 images; JPEG/PNG/WebP; ≤30 MB each.
- `video_urls`: max 3 videos; MP4/MOV; combined 2–15 s; total &lt;50 MB; each between ~480p (640×640) and ~720p (834×1112).
- `audio_urls`: max 3 files; MP3/WAV; combined ≤15 s; ≤15 MB each; requires at least one image or video reference.
- Total reference files across all modalities: ≤ 12.
- Output: resolution up to native 4K; duration 4–15 s (or `auto`).
- No seed input, so every render is a new take.
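The count limits above lend themselves to a client-side check before submitting a job. The following is a minimal sketch covering only the counting rules (per-modality caps, the total of 12, and the audio dependency); file size, format, and duration checks are left out, and the function name is hypothetical.

```python
def check_seedance_references(image_urls=(), video_urls=(), audio_urls=()):
    """Return a list of limit violations; an empty list means the counts are fine."""
    errors = []
    if len(image_urls) > 9:
        errors.append("image_urls: at most 9 images")
    if len(video_urls) > 3:
        errors.append("video_urls: at most 3 videos")
    if len(audio_urls) > 3:
        errors.append("audio_urls: at most 3 audio files")
    if len(image_urls) + len(video_urls) + len(audio_urls) > 12:
        errors.append("at most 12 reference files across all modalities")
    if audio_urls and not (image_urls or video_urls):
        errors.append("audio_urls requires at least one image or video reference")
    return errors


# Audio alone is rejected; adding one image makes the same request valid.
print(check_seedance_references(audio_urls=["voice.mp3"]))
print(check_seedance_references(image_urls=["face.png"], audio_urls=["voice.mp3"]))
```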

### Kling v3 Standard Image-to-Video `kling_v3_std_i2v`

Image-to-video at standard quality with cinematic visuals, fluid motion, native audio generation, and custom element support — use for quick drafts and iterations before pro renders.

**Call it via** — `image` tool, `action: "animate"`, `tier: "standard"` (the default animate tier) · raw: `POST /v1/jobs/kling_v3_std_i2v`

The `image(animate)` tool exposes the multi-shot timeline directly: pass `multi_prompt` (an array of `{prompt, duration}` shots) and optional `shot_type` instead of a single `prompt`. The tool validates Kling's caps before submitting — **at most 6 shots and a combined duration ≤ 15 s** (each shot 1–15 s, default 5) — and rejects `prompt` + `multi_prompt` together.
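The caps described above can be mirrored client-side before calling the tool. This is a sketch under stated assumptions rather than the tool's actual validator: the function name is hypothetical, and rejecting an empty `multi_prompt` list is an assumption (only the upper bound of 6 shots is documented).

```python
def validate_shot_plan(prompt=None, multi_prompt=None):
    """Check a single prompt or a multi-shot plan against the documented caps.

    Returns the combined duration in seconds for a multi-shot plan,
    or None when a single prompt is used.
    """
    if (prompt is None) == (multi_prompt is None):
        raise ValueError("pass exactly one of prompt or multi_prompt")
    if multi_prompt is None:
        return None
    # Upper bound of 6 is documented; rejecting an empty list is an assumption.
    if not 1 <= len(multi_prompt) <= 6:
        raise ValueError("multi_prompt needs 1 to 6 shots")
    total = 0
    for shot in multi_prompt:
        seconds = int(shot.get("duration", "5"))  # schema default is "5"
        if not 1 <= seconds <= 15:
            raise ValueError("each shot must last 1 to 15 s")
        total += seconds
    if total > 15:
        raise ValueError("combined duration must be 15 s or less")
    return total


shots = [
    {"prompt": "Wide shot of the harbor at dawn", "duration": "4"},
    {"prompt": "Close-up on the captain"},  # falls back to the 5 s default
]
print(validate_shot_plan(multi_prompt=shots))  # 9
```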

| | |
|---|---|
| **Cost** | 84 cr per call |
| **Mode / timeout** | webhook / 15m |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `start_image_url` | string | ✓ | — | — | URL of the image used as the starting frame of the video. |
| `prompt` | string |  | — | maxLength 2500 | Text prompt for video generation. Either `prompt` or `multi_prompt` must be provided, but not both. |
| `multi_prompt` | array&lt;object&gt; |  | — | items: `{ prompt: string (required), duration: string default "5", enum "1"–"15" }` | List of prompts for multi-shot generation; divides the video into multiple shots. |
| `duration` | string |  | `"5"` | `"3"`,`"4"`,`"5"`,`"6"`,`"7"`,`"8"`,`"9"`,`"10"`,`"11"`,`"12"`,`"13"`,`"14"`,`"15"` | Duration of the generated video in seconds. |
| `generate_audio` | boolean |  | `true` | — | Generate native audio for the video. Supports Chinese/English; other languages auto-translated to English. |
| `end_image_url` | string |  | — | — | URL of the image used as the end frame of the video. |
| `elements` | array&lt;object&gt; |  | — | items: `{ frontal_image_url, reference_image_urls (1–3, ≥1 required), video_url, voice_id }` | Characters/objects to inject. Each entry is either an image set (frontal + reference images) or a video. Reference in prompt as `@Element1`, `@Element2`, etc. Only one element may carry a video. |
| `shot_type` | string |  | `"customize"` | `customize`, `intelligent` | Multi-shot generation type; `intelligent` lets the model auto-determine shot structure. |
| `negative_prompt` | string |  | `"blur, distort, and low quality"` | maxLength 2500 | What to steer away from. |
| `cfg_scale` | number |  | `0.5` | 0–1 | Classifier-Free Guidance scale — how strictly the model follows the prompt. |

Our wrapper params (not part of the model input schema): `out` (required — output filename) and `mock` (optional — test placeholder). `format` is accepted by our `image` MCP tool but is NOT forwarded to this model (the model has no size/aspect field; YAML `format_field` is empty), so it has no effect here.

**Limits** (model limits):
- Prompt / negative_prompt: max 2500 characters each.
- Duration: 3–15 s (top-level `duration`); per-shot `multi_prompt` duration 1–15 s.
- `start_image_url` / `end_image_url` / element images: max file size 10 MB, min 300×300 px, aspect ratio 0.40–2.50; accepted formats jpg, jpeg, png, webp, gif, avif.
- Element `video_url`: max 200 MB, 720–2160 px per side, 3–10.05 s, 24–60 FPS; accepted formats mp4, mov, webm, m4v, gif.
- Element `reference_image_urls`: 1–3 images, at least one required.
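The image limits in the list above can be pre-checked locally once the image's dimensions and file size are known. A minimal sketch, assuming that "aspect ratio" means width divided by height and that 10 MB means 10 × 1024 × 1024 bytes (neither is spelled out above); the function name is hypothetical.

```python
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: MB interpreted as MiB


def check_kling_image(width: int, height: int, size_bytes: int):
    """Return a list of problems for a start, end, or element image."""
    problems = []
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file larger than 10 MB")
    if width < 300 or height < 300:
        problems.append("smaller than 300x300 px")
    ratio = width / height  # assumption: aspect ratio = width / height
    if not 0.40 <= ratio <= 2.50:
        problems.append("aspect ratio outside 0.40-2.50")
    return problems


print(check_kling_image(1080, 1920, 2_000_000))  # portrait frame, within limits
print(check_kling_image(3000, 1000, 2_000_000))  # too wide
```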

### Kling v3 Pro Image-to-Video `kling_v3_pro_i2v`

Top-tier image-to-video with cinematic visuals, fluid motion, native audio generation, and custom element (character/object) injection.

**Call it via** — MCP tool `image`, action `animate` with `tier: "pro"` (routes `animate_pro` → `kling_v3_pro_i2v`) · raw: `POST /v1/jobs/kling_v3_pro_i2v`

The `image(animate)` tool exposes the multi-shot timeline directly: pass `multi_prompt` (an array of `{prompt, duration}` shots) and optional `shot_type` instead of a single `prompt`. The tool validates Kling's caps before submitting — **at most 6 shots and a combined duration ≤ 15 s** (each shot 1–15 s, default 5) — and rejects `prompt` + `multi_prompt` together. Billed per second (no per-shot surcharge).

| | |
|---|---|
| **Cost** | 112 cr per call |
| **Mode / timeout** | webhook / 15m (from our YAML) |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `start_image_url` | string | ✓ | — | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | URL of the start frame image. Aspect ratio of the output is inferred from this image. |
| `prompt` | string | — | — | maxLength 2500 | Text prompt. Either `prompt` or `multi_prompt` must be provided, but not both. |
| `multi_prompt` | `KlingV3MultiPromptElement[]` | — | — | array of `{prompt (req), duration}` | Multi-shot prompt list; divides the video into shots. Mutually exclusive with `prompt`. Each shot `duration` enum `"1"`–`"15"`, default `"5"`. |
| `duration` | string (enum) | — | `"5"` | `"3"`,`"4"`,`"5"`,`"6"`,`"7"`,`"8"`,`"9"`,`"10"`,`"11"`,`"12"`,`"13"`,`"14"`,`"15"` | Total video length in seconds. |
| `generate_audio` | boolean | — | `true` | — | Generate native audio (Chinese/English native; other languages auto-translated to English). |
| `end_image_url` | string \| null | — | — | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | Optional end frame image URL (start-to-end interpolation). |
| `elements` | `KlingV3ComboElementInput[]` \| null | — | — | array | Reference characters/objects to inject. Each item is an image set (`frontal_image_url` + `reference_image_urls`) or a video (`video_url`), with optional `voice_id`. Reference in prompt as `@Element1`, `@Element2`. |
| `shot_type` | string (enum) | — | `"customize"` | `customize`, `intelligent` | Multi-shot generation type; `intelligent` lets the model auto-plan shot structure. |
| `negative_prompt` | string | — | `"blur, distort, and low quality"` | maxLength 2500 | Things to avoid. |
| `cfg_scale` | number | — | `0.5` | 0–1 | Classifier-free guidance scale; higher = stricter prompt adherence. |

`elements[]` sub-fields: `frontal_image_url` (string, main view), `reference_image_urls` (string[], 1–3 images from different angles, at least one required when using image elements), `video_url` (string, max one video element per request), `voice_id` (string; voice binding supported only for video elements, not image elements).

Our wrapper params (not part of the model input schema): `out` (required — workdir-relative output path), `mock` (optional — test placeholder). We do not map a `format` field — there is no model size/aspect_ratio parameter; aspect ratio is inferred from `start_image_url` (`format_field: ""`).

**Limits** — model limits:
- Video duration: 3–15 seconds (single-prompt `duration`); per-shot `multi_prompt` duration 1–15s; shot durations sum to total length.
- `prompt` / `negative_prompt`: max 2500 characters each.
- `start_image_url` / `end_image_url` / element images: max 10 MB; min 300×300 px; aspect ratio 0.40–2.50; formats jpg, jpeg, png, webp, gif, avif.
- Element `reference_image_urls`: 1–3 images.
- Element `video_url`: max 200 MB; 720–2160 px; 3.0–10.05 s; 24–60 fps; formats mp4, mov, webm, m4v, gif; max one video element per request.
- Audio: native Chinese and English; other languages auto-translated to English.
- Cost: ≈22 cr/s (audio off, the catalog default), ≈34 cr/s (audio on).
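Because this model is billed per second, a rough estimate follows from the approximate rates in the last bullet. The sketch below uses those rounded rates, so results are approximate (for example, 5 s with audio off gives 110 cr, close to the 112 cr listed in the Cost row). The function name is illustrative.

```python
# Approximate per-second rates from the Limits list, keyed by generate_audio.
CREDITS_PER_SECOND = {False: 22, True: 34}


def estimate_pro_credits(seconds: int, generate_audio: bool) -> int:
    """Approximate credits for a clip of `seconds` total length (3 to 15 s)."""
    if not 3 <= seconds <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return seconds * CREDITS_PER_SECOND[generate_audio]


print(estimate_pro_credits(5, generate_audio=False))  # 110
print(estimate_pro_credits(10, generate_audio=True))  # 340
```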

### Kling O3 Video Edit `kling_o3_video_edit`

Video-to-video editing with Kling O3 — restyle footage, replace characters/objects, or insert elements into a source video using reference images and structured element definitions.

**Call it via** — `video` tool, action `edit_ref` (`video(edit_ref)` — requires `video_url`, `prompt`, `reference_images`) · raw: `POST /v1/jobs/kling_o3_video_edit`

| | |
|---|---|
| **Cost** | 126 cr per call |
| **Mode / timeout** | webhook / 15m |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `prompt` | string | ✓ | — | maxLength 2500 | Text prompt for the edit. Reference the source video as `@Video1`, elements as `@Element1`–`@ElementN`, and reference images as `@Image1`–`@ImageN`. |
| `video_url` | string | ✓ | — | .mp4/.mov only; 720–2160px; 3.0–10.05s; 24–60 FPS; ≤200MB | Reference (source) video URL to edit. |
| `image_urls` | string[] \| null | — | null | each image ≤10MB, ≥300×300px, aspect 0.40–2.50 | Reference images for style/appearance, cited in prompt as `@Image1`, `@Image2`, … Max 4 total (elements + reference images) when using video. |
| `keep_audio` | boolean | — | `true` | true / false | Keep the original audio from the source video. |
| `elements` | object[] \| null | — | null | array of `{ frontal_image_url: string, reference_image_urls: string[] (1–3) }` | Elements (characters/objects) to inject, cited in prompt as `@Element1`, `@Element2`. Each element needs a frontal image and 1–3 reference images (per-image limits same as `image_urls`). |
| `shot_type` | string | — | `customize` | const `customize` | Multi-shot generation type (only `customize` is accepted). |

Our wrapper params (not part of the model input schema): `out` (required — workdir-relative output path), `mock` (optional — skip the API call and return a placeholder). This model has no `format` mapping (no model size field). Our `video(edit_ref)` action collects reference photos under `reference_images` and maps them to the model's `image_urls` field; the optional `elements` argument passes through to the model's `elements` input (cite as `@Element1`).

**Limits** — prompt ≤2500 chars · source video .mp4/.mov, 3.0–10.05s, 720–2160px, 24–60 FPS, ≤200MB · reference/element images ≤10MB each, min 300×300px, aspect ratio 0.40–2.50 · max 4 total (elements + reference images) when using video.
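The combined budget of 4 references is easy to overshoot when mixing `elements` and `image_urls`, so a small builder that enforces it can help. This is a sketch of the model inputs only: the function name is hypothetical, the URLs are placeholders, and how the resulting dictionary is wrapped in an actual request is not shown here.

```python
def build_edit_inputs(prompt, video_url, image_urls=None, elements=None, keep_audio=True):
    """Assemble model inputs for kling_o3_video_edit, enforcing the documented caps."""
    image_urls = list(image_urls or [])
    elements = list(elements or [])
    if not prompt or len(prompt) > 2500:
        raise ValueError("prompt is required and limited to 2500 characters")
    if len(image_urls) + len(elements) > 4:
        raise ValueError("at most 4 elements plus reference images in total")
    for element in elements:
        refs = element.get("reference_image_urls", [])
        if "frontal_image_url" not in element or not 1 <= len(refs) <= 3:
            raise ValueError("each element needs a frontal image and 1 to 3 reference images")
    inputs = {"prompt": prompt, "video_url": video_url, "keep_audio": keep_audio}
    if image_urls:
        inputs["image_urls"] = image_urls
    if elements:
        inputs["elements"] = elements
    return inputs


inputs = build_edit_inputs(
    prompt="Replace the jacket in @Video1 with the one from @Image1",
    video_url="https://example.com/source.mp4",       # placeholder URL
    image_urls=["https://example.com/jacket.png"],    # placeholder URL
)
print(sorted(inputs))
```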

### PixVerse Swap `pixverse_swap`

Generate high-quality video clips by swapping a person, object, or background in source footage using a reference image — keyframe-based, prompt-free.

**Call it via** — `video` tool, action `swap` (routes to `pixverse_swap`) · raw: `POST /v1/jobs/pixverse_swap`

| | |
|---|---|
| **Cost** | 30 cr per call |
| **Mode / timeout** | webhook / 15m (from our YAML) |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `video_url` | string | ✓ | — | URL | URL of the external video to swap. |
| `image_url` | string | ✓ | — | URL | URL of the target image for swapping (the element to swap IN). |
| `mode` | string | | `person` | `person`, `object`, `background` | The swap mode to use. |
| `keyframe_id` | integer | | `1` | min `1`, max = `duration_seconds × 24` | Keyframe ID for face/object mapping. Input video is normalized to 24 FPS, so keyframe 1 is the first frame and keyframe 24 is roughly 1 s in. |
| `resolution` | string | | `720p` | `360p`, `540p`, `720p` | Output resolution (1080p not supported). |
| `original_sound_switch` | boolean | | `true` | true / false | Whether to keep the original audio. |
| `seed` | integer \| null | | `null` | any integer | Random seed for generation. |

Our wrapper params (not part of the model input schema): `out` (required — workdir-relative output path), `mock` (optional — skip the API call and return a placeholder for testing). This model does not use our `format`→size mapping (`format_field` is empty).

**Limits**:
- Input video formats: MP4, MOV, WebM, M4V, GIF.
- Reference image formats: JPG, JPEG, PNG, WebP, GIF, AVIF.
- Resolution: 360p / 540p / 720p (1080p listed but not supported).
- Cost is per 5-second clip; videos longer than 5s cost double. Best quality on clips under ~10 seconds.
- `keyframe_id` upper bound is `duration_seconds × 24` (24 FPS normalized).
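To pick a `keyframe_id` from a timestamp, the 24 FPS normalization and the bounds above can be combined as follows. This is a sketch: rounding `seconds × 24` is an assumption chosen to match the table's examples (keyframe 1 for the first frame, keyframe 24 for about 1 s in), and both function names are hypothetical. The cost helper simply applies the doubling rule to the 30 cr base price.

```python
NORMALIZED_FPS = 24
BASE_CREDITS = 30  # per call, from the Cost row


def keyframe_for_time(seconds_in: float, duration_seconds: float) -> int:
    """Map a timestamp to a keyframe_id clamped to 1..duration_seconds * 24."""
    upper = int(duration_seconds * NORMALIZED_FPS)
    frame = round(seconds_in * NORMALIZED_FPS)  # assumption: nearest frame
    return max(1, min(frame, upper))


def swap_credits(duration_seconds: float) -> int:
    """Clips longer than 5 s cost double."""
    return BASE_CREDITS * 2 if duration_seconds > 5 else BASE_CREDITS


print(keyframe_for_time(0, 5))    # 1
print(keyframe_for_time(1, 5))    # 24
print(keyframe_for_time(10, 5))   # 120, clamped to the last frame
print(swap_credits(8))            # 60
```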

### Wan 2.7 Video Edit `wan_27_video_edit`

Video-to-video editing driven by a text instruction (and optional reference image) — restyle, transform scenes, or apply style transfer to existing footage using WAN 2.7.

**Call it via** — `video` tool, `action: "edit"` (restyle existing footage) · raw: `POST /v1/jobs/wan_27_video_edit`

| | |
|---|---|
| **Cost** | 100 cr per call |
| **Mode / timeout** | webhook / 15m |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `prompt` | string | ✓ | — | minLength 1 | Editing instruction or style-transfer description. |
| `video_url` | string | ✓ | — | MP4/MOV, 2–10s, ≤100 MB | URL of the input video to edit. |
| `reference_image_url` | string (nullable) | | null | jpg/jpeg/png/webp/gif/avif | Reference image URL for reference-based editing. |
| `resolution` | string | | `1080p` | `720p`, `1080p` | Output video resolution tier. |
| `aspect_ratio` | string (nullable) | | null (matches input) | `16:9`, `9:16`, `1:1`, `4:3`, `3:4` | Aspect ratio of the generated video; defaults to the input video's. |
| `duration` | integer | | `0` | `0`, `2`–`10` | Output duration in seconds. `0` = match input; when set (2–10) truncates from the start. |
| `audio_setting` | string | | `auto` | `auto`, `origin` | Audio handling. `auto`: model decides whether to regenerate audio. `origin`: preserve original audio. |
| `seed` | integer (nullable) | | null | 0–2147483647 | Random seed for reproducibility. |
| `enable_safety_checker` | boolean | | `true` | `true` / `false` | Enable content moderation on input and output. |

Wrapper params (our API, not part of the model input schema): `out` (required — workdir-relative output filename), `mock` (optional — return a test placeholder, skips the model call). This model defines `format_field: ""`, so there is no `format` → model-size mapping.

**Limits** — Source video: MP4/MOV, duration 2–10 s, max file size 100 MB (upload timeout 30 s). Reference image formats: jpg, jpeg, png, webp, gif, avif. Output duration: 0 (match input) or 2–10 s. Output resolution: 720p or 1080p. Seed range: 0–2147483647.
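The enumerated values and ranges above can be checked before submitting. The sketch below builds a dictionary of model inputs and validates only what the schema table states; the function name is hypothetical, the URL is a placeholder, and source-video properties (format, length, file size) are not inspected.

```python
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


def build_wan_edit_inputs(prompt, video_url, *, resolution="1080p", duration=0,
                          audio_setting="auto", aspect_ratio=None, seed=None):
    """Assemble model inputs for wan_27_video_edit with basic range checks."""
    if len(prompt) < 1:
        raise ValueError("prompt must not be empty")
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    if duration != 0 and not 2 <= duration <= 10:
        raise ValueError("duration must be 0 (match input) or 2 to 10 seconds")
    if audio_setting not in ("auto", "origin"):
        raise ValueError("audio_setting must be auto or origin")
    if aspect_ratio is not None and aspect_ratio not in ASPECT_RATIOS:
        raise ValueError("unsupported aspect_ratio")
    if seed is not None and not 0 <= seed <= 2147483647:
        raise ValueError("seed must be between 0 and 2147483647")
    inputs = {
        "prompt": prompt,
        "video_url": video_url,
        "resolution": resolution,
        "duration": duration,
        "audio_setting": audio_setting,
    }
    # Nullable fields are omitted so the model falls back to its defaults.
    if aspect_ratio is not None:
        inputs["aspect_ratio"] = aspect_ratio
    if seed is not None:
        inputs["seed"] = seed
    return inputs


print(build_wan_edit_inputs("Turn the scene into a watercolor painting",
                            "https://example.com/clip.mp4",  # placeholder URL
                            duration=4, seed=42))
```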

### Topaz Video Upscale `topaz_upscale_video`

Professional-grade video upscaling and enhancement using Topaz technology — upscale resolution, interpolate frames, and clean up noise/compression artifacts.

**Call it via** — `video` tool, action `upscale` (pass `video_url`) · raw: `POST /v1/jobs/topaz_upscale_video`

| | |
|---|---|
| **Cost** | 100 cr per call |
| **Mode / timeout** | webhook / 15m |

**Parameters** — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| `video_url` | string | ✓ | — | — | URL of the video to upscale. |
| `model` | string | | `Proteus` | `Proteus`, `Artemis HQ`, `Artemis MQ`, `Artemis LQ`, `Nyx`, `Nyx Fast`, `Nyx XL`, `Nyx HF`, `Gaia HQ`, `Gaia CG`, `Gaia 2`, `Starlight Precise 1`, `Starlight Precise 2`, `Starlight Precise 2.5`, `Starlight HQ`, `Starlight Mini`, `Starlight Sharp`, `Starlight Fast 1`, `Starlight Fast 2` | Enhancement model. Proteus = most videos; Artemis = denoise+sharpen; Nyx = dedicated denoising; Gaia HQ/CG = rendered content; Gaia 2 = animation/motion graphics at 2x; Starlight = generative diffusion-based upscaling. |
| `upscale_factor` | number | | `2` | 1–4 | Factor to upscale by (e.g. 2.0 doubles width and height). |
| `target_fps` | integer | | — (null) | 16–60 | Target FPS for frame interpolation. If set, interpolation is enabled. |
| `compression` | number | | — (null, model-dependent) | 0.0–1.0 | Compression artifact removal level. |
| `noise` | number | | — (null, model-dependent) | 0.0–1.0 | Noise reduction level. |
| `halo` | number | | — (null, model-dependent) | 0.0–1.0 | Halo reduction level. |
| `grain` | number | | — (null, model-dependent) | 0.0–0.1 (step 0.01) | Film grain amount. |
| `recover_detail` | number | | — (null) | 0.0–1.0 | Recover original detail; higher preserves more original detail. |
| `H264_output` | boolean | | `false` | true / false | Use H264 codec for output. Default (false) = H265. |

Our wrapper params (not part of the model input schema): `out` (required — workdir-relative output path), `mock` (optional — test placeholder). This model has no `format` mapping (`format_field` is empty).

**Limits** — accepted input formats: mp4, mov, webm, m4v, gif. Max `upscale_factor` 4x; `target_fps` capped at 60. Pricing scales with duration and resolution: 2 cr/sec up to 720p, 4 cr/sec for 720p–1080p, 16 cr/sec above 1080p; price doubles for 60fps output; Gaia 2 costs half. (No published max duration / resolution / file-size limit.)
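The per-second pricing rule can be sketched as a small estimator. Several details are assumptions, because the text above does not pin them down: the tier is chosen by output height in pixels with inclusive boundaries (≤720, ≤1080, above), the 60 fps surcharge applies when `target_fps` is exactly 60, and the Gaia 2 discount combines multiplicatively with the surcharge. The function name is hypothetical.

```python
def estimate_topaz_credits(seconds: float, output_height: int,
                           target_fps=None, model="Proteus") -> float:
    """Rough credit estimate from the documented per-second tiers."""
    # Assumption: tiers keyed on output height, boundaries inclusive.
    if output_height <= 720:
        rate = 2.0
    elif output_height <= 1080:
        rate = 4.0
    else:
        rate = 16.0
    if target_fps == 60:  # assumption: surcharge only at exactly 60 fps
        rate *= 2
    if model == "Gaia 2":
        rate /= 2
    return seconds * rate


print(estimate_topaz_credits(10, 720))                     # 20.0
print(estimate_topaz_credits(10, 1080, target_fps=60))     # 80.0
print(estimate_topaz_credits(10, 2160, model="Gaia 2"))    # 80.0
```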
