Video models

Video generation, image-to-video, editing, swap, and upscaling — model input schemas.

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.
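As a rough illustration of the raw route and the mock flag, the sketch below builds (but does not send) a `POST /v1/jobs/<model>` request. The base URL, the Bearer auth header, and the flat JSON body that puts wrapper params next to model inputs are all assumptions for illustration; only the path shape, `out`, and `mock: true` come from this page.

```python
import json
import urllib.request

# Hypothetical base URL; replace with your deployment's real host.
BASE_URL = "https://api.example.com"


def build_job_request(model_id, inputs, out, api_key, mock=True):
    """Build, without sending, a POST /v1/jobs/<model_id> request.

    Assumption: wrapper params (out, mock) sit in the same JSON object
    as the model inputs, and auth is a Bearer token.
    """
    body = dict(inputs)
    body["out"] = out          # wrapper param: output path
    if mock:
        body["mock"] = True    # free placeholder result
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/jobs/{model_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


req = build_job_request(
    "pixverse_swap",
    {"video_url": "https://example.com/in.mp4",
     "image_url": "https://example.com/face.png"},
    out="swap.mp4",
    api_key="TEST_KEY",
)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to inspect the body during a mock run before spending credits.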

Seedance 2.0 Reference-to-Video `seedance_r2v`

ByteDance's reference-to-video model that generates a clip from a text prompt plus up to 9 reference images, 3 videos, and 3 audio clips for identity, motion, and voice consistency. Outputs up to native 4K.

Call it via — video tool, action: "create" (text→video; optional reference_images, video_urls, audio_urls) · raw: POST /v1/jobs/seedance_r2v


Cost	303 cr per call (5 s at the default 720p). Scales with resolution: 480p ≈ 135 cr, 1080p 681 cr, 4K 1555 cr per 5 s
Mode / timeout	webhook / 15m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`prompt`	string	✓	—	—	Text prompt used to generate the video. Refer to references as @Image1, @Video1, @Audio1.
`image_urls`	list<string>		—	up to 9; JPEG/PNG/WebP; ≤30 MB each	Reference images. Refer to them as @Image1, @Image2… Total files across all modalities ≤ 12.
`video_urls`	list<string>		—	up to 3; MP4/MOV; combined 2–15 s; total <50 MB; each ~480p (640×640) to ~720p (834×1112)	Reference videos. Refer to them as @Video1, @Video2…
`audio_urls`	list<string>		—	up to 3; MP3/WAV; combined ≤15 s; ≤15 MB each	Reference audio. Refer to them as @Audio1… If audio is provided, at least one reference image or video is required.
`resolution`	enum		`720p`	`480p`, `720p`, `1080p`, `4k`	480p for cheap drafts (~0.45× credits), 720p default, 1080p for final delivery (2.25×), 4k for hero shots (~5.1×).
`duration`	enum		`auto`	`auto`, `4`–`15`	Duration in seconds, or auto to let the model decide.
`aspect_ratio`	enum		`auto`	`auto`, `21:9`, `16:9`, `4:3`, `1:1`, `3:4`, `9:16`	Aspect ratio of the generated video. When omitted, our wrapper applies its vertical preset (`9:16`) — pass `auto` explicitly to follow the reference images' geometry.
`generate_audio`	boolean		`true`	—	Generate synchronized audio (SFX, ambient, lip-synced speech). Cost is the same either way.
`bitrate_mode`	enum		`standard`	`standard`, `high`	Output bitrate mode; `high` requests a higher-quality, larger-file encode.
`end_user_id`	string		—	—	Unique ID of the end user.

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder), and format (optional — size preset shorts/reels/horizontal, mapped by our format_field/format_mapping to the model's aspect_ratio: shorts/reels→9:16, horizontal→16:9, default 9:16).
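The format mapping described above can be sketched as a small lookup. The function name, the rule that an explicit `aspect_ratio` takes precedence over `format`, and the choice to reject unknown presets are assumptions; only the three presets and the `9:16` default are taken from this page.

```python
# Preset mapping as documented: shorts/reels -> 9:16, horizontal -> 16:9.
FORMAT_TO_ASPECT = {"shorts": "9:16", "reels": "9:16", "horizontal": "16:9"}
DEFAULT_ASPECT = "9:16"  # applied when neither value is passed


def resolve_aspect_ratio(size_format=None, aspect_ratio=None):
    """Return the aspect_ratio that would reach the model (hypothetical helper)."""
    if aspect_ratio is not None:
        # Assumption: an explicit aspect_ratio (including "auto") wins.
        return aspect_ratio
    if size_format is None:
        return DEFAULT_ASPECT
    if size_format not in FORMAT_TO_ASPECT:
        # Assumption: unknown presets are rejected rather than defaulted.
        raise ValueError(f"unknown format preset: {size_format!r}")
    return FORMAT_TO_ASPECT[size_format]


print(resolve_aspect_ratio())                               # 9:16
print(resolve_aspect_ratio("horizontal"))                   # 16:9
print(resolve_aspect_ratio("reels", aspect_ratio="auto"))   # auto
```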

Limits — prompt: text only. image_urls: max 9 images, JPEG/PNG/WebP, ≤30 MB each. video_urls: max 3 videos, MP4/MOV, combined 2–15 s, total <50 MB, each between ~480p (640×640) and ~720p (834×1112). audio_urls: max 3 files, MP3/WAV, combined ≤15 s, ≤15 MB each; requires at least one image or video reference. Total reference files across all modalities ≤ 12. Output resolution up to native 4K; duration 4–15 s (or auto). No seed input — every render is a new take.
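A client-side pre-check of the reference counts might look like the sketch below. The helper name is hypothetical, and it only checks counts and the audio rule; file sizes, formats, and clip durations would need to be inspected separately.

```python
def check_seedance_refs(image_urls=(), video_urls=(), audio_urls=()):
    """Return a list of count-limit violations (empty when acceptable)."""
    problems = []
    if len(image_urls) > 9:
        problems.append("image_urls: at most 9 images")
    if len(video_urls) > 3:
        problems.append("video_urls: at most 3 videos")
    if len(audio_urls) > 3:
        problems.append("audio_urls: at most 3 audio files")
    total = len(image_urls) + len(video_urls) + len(audio_urls)
    if total > 12:
        problems.append("total reference files across modalities must be <= 12")
    if audio_urls and not (image_urls or video_urls):
        problems.append("audio_urls requires at least one image or video reference")
    return problems


# Audio alone is rejected; adding one image satisfies the rule.
print(check_seedance_refs(audio_urls=["voice.mp3"]))
print(check_seedance_refs(image_urls=["face.png"], audio_urls=["voice.mp3"]))
```

Note that the per-modality maxima (9 + 3 + 3 = 15) exceed the combined cap of 12, so the total check can fail even when each list is within its own limit.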

Kling v3 Standard Image-to-Video `kling_v3_std_i2v`

Image-to-video at standard quality with cinematic visuals, fluid motion, native audio generation, and custom element support — use for quick drafts and iterations before pro renders.

Call it via — image tool, action: "animate", tier: "standard" (the default animate tier) · raw: POST /v1/jobs/kling_v3_std_i2v

The image(animate) tool exposes the multi-shot timeline directly: pass multi_prompt (an array of {prompt, duration} shots) and optional shot_type instead of a single prompt. The tool validates Kling's caps before submitting — at most 6 shots and a combined duration ≤ 15 s (each shot 1–15 s, default 5) — and rejects prompt + multi_prompt together.
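The caps listed above can be mirrored in a short pre-flight check. This is only a sketch of the stated rules (at most 6 shots, each 1–15 s, combined ≤ 15 s, default 5 s per shot), not the tool's actual validator; the function name is hypothetical.

```python
MAX_SHOTS = 6
MAX_TOTAL_SECONDS = 15
DEFAULT_SHOT_SECONDS = "5"  # the schema expresses durations as strings


def total_shot_seconds(multi_prompt):
    """Validate a multi_prompt list and return its combined duration."""
    if not 1 <= len(multi_prompt) <= MAX_SHOTS:
        raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(multi_prompt)}")
    total = 0
    for index, shot in enumerate(multi_prompt, start=1):
        if not shot.get("prompt"):
            raise ValueError(f"shot {index}: prompt is required")
        seconds = int(shot.get("duration", DEFAULT_SHOT_SECONDS))
        if not 1 <= seconds <= 15:
            raise ValueError(f"shot {index}: duration must be 1 to 15 s")
        total += seconds
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"combined duration {total} s exceeds {MAX_TOTAL_SECONDS} s")
    return total


shots = [
    {"prompt": "wide shot of the harbor", "duration": "4"},
    {"prompt": "close-up on the captain"},  # falls back to 5 s
]
print(total_shot_seconds(shots))  # 9
```

Because the default shot length is 5 s, four shots without explicit durations already sum to 20 s and fail the combined cap.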


Cost	84 cr per call
Mode / timeout	webhook / 15m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`start_image_url`	string	✓	—	—	URL of the image used as the starting frame of the video.
`prompt`	string		—	maxLength 2500	Text prompt for video generation. Either `prompt` or `multi_prompt` must be provided, but not both.
`multi_prompt`	array<object>		—	items: `{ prompt: string (required), duration: string default "5", enum "1"–"15" }`	List of prompts for multi-shot generation; divides the video into multiple shots.
`duration`	string		`"5"`	`"3"`,`"4"`,`"5"`,`"6"`,`"7"`,`"8"`,`"9"`,`"10"`,`"11"`,`"12"`,`"13"`,`"14"`,`"15"`	Duration of the generated video in seconds.
`generate_audio`	boolean		`true`	—	Generate native audio for the video. Supports Chinese/English; other languages auto-translated to English.
`end_image_url`	string		—	—	URL of the image used as the end frame of the video.
`elements`	array<object>		—	items: `{ frontal_image_url, reference_image_urls (1–3, ≥1 required), video_url, voice_id }`	Characters/objects to inject. Each entry is either an image set (frontal + reference images) or a video. Reference in prompt as `@Element1`, `@Element2`, etc. Only one element may carry a video.
`shot_type`	string		`"customize"`	`customize`, `intelligent`	Multi-shot generation type; `intelligent` lets the model auto-determine shot structure.
`negative_prompt`	string		`"blur, distort, and low quality"`	maxLength 2500	What to steer away from.
`cfg_scale`	number		`0.5`	0–1	Classifier-Free Guidance scale — how strictly the model follows the prompt.

Our wrapper params (not part of the model input schema): out (required — output filename) and mock (optional — test placeholder). format is accepted by our image MCP tool but is NOT forwarded to this model (the model has no size/aspect field; YAML format_field is empty), so it has no effect here.

Limits (model limits):

Prompt / negative_prompt: max 2500 characters each.
Duration: 3–15 s (top-level); per-shot multi_prompt duration 1–15 s.
start_image_url / end_image_url / element images: max file size 10 MB, min 300×300 px, aspect ratio 0.40–2.50; accepted formats jpg, jpeg, png, webp, gif, avif.
Element video_url: max 200 MB, 720–2160 px per side, 3–10.05 s, 24–60 FPS; accepted formats mp4, mov, webm, m4v, gif.
Element reference_image_urls: 1–3 images, at least one required.

Kling v3 Pro Image-to-Video `kling_v3_pro_i2v`

Top-tier image-to-video with cinematic visuals, fluid motion, native audio generation, and custom element (character/object) injection.

Call it via — MCP tool image, action animate with tier: "pro" (routes animate_pro → kling_v3_pro_i2v) · raw: POST /v1/jobs/kling_v3_pro_i2v


Cost	112 cr per call
Mode / timeout	webhook / 15m (from our YAML)

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`start_image_url`	string	✓	—	Max 10MB; min 300×300px; aspect ratio 0.40–2.50	URL of the start frame image. Aspect ratio of the output is inferred from this image.
`prompt`	string	—	—	maxLength 2500	Text prompt. Either `prompt` or `multi_prompt` must be provided, but not both.
`multi_prompt`	`KlingV3MultiPromptElement[]`	—	—	array of `{prompt (req), duration}`	Multi-shot prompt list; divides the video into shots. Mutually exclusive with `prompt`. Each shot `duration` enum `"1"`–`"15"`, default `"5"`.
`duration`	string (enum)	—	`"5"`	`"3"`,`"4"`,`"5"`,`"6"`,`"7"`,`"8"`,`"9"`,`"10"`,`"11"`,`"12"`,`"13"`,`"14"`,`"15"`	Total video length in seconds.
`generate_audio`	boolean	—	`true`	—	Generate native audio (Chinese/English native; other languages auto-translated to English).
`end_image_url`	string \| null	—	—	Max 10MB; min 300×300px; aspect ratio 0.40–2.50	Optional end frame image URL (start-to-end interpolation).
`elements`	`KlingV3ComboElementInput[]` \| null	—	—	array	Reference characters/objects to inject. Each item is an image set (`frontal_image_url` + `reference_image_urls`) or a video (`video_url`), with optional `voice_id`. Reference in prompt as `@Element1`, `@Element2`.
`shot_type`	string (enum)	—	`"customize"`	`customize`, `intelligent`	Multi-shot generation type; `intelligent` lets the model auto-plan shot structure.
`negative_prompt`	string	—	`"blur, distort, and low quality"`	maxLength 2500	Things to avoid.
`cfg_scale`	number	—	`0.5`	0–1	Classifier-free guidance scale; higher = stricter prompt adherence.

elements[] sub-fields: frontal_image_url (string, main view), reference_image_urls (string[], 1–3 images from different angles, at least one required when using image elements), video_url (string, max one video element per request), voice_id (string; voice binding supported only for video elements, not image elements).
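The sub-field rules above lend themselves to a quick structural check before submitting. The sketch below is a hypothetical helper, and it assumes that an entry with `video_url` is a video element and anything else is an image element that needs `frontal_image_url`.

```python
def check_elements(elements):
    """Return a list of rule violations for an elements[] array."""
    problems = []
    video_elements = 0
    for index, element in enumerate(elements, start=1):
        tag = f"@Element{index}"
        if element.get("video_url"):
            video_elements += 1
            continue  # voice_id is allowed on video elements
        refs = element.get("reference_image_urls") or []
        if not element.get("frontal_image_url"):
            problems.append(f"{tag}: image element needs frontal_image_url")
        if not 1 <= len(refs) <= 3:
            problems.append(f"{tag}: reference_image_urls must hold 1 to 3 images")
        if element.get("voice_id"):
            problems.append(f"{tag}: voice_id is only supported on video elements")
    if video_elements > 1:
        problems.append("at most one video element per request")
    return problems


elements = [
    {"frontal_image_url": "https://example.com/front.png",
     "reference_image_urls": ["https://example.com/side.png"]},
    {"video_url": "https://example.com/actor.mp4", "voice_id": "voice_1"},
]
print(check_elements(elements))  # []
```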

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). We do not map a format field — there is no model size/aspect_ratio parameter; aspect ratio is inferred from start_image_url (format_field: "").

Limits — model limits:

Video duration: 3–15 seconds (single-prompt duration); per-shot multi_prompt duration 1–15s; shot durations sum to total length.
prompt / negative_prompt: max 2500 characters each.
start_image_url / end_image_url / element images: max 10 MB; min 300×300 px; aspect ratio 0.40–2.50; formats jpg, jpeg, png, webp, gif, avif.
Element reference_image_urls: 1–3 images.
Element video_url: max 200 MB; 720–2160 px; 3.0–10.05 s; 24–60 fps; formats mp4, mov, webm, m4v, gif; max one video element per request.
Audio: native Chinese and English; other languages auto-translated to English.
Cost: ≈22 cr/s (audio off, the catalog default), ≈34 cr/s (audio on).

Kling O3 Video Edit `kling_o3_video_edit`

Video-to-video editing with Kling O3 — restyle footage, replace characters/objects, or insert elements into a source video using reference images and structured element definitions.

Call it via — video tool, action edit_ref (video(edit_ref) — requires video_url, prompt, reference_images) · raw: POST /v1/jobs/kling_o3_video_edit


Cost	126 cr per call
Mode / timeout	webhook / 15m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`prompt`	string	✓	—	maxLength 2500	Text prompt for the edit. Reference the source video as `@Video1`, elements as `@Element1`–`@ElementN`, and reference images as `@Image1`–`@ImageN`.
`video_url`	string	✓	—	.mp4/.mov only; 720–2160px; 3.0–10.05s; 24–60 FPS; ≤200MB	Reference (source) video URL to edit.
`image_urls`	string[] \| null	—	null	each image ≤10MB, ≥300×300px, aspect 0.40–2.50	Reference images for style/appearance, cited in prompt as `@Image1`, `@Image2`, … Max 4 total (elements + reference images) when using video.
`keep_audio`	boolean	—	`true`	true / false	Keep the original audio from the source video.
`elements`	object[] \| null	—	null	array of `{ frontal_image_url: string, reference_image_urls: string[] (1–3) }`	Elements (characters/objects) to inject, cited in prompt as `@Element1`, `@Element2`. Each element needs a frontal image and 1–3 reference images (per-image limits same as `image_urls`).
`shot_type`	string	—	`customize`	const `customize`	Multi-shot generation type (only `customize` is accepted).

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder). This model has no format mapping (no model size field). Our video(edit_ref) action collects reference photos under reference_images and maps them to the model's image_urls field; the optional elements argument passes through to the model's elements input (cite as @Element1).

Limits — prompt ≤2500 chars · source video .mp4/.mov, 3.0–10.05s, 720–2160px, 24–60 FPS, ≤200MB · reference/element images ≤10MB each, min 300×300px, aspect ratio 0.40–2.50 · max 4 total (elements + reference images) when using video.
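To illustrate how the `video(edit_ref)` arguments line up with the model input, here is a minimal sketch of the `reference_images` to `image_urls` mapping with the combined cap applied. The function is hypothetical, and it assumes each element counts as one item toward the limit of 4, which the page does not spell out.

```python
MAX_PROMPT_CHARS = 2500
MAX_REFS_WITH_VIDEO = 4  # elements + reference images combined


def edit_ref_to_model_input(video_url, prompt, reference_images, elements=None):
    """Map tool-level arguments to the kling_o3_video_edit input schema."""
    elements = list(elements or [])
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds 2500 characters")
    # Assumption: one element counts as one item toward the cap.
    if len(reference_images) + len(elements) > MAX_REFS_WITH_VIDEO:
        raise ValueError("at most 4 elements + reference images in total")
    payload = {
        "prompt": prompt,
        "video_url": video_url,
        "image_urls": list(reference_images),  # cited as @Image1, @Image2, ...
    }
    if elements:
        payload["elements"] = elements         # cited as @Element1, ...
    return payload


payload = edit_ref_to_model_input(
    "https://example.com/source.mp4",
    "Restyle @Video1 to match the palette of @Image1",
    ["https://example.com/palette.png"],
)
print(sorted(payload))
```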

PixVerse Swap `pixverse_swap`

Generate high-quality video clips by swapping a person, object, or background in source footage using a reference image — keyframe-based, prompt-free.

Call it via — video tool, action swap (routes to pixverse_swap) · raw: POST /v1/jobs/pixverse_swap


Cost	30 cr per call
Mode / timeout	webhook / 15m (from our YAML)

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`video_url`	string	✓	—	URL	URL of the external video to swap.
`image_url`	string	✓	—	URL	URL of the target image for swapping (the element to swap IN).
`mode`	string		`person`	`person`, `object`, `background`	The swap mode to use.
`keyframe_id`	integer		`1`	min `1`, max = `duration_seconds × 24`	Keyframe ID for face/object mapping. Input video is normalized to 24 FPS, so keyframe 1 = first frame, keyframe 24 = 1s in.
`resolution`	string		`720p`	`360p`, `540p`, `720p`	Output resolution (1080p not supported).
`original_sound_switch`	boolean		`true`	true / false	Whether to keep the original audio.
`seed`	integer \| null		`null`	any integer	Random seed for generation.

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder for testing). This model does not use our format→size mapping (format_field is empty).

Limits:

Input video formats: MP4, MOV, WebM, M4V, GIF.
Reference image formats: JPG, JPEG, PNG, WebP, GIF, AVIF.
Resolution: 360p / 540p / 720p (1080p listed but not supported).
Cost is per 5-second clip; videos longer than 5s cost double. Best quality on clips under ~10 seconds.
keyframe_id upper bound is duration_seconds × 24 (24 FPS normalized).
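The keyframe arithmetic and the per-clip pricing can be worked through in a few lines. Both helpers are hypothetical sketches: the rounding rule is chosen so that it reproduces the two documented anchor points (keyframe 1 is the first frame, keyframe 24 is 1 s in), and the cost function simply doubles the 30 cr base for clips longer than 5 s.

```python
NORMALIZED_FPS = 24
BASE_COST_CR = 30


def keyframe_for_time(seconds_in, duration_seconds):
    """Pick a keyframe_id for a timestamp, clamped to the valid range."""
    max_id = int(duration_seconds * NORMALIZED_FPS)
    keyframe = round(seconds_in * NORMALIZED_FPS)
    return max(1, min(keyframe, max_id))


def swap_cost(duration_seconds):
    """Credits for one swap: base price, doubled for clips longer than 5 s."""
    return BASE_COST_CR * 2 if duration_seconds > 5 else BASE_COST_CR


print(keyframe_for_time(0.0, 5.0))   # 1
print(keyframe_for_time(1.0, 5.0))   # 24
print(keyframe_for_time(9.0, 5.0))   # 120 (clamped to 5 s x 24)
print(swap_cost(4.0), swap_cost(8.0))  # 30 60
```

Choosing a keyframe where the target face or object is clearly visible is usually what matters; the clamp only guards against asking for a frame past the end of the clip.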

Wan 2.7 Video Edit `wan_27_video_edit`

Video-to-video editing driven by a text instruction (and optional reference image) — restyle, transform scenes, or apply style transfer to existing footage using WAN 2.7.

Call it via — video tool, action: "edit" (restyle existing footage) · raw: POST /v1/jobs/wan_27_video_edit


Cost	100 cr per call
Mode / timeout	webhook / 15m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`prompt`	string	✓	—	minLength 1	Editing instruction or style-transfer description.
`video_url`	string	✓	—	MP4/MOV, 2–10s, ≤100 MB	URL of the input video to edit.
`reference_image_url`	string (nullable)		null	jpg/jpeg/png/webp/gif/avif	Reference image URL for reference-based editing.
`resolution`	string		`1080p`	`720p`, `1080p`	Output video resolution tier.
`aspect_ratio`	string (nullable)		null (matches input)	`16:9`, `9:16`, `1:1`, `4:3`, `3:4`	Aspect ratio of the generated video; defaults to the input video's.
`duration`	integer		`0`	`0`, `2`–`10`	Output duration in seconds. `0` = match input; when set (2–10) truncates from the start.
`audio_setting`	string		`auto`	`auto`, `origin`	Audio handling. `auto`: model decides whether to regenerate audio. `origin`: preserve original audio.
`seed`	integer (nullable)		null	0–2147483647	Random seed for reproducibility.
`enable_safety_checker`	boolean		`true`	`true` / `false`	Enable content moderation on input and output.

Wrapper params (our API, not part of the model input schema): out (required — workdir-relative output filename), mock (optional — return a test placeholder, skips the model call). This model defines format_field: "", so there is no format → model-size mapping.

Limits — Source video: MP4/MOV, duration 2–10 s, max file size 100 MB (upload timeout 30 s). Reference image formats: jpg, jpeg, png, webp, gif, avif. Output duration: 0 (match input) or 2–10 s. Output resolution: 720p or 1080p. Seed range: 0–2147483647.
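A small builder can enforce the enumerated values and ranges from the parameter table before a job is submitted. The function below is a sketch with a hypothetical name; it validates only what can be checked without downloading the source video.

```python
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")
MAX_SEED = 2147483647


def build_wan_edit_input(prompt, video_url, *, resolution="1080p", duration=0,
                         audio_setting="auto", aspect_ratio=None, seed=None,
                         reference_image_url=None):
    """Return a wan_27_video_edit input dict, raising ValueError on bad values."""
    if not prompt:
        raise ValueError("prompt must be at least 1 character")
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    if duration != 0 and not 2 <= duration <= 10:
        raise ValueError("duration must be 0 (match input) or 2 to 10 s")
    if audio_setting not in ("auto", "origin"):
        raise ValueError("audio_setting must be auto or origin")
    if aspect_ratio is not None and aspect_ratio not in ASPECT_RATIOS:
        raise ValueError("unsupported aspect_ratio")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError("seed must be between 0 and 2147483647")
    payload = {"prompt": prompt, "video_url": video_url,
               "resolution": resolution, "duration": duration,
               "audio_setting": audio_setting}
    # Nullable fields are left out so the model defaults apply.
    if aspect_ratio is not None:
        payload["aspect_ratio"] = aspect_ratio
    if seed is not None:
        payload["seed"] = seed
    if reference_image_url is not None:
        payload["reference_image_url"] = reference_image_url
    return payload


wan_input = build_wan_edit_input(
    "Turn the scene into a watercolor painting",
    "https://example.com/clip.mp4",
    duration=4, audio_setting="origin", seed=42,
)
print(wan_input["duration"], wan_input["seed"])  # 4 42
```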

Topaz Video Upscale `topaz_upscale_video`

Professional-grade video upscaling and enhancement using Topaz technology — upscale resolution, interpolate frames, and clean up noise/compression artifacts.

Call it via — video tool, action upscale (pass video_url) · raw: POST /v1/jobs/topaz_upscale_video


Cost	100 cr per call
Mode / timeout	webhook / 15m

Parameters — the model's input schema:

Param	Type	Required	Default	Allowed / range	Description
`video_url`	string	✓	—	—	URL of the video to upscale.
`model`	string		`Proteus`	`Proteus`, `Artemis HQ`, `Artemis MQ`, `Artemis LQ`, `Nyx`, `Nyx Fast`, `Nyx XL`, `Nyx HF`, `Gaia HQ`, `Gaia CG`, `Gaia 2`, `Starlight Precise 1`, `Starlight Precise 2`, `Starlight Precise 2.5`, `Starlight HQ`, `Starlight Mini`, `Starlight Sharp`, `Starlight Fast 1`, `Starlight Fast 2`	Enhancement model. Proteus = most videos; Artemis = denoise+sharpen; Nyx = dedicated denoising; Gaia HQ/CG = rendered content; Gaia 2 = animation/motion graphics at 2x; Starlight = generative diffusion-based upscaling.
`upscale_factor`	number		`2`	1–4	Factor to upscale by (e.g. 2.0 doubles width and height).
`target_fps`	integer		— (null)	16–60	Target FPS for frame interpolation. If set, interpolation is enabled.
`compression`	number		— (null, model-dependent)	0.0–1.0	Compression artifact removal level.
`noise`	number		— (null, model-dependent)	0.0–1.0	Noise reduction level.
`halo`	number		— (null, model-dependent)	0.0–1.0	Halo reduction level.
`grain`	number		— (null, model-dependent)	0.0–0.1 (step 0.01)	Film grain amount.
`recover_detail`	number		— (null)	0.0–1.0	Recover original detail; higher preserves more original detail.
`H264_output`	boolean		`false`	true / false	Use H264 codec for output. Default (false) = H265.

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). This model has no format mapping (format_field is empty).

Limits — accepted input formats: mp4, mov, webm, m4v, gif. Max upscale_factor 4x; target_fps capped at 60. Pricing scales with duration and resolution: 2 cr/sec up to 720p, 4 cr/sec for 720p–1080p, 16 cr/sec above 1080p; price doubles for 60fps output; Gaia 2 costs half. (No published max duration / resolution / file-size limit.)
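The per-second rates above can be turned into a rough estimator. This sketch makes several assumptions the page does not confirm: that the tier is decided by the output height in pixels, that the 60 fps surcharge applies when `target_fps` is exactly 60, and that the Gaia 2 discount applies after the other multipliers. Treat the result as an estimate only.

```python
def estimate_topaz_credits(duration_seconds, output_height, target_fps=None,
                           model="Proteus"):
    """Estimate credits from the published per-second rates (hypothetical helper)."""
    # Assumption: tiers are keyed on output height in pixels.
    if output_height <= 720:
        rate = 2.0
    elif output_height <= 1080:
        rate = 4.0
    else:
        rate = 16.0
    if target_fps == 60:      # assumption: surcharge only at exactly 60 fps
        rate *= 2
    if model == "Gaia 2":     # Gaia 2 costs half
        rate /= 2
    return rate * duration_seconds


print(estimate_topaz_credits(10, 1080))                 # 40.0
print(estimate_topaz_credits(10, 2160, target_fps=60))  # 320.0
print(estimate_topaz_credits(10, 720, model="Gaia 2"))  # 10.0
```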
