
Video models

Video generation, image-to-video, editing, swap, and upscaling — model input schemas.

Generations are charged in credits (see Credits & plans). Every generation model also accepts mock: true for a free placeholder result.
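As a rough illustration of the mock flag, the sketch below assembles the path and JSON body for a raw submission. The per-model sections list raw endpoints of the form POST /v1/jobs/<model_id>; everything else here is an assumption: the helper name build_job_request is hypothetical, and it assumes that mock (and the wrapper's out) travel in the same JSON body as the model params. Host and authentication are not covered on this page, so they are left out.

```python
import json


def build_job_request(model_id: str, params: dict, mock: bool = False) -> tuple[str, str]:
    """Return the request path and JSON body for a raw job submission."""
    body = dict(params)  # shallow copy so the caller's dict is left untouched
    if mock:
        # ASSUMPTION: mock is sent in the same JSON body as the other params.
        body["mock"] = True
    return f"/v1/jobs/{model_id}", json.dumps(body, sort_keys=True)


path, body = build_job_request(
    "seedance_r2v",
    {"prompt": "a red kite over dunes", "out": "kite.mp4"},
    mock=True,
)
print("POST", path)
print(body)
```

Leaving mock at its default of False produces the same body without the flag, which makes it easy to flip a dry run into a billed run.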

Seedance 2.0 Reference-to-Video seedance_r2v

ByteDance's reference-to-video model that generates a clip from a text prompt plus up to 9 reference images, 3 videos, and 3 audio clips for identity, motion, and voice consistency. Output resolution goes up to native 4K.

Call it via: video tool, action: "create" (text→video; optional reference_images, video_urls, audio_urls) · raw: POST /v1/jobs/seedance_r2v

Cost: 303 cr per call (5 s at the default 720p). Scales with resolution: 480p ≈ 135 cr, 1080p 681 cr, 4K 1555 cr per 5 s.
Mode / timeout: webhook / 15m
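The per-5-second figures above can be turned into a rough budget estimate. This is a minimal sketch, not the billing logic: it assumes credits scale linearly with clip length, which the page does not state, and the function name is hypothetical. A duration of auto cannot be estimated this way because the length is not known up front.

```python
# Credits per 5 s clip, copied from the cost line above.
CREDITS_PER_5S = {"480p": 135, "720p": 303, "1080p": 681, "4k": 1555}


def estimate_seedance_credits(resolution: str = "720p", seconds: int = 5) -> float:
    """Estimate credits for one seedance_r2v call."""
    if resolution not in CREDITS_PER_5S:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if not 4 <= seconds <= 15:
        raise ValueError("duration must be 4-15 s")
    # ASSUMPTION: cost grows linearly with duration from the 5 s reference price.
    return CREDITS_PER_5S[resolution] * seconds / 5


print(estimate_seedance_credits())             # default 720p, 5 s
print(estimate_seedance_credits("1080p", 10))  # 10 s at 1080p
```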

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | | | | Text prompt used to generate the video. Refer to references as @Image1, @Video1, @Audio1. |
| image_urls | list<string> | | | up to 9; JPEG/PNG/WebP; ≤30 MB each | Reference images. Refer to them as @Image1, @Image2… Total files across all modalities ≤ 12. |
| video_urls | list<string> | | | up to 3; MP4/MOV; combined 2–15 s; total <50 MB; each ~480p (640×640) to ~720p (834×1112) | Reference videos. Refer to them as @Video1, @Video2… |
| audio_urls | list<string> | | | up to 3; MP3/WAV; combined ≤15 s; ≤15 MB each | Reference audio. Refer to them as @Audio1… If audio is provided, at least one reference image or video is required. |
| resolution | enum | | 720p | 480p, 720p, 1080p, 4k | 480p for cheap drafts (~0.45× credits), 720p default, 1080p for final delivery (2.25×), 4k for hero shots (~5.1×). |
| duration | enum | | auto | auto, 4–15 | Duration in seconds, or auto to let the model decide. |
| aspect_ratio | enum | | auto | auto, 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 | Aspect ratio of the generated video. When omitted, our wrapper applies its vertical preset (9:16); pass auto explicitly to follow the reference images' geometry. |
| generate_audio | boolean | | true | | Generate synchronized audio (SFX, ambient, lip-synced speech). Cost is the same either way. |
| bitrate_mode | enum | | standard | standard, high | Output bitrate mode; high requests a higher-quality, larger-file encode. |
| end_user_id | string | | | | Unique ID of the end user. |

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder), and format (optional — size preset shorts/reels/horizontal, mapped by our format_field/format_mapping to the model's aspect_ratio: shorts/reels→9:16, horizontal→16:9, default 9:16).

Limits — prompt: text only. image_urls: max 9 images, JPEG/PNG/WebP, ≤30 MB each. video_urls: max 3 videos, MP4/MOV, combined 2–15 s, total <50 MB, each between ~480p (640×640) and ~720p (834×1112). audio_urls: max 3 files, MP3/WAV, combined ≤15 s, ≤15 MB each; requires at least one image or video reference. Total reference files across all modalities ≤ 12. Output resolution up to native 4K; duration 4–15 s (or auto). No seed input — every render is a new take.
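The count-based limits above are easy to check before submitting. The sketch below covers only the per-modality caps, the combined cap of 12, and the rule that audio needs an image or video alongside it; file size, format, and duration checks would need the actual files and are not attempted. The function name is hypothetical.

```python
def check_seedance_refs(image_urls=(), video_urls=(), audio_urls=()) -> list[str]:
    """Return a list of limit violations (empty when the references are acceptable)."""
    problems = []
    if len(image_urls) > 9:
        problems.append("image_urls: at most 9 images")
    if len(video_urls) > 3:
        problems.append("video_urls: at most 3 videos")
    if len(audio_urls) > 3:
        problems.append("audio_urls: at most 3 files")
    if len(image_urls) + len(video_urls) + len(audio_urls) > 12:
        problems.append("total reference files must be 12 or fewer")
    if audio_urls and not (image_urls or video_urls):
        problems.append("audio_urls requires at least one image or video reference")
    return problems


print(check_seedance_refs(audio_urls=["voice.mp3"]))
print(check_seedance_refs(image_urls=["face.png"], audio_urls=["voice.mp3"]))
```

Returning a list rather than raising on the first failure lets a caller report every problem in one pass.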

Kling v3 Standard Image-to-Video kling_v3_std_i2v

Image-to-video at standard quality with cinematic visuals, fluid motion, native audio generation, and custom element support — use for quick drafts and iterations before pro renders.

Call it via: image tool, action: "animate", tier: "standard" (the default animate tier) · raw: POST /v1/jobs/kling_v3_std_i2v

The image(animate) tool exposes the multi-shot timeline directly: pass multi_prompt (an array of {prompt, duration} shots) and optional shot_type instead of a single prompt. The tool validates Kling's caps before submitting — at most 6 shots and a combined duration ≤ 15 s (each shot 1–15 s, default 5) — and rejects prompt + multi_prompt together.
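The multi-shot caps described above can be mirrored client-side. This is a sketch of the same rules (at most 6 shots, each 1–15 s with a default of 5, combined length at most 15 s), not the tool's actual validator; the function name is hypothetical, and durations are read as strings because the schema declares them as string enums.

```python
def total_shot_seconds(multi_prompt: list[dict]) -> int:
    """Validate a multi_prompt timeline and return its combined duration in seconds."""
    if not 1 <= len(multi_prompt) <= 6:
        raise ValueError("multi_prompt needs between 1 and 6 shots")
    total = 0
    for index, shot in enumerate(multi_prompt, start=1):
        if not shot.get("prompt"):
            raise ValueError(f"shot {index}: prompt is required")
        seconds = int(shot.get("duration", "5"))  # per-shot default is "5"
        if not 1 <= seconds <= 15:
            raise ValueError(f"shot {index}: duration must be 1-15 s")
        total += seconds
    if total > 15:
        raise ValueError(f"combined duration {total} s exceeds 15 s")
    return total


shots = [
    {"prompt": "wide shot of the harbor at dawn", "duration": "4"},
    {"prompt": "close-up of the fisherman's hands"},  # falls back to 5 s
]
print(total_shot_seconds(shots))
```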

Cost: 84 cr per call
Mode / timeout: webhook / 15m

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| start_image_url | string | | | | URL of the image used as the starting frame of the video. |
| prompt | string | | | maxLength 2500 | Text prompt for video generation. Either prompt or multi_prompt must be provided, but not both. |
| multi_prompt | array<object> | | | items: { prompt: string (required), duration: string default "5", enum "1"–"15" } | List of prompts for multi-shot generation; divides the video into multiple shots. |
| duration | string | | "5" | "3","4","5","6","7","8","9","10","11","12","13","14","15" | Duration of the generated video in seconds. |
| generate_audio | boolean | | true | | Generate native audio for the video. Supports Chinese/English; other languages auto-translated to English. |
| end_image_url | string | | | | URL of the image used as the end frame of the video. |
| elements | array<object> | | | items: { frontal_image_url, reference_image_urls (1–3, ≥1 required), video_url, voice_id } | Characters/objects to inject. Each entry is either an image set (frontal + reference images) or a video. Reference in prompt as @Element1, @Element2, etc. Only one element may carry a video. |
| shot_type | string | | "customize" | customize, intelligent | Multi-shot generation type; intelligent lets the model auto-determine shot structure. |
| negative_prompt | string | | "blur, distort, and low quality" | maxLength 2500 | What to steer away from. |
| cfg_scale | number | | 0.5 | 0–1 | Classifier-Free Guidance scale: how strictly the model follows the prompt. |

Our wrapper params (not part of the model input schema): out (required — output filename) and mock (optional — test placeholder). format is accepted by our image MCP tool but is NOT forwarded to this model (the model has no size/aspect field; YAML format_field is empty), so it has no effect here.

Limits (model limits):

  • Prompt / negative_prompt: max 2500 characters each.
  • Duration: 3–15 s (top-level duration); per-shot multi_prompt duration 1–15 s.
  • start_image_url / end_image_url / element images: max file size 10 MB, min 300×300 px, aspect ratio 0.40–2.50; accepted formats jpg, jpeg, png, webp, gif, avif.
  • Element video_url: max 200 MB, 720–2160 px per side, 3–10.05 s, 24–60 FPS; accepted formats mp4, mov, webm, m4v, gif.
  • Element reference_image_urls: 1–3 images, at least one required.

Kling v3 Pro Image-to-Video kling_v3_pro_i2v

Top-tier image-to-video with cinematic visuals, fluid motion, native audio generation, and custom element (character/object) injection.

Call it via: MCP tool image, action animate with tier: "pro" (routes animate_pro → kling_v3_pro_i2v) · raw: POST /v1/jobs/kling_v3_pro_i2v

The image(animate) tool exposes the multi-shot timeline directly: pass multi_prompt (an array of {prompt, duration} shots) and optional shot_type instead of a single prompt. The tool validates Kling's caps before submitting — at most 6 shots and a combined duration ≤ 15 s (each shot 1–15 s, default 5) — and rejects prompt + multi_prompt together. Billed per second (no per-shot surcharge).

Cost: 112 cr per call
Mode / timeout: webhook / 15m (from our YAML)

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| start_image_url | string | | | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | URL of the start frame image. Aspect ratio of the output is inferred from this image. |
| prompt | string | | | maxLength 2500 | Text prompt. Either prompt or multi_prompt must be provided, but not both. |
| multi_prompt | KlingV3MultiPromptElement[] | | | array of {prompt (req), duration} | Multi-shot prompt list; divides the video into shots. Overrides prompt. Each shot duration enum "1"–"15", default "5". |
| duration | string (enum) | | "5" | "3","4","5","6","7","8","9","10","11","12","13","14","15" | Total video length in seconds. |
| generate_audio | boolean | | true | | Generate native audio (Chinese/English native; other languages auto-translated to English). |
| end_image_url | string (nullable) | | | Max 10MB; min 300×300px; aspect ratio 0.40–2.50 | Optional end frame image URL (start-to-end interpolation). |
| elements | KlingV3ComboElementInput[] (nullable) | | | array | Reference characters/objects to inject. Each item is an image set (frontal_image_url + reference_image_urls) or a video (video_url), with optional voice_id. Reference in prompt as @Element1, @Element2. |
| shot_type | string (enum) | | "customize" | customize, intelligent | Multi-shot generation type; intelligent lets the model auto-plan shot structure. |
| negative_prompt | string | | "blur, distort, and low quality" | maxLength 2500 | Things to avoid. |
| cfg_scale | number | | 0.5 | 0–1 | Classifier-free guidance scale; higher = stricter prompt adherence. |

elements[] sub-fields: frontal_image_url (string, main view), reference_image_urls (string[], 1–3 images from different angles, at least one required when using image elements), video_url (string, max one video element per request), voice_id (string; voice binding supported only for video elements, not image elements).

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). We do not map a format field — there is no model size/aspect_ratio parameter; aspect ratio is inferred from start_image_url (format_field: "").

Limits — model limits:

  • Video duration: 3–15 seconds (single-prompt duration); per-shot multi_prompt duration 1–15s; shot durations sum to total length.
  • prompt / negative_prompt: max 2500 characters each.
  • start_image_url / end_image_url / element images: max 10 MB; min 300×300 px; aspect ratio 0.40–2.50; formats jpg, jpeg, png, webp, gif, avif.
  • Element reference_image_urls: 1–3 images.
  • Element video_url: max 200 MB; 720–2160 px; 3.0–10.05 s; 24–60 fps; formats mp4, mov, webm, m4v, gif; max one video element per request.
  • Audio: native Chinese and English; other languages auto-translated to English.
  • Cost: ≈22 cr/s (audio off, the catalog default), ≈34 cr/s (audio on).
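The per-second rates in the last bullet give a quick way to budget a Pro render. This is a sketch under the stated approximations: both rates are marked ≈ on this page, so the result is an estimate and will not match the billed amount exactly (5 s with audio off comes out at 110 cr here, against the listed 112 cr per call). The function name is hypothetical.

```python
# Approximate credits per second, taken from the cost bullet above.
CREDITS_PER_SECOND = {False: 22, True: 34}  # keyed by generate_audio


def estimate_kling_pro_credits(seconds: int, generate_audio: bool = True) -> int:
    """Estimate credits for one kling_v3_pro_i2v call of the given total length."""
    if not 1 <= seconds <= 15:
        raise ValueError("total duration must be between 1 and 15 s")
    return CREDITS_PER_SECOND[generate_audio] * seconds


print(estimate_kling_pro_credits(5, generate_audio=False))
print(estimate_kling_pro_credits(10, generate_audio=True))
```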

Kling O3 Video Edit kling_o3_video_edit

Video-to-video editing with Kling O3 — restyle footage, replace characters/objects, or insert elements into a source video using reference images and structured element definitions.

Call it via: video tool, action edit_ref (video(edit_ref); requires video_url, prompt, reference_images) · raw: POST /v1/jobs/kling_o3_video_edit

Cost: 126 cr per call
Mode / timeout: webhook / 15m

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | | | maxLength 2500 | Text prompt for the edit. Reference the source video as @Video1, elements as @Element1…@ElementN, and reference images as @Image1…@ImageN. |
| video_url | string | | | .mp4/.mov only; 720–2160px; 3.0–10.05s; 24–60 FPS; ≤200MB | Reference (source) video URL to edit. |
| image_urls | string[] (nullable) | | null | each image ≤10MB, ≥300×300px, aspect 0.40–2.50 | Reference images for style/appearance, cited in prompt as @Image1, @Image2, … Max 4 total (elements + reference images) when using video. |
| keep_audio | boolean | | true | true / false | Keep the original audio from the source video. |
| elements | object[] (nullable) | | null | array of { frontal_image_url: string, reference_image_urls: string[] (1–3) } | Elements (characters/objects) to inject, cited in prompt as @Element1, @Element2. Each element needs a frontal image and 1–3 reference images (per-image limits same as image_urls). |
| shot_type | string | | customize | const customize | Multi-shot generation type (only customize is accepted). |

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder). This model has no format mapping (no model size field). Our video(edit_ref) action collects reference photos under reference_images and maps them to the model's image_urls field; the optional elements argument passes through to the model's elements input (cite as @Element1).

Limits — prompt ≤2500 chars · source video .mp4/.mov, 3.0–10.05s, 720–2160px, 24–60 FPS, ≤200MB · reference/element images ≤10MB each, min 300×300px, aspect ratio 0.40–2.50 · max 4 total (elements + reference images) when using video.
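The mapping from the tool's reference_images argument to the model's image_urls field, together with the combined cap of 4, might look like the sketch below. It assumes each element counts as one entry toward the cap (the page does not say whether elements are counted per element or per image), and the function name is hypothetical.

```python
def build_edit_ref_input(video_url, prompt, reference_images, elements=None) -> dict:
    """Build a kling_o3_video_edit input dict from video(edit_ref)-style arguments."""
    elements = elements or []
    if len(prompt) > 2500:
        raise ValueError("prompt exceeds 2500 characters")
    # ASSUMPTION: every element counts as one entry toward the cap of 4.
    if len(reference_images) + len(elements) > 4:
        raise ValueError("at most 4 elements + reference images in total")
    for index, element in enumerate(elements, start=1):
        refs = element.get("reference_image_urls", [])
        if not element.get("frontal_image_url") or not 1 <= len(refs) <= 3:
            raise ValueError(f"element {index}: needs a frontal image and 1-3 reference images")
    body = {
        "prompt": prompt,
        "video_url": video_url,
        "image_urls": list(reference_images),  # reference_images maps to image_urls
    }
    if elements:
        body["elements"] = elements
    return body


body = build_edit_ref_input(
    "https://example.com/source.mp4",
    "Restyle @Video1 to match the palette of @Image1",
    ["https://example.com/palette.png"],
)
print(body)
```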

PixVerse Swap pixverse_swap

Generate high-quality video clips by swapping a person, object, or background in source footage using a reference image — keyframe-based, prompt-free.

Call it via: video tool, action swap (routes to pixverse_swap) · raw: POST /v1/jobs/pixverse_swap

Cost: 30 cr per call
Mode / timeout: webhook / 15m (from our YAML)

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | | | URL | URL of the external video to swap. |
| image_url | string | | | URL | URL of the target image for swapping (the element to swap IN). |
| mode | string | | person | person, object, background | The swap mode to use. |
| keyframe_id | integer | | 1 | min 1, max = duration_seconds × 24 | Keyframe ID for face/object mapping. Input video is normalized to 24 FPS, so keyframe 1 = first frame, keyframe 24 = 1s in. |
| resolution | string | | 720p | 360p, 540p, 720p | Output resolution (1080p not supported). |
| original_sound_switch | boolean | | true | true / false | Whether to keep the original audio. |
| seed | integer (nullable) | | null | any integer | Random seed for generation. |

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — skip the API call and return a placeholder for testing). This model does not use our format→size mapping (format_field is empty).

Limits:

  • Input video formats: MP4, MOV, WebM, M4V, GIF.
  • Reference image formats: JPG, JPEG, PNG, WebP, GIF, AVIF.
  • Resolution: 360p / 540p / 720p (1080p listed but not supported).
  • Cost is per 5-second clip; videos longer than 5s cost double. Best quality on clips under ~10 seconds.
  • keyframe_id upper bound is duration_seconds × 24 (24 FPS normalized).
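Since keyframe_id is counted at the normalized 24 FPS, a timestamp can be converted to a keyframe with a small helper. This is a sketch consistent with the two anchor points given on this page (keyframe 1 is the first frame, keyframe 24 is 1 s in); the rounding and clamping behavior in between is an assumption, and the function name is hypothetical.

```python
FPS = 24  # input video is normalized to 24 FPS


def keyframe_for_time(seconds_in: float, duration_seconds: float) -> int:
    """Map a timestamp in the source video to a keyframe_id within [1, duration * 24]."""
    max_id = int(duration_seconds * FPS)
    if max_id < 1:
        raise ValueError("video is too short to contain a keyframe")
    # ASSUMPTION: round to the nearest frame, then clamp into the allowed range.
    frame = round(seconds_in * FPS)
    return max(1, min(frame, max_id))


print(keyframe_for_time(0.0, 5))  # start of the clip
print(keyframe_for_time(1.0, 5))  # one second in
print(keyframe_for_time(9.0, 5))  # past the end, clamped to the last keyframe
```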

Wan 2.7 Video Edit wan_27_video_edit

Video-to-video editing driven by a text instruction (and optional reference image) — restyle, transform scenes, or apply style transfer to existing footage using WAN 2.7.

Call it via: video tool, action: "edit" (restyle existing footage) · raw: POST /v1/jobs/wan_27_video_edit

Cost: 100 cr per call
Mode / timeout: webhook / 15m

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| prompt | string | | | minLength 1 | Editing instruction or style-transfer description. |
| video_url | string | | | MP4/MOV, 2–10s, ≤100 MB | URL of the input video to edit. |
| reference_image_url | string (nullable) | | null | jpg/jpeg/png/webp/gif/avif | Reference image URL for reference-based editing. |
| resolution | string | | 1080p | 720p, 1080p | Output video resolution tier. |
| aspect_ratio | string (nullable) | | null (matches input) | 16:9, 9:16, 1:1, 4:3, 3:4 | Aspect ratio of the generated video; defaults to the input video's. |
| duration | integer | | 0 | 0, 2–10 | Output duration in seconds. 0 = match input; when set (2–10) truncates from the start. |
| audio_setting | string | | auto | auto, origin | Audio handling. auto: model decides whether to regenerate audio. origin: preserve original audio. |
| seed | integer (nullable) | | null | 0–2147483647 | Random seed for reproducibility. |
| enable_safety_checker | boolean | | true | true / false | Enable content moderation on input and output. |

Wrapper params (our API, not part of the model input schema): out (required — workdir-relative output filename), mock (optional — return a test placeholder, skips the model call). This model defines format_field: "", so there is no format → model-size mapping.

Limits — Source video: MP4/MOV, duration 2–10 s, max file size 100 MB (upload timeout 30 s). Reference image formats: jpg, jpeg, png, webp, gif, avif. Output duration: 0 (match input) or 2–10 s. Output resolution: 720p or 1080p. Seed range: 0–2147483647.
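The value ranges above can be checked before a job is submitted. The sketch below builds a model input dict and rejects out-of-range values for resolution, aspect_ratio, duration, audio_setting, and seed; it does not inspect the video file itself. The function name is hypothetical, and the defaults mirror the schema defaults listed on this page.

```python
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


def build_wan_edit_input(prompt, video_url, *, resolution="1080p", aspect_ratio=None,
                         duration=0, audio_setting="auto", seed=None) -> dict:
    """Build a wan_27_video_edit input dict, raising ValueError on out-of-range values."""
    if not prompt:
        raise ValueError("prompt must be at least 1 character")
    if resolution not in {"720p", "1080p"}:
        raise ValueError("resolution must be 720p or 1080p")
    if aspect_ratio is not None and aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect_ratio: {aspect_ratio!r}")
    if duration != 0 and not 2 <= duration <= 10:
        raise ValueError("duration must be 0 (match input) or 2-10 s")
    if audio_setting not in {"auto", "origin"}:
        raise ValueError("audio_setting must be auto or origin")
    if seed is not None and not 0 <= seed <= 2147483647:
        raise ValueError("seed must be in 0-2147483647")
    return {
        "prompt": prompt,
        "video_url": video_url,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,  # None lets the output match the input video
        "duration": duration,
        "audio_setting": audio_setting,
        "seed": seed,
    }


job = build_wan_edit_input("turn the scene into a watercolor painting",
                           "https://example.com/clip.mp4", seed=42)
print(job)
```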

Topaz Video Upscale topaz_upscale_video

Professional-grade video upscaling and enhancement using Topaz technology — upscale resolution, interpolate frames, and clean up noise/compression artifacts.

Call it via: video tool, action upscale (pass video_url) · raw: POST /v1/jobs/topaz_upscale_video

Cost: 100 cr per call
Mode / timeout: webhook / 15m

Parameters — the model's input schema:

| Param | Type | Required | Default | Allowed / range | Description |
|---|---|---|---|---|---|
| video_url | string | | | | URL of the video to upscale. |
| model | string | | Proteus | Proteus, Artemis HQ, Artemis MQ, Artemis LQ, Nyx, Nyx Fast, Nyx XL, Nyx HF, Gaia HQ, Gaia CG, Gaia 2, Starlight Precise 1, Starlight Precise 2, Starlight Precise 2.5, Starlight HQ, Starlight Mini, Starlight Sharp, Starlight Fast 1, Starlight Fast 2 | Enhancement model. Proteus = most videos; Artemis = denoise+sharpen; Nyx = dedicated denoising; Gaia HQ/CG = rendered content; Gaia 2 = animation/motion graphics at 2x; Starlight = generative diffusion-based upscaling. |
| upscale_factor | number | | 2 | 1–4 | Factor to upscale by (e.g. 2.0 doubles width and height). |
| target_fps | integer | | null | 16–60 | Target FPS for frame interpolation. If set, interpolation is enabled. |
| compression | number | | null (model-dependent) | 0.0–1.0 | Compression artifact removal level. |
| noise | number | | null (model-dependent) | 0.0–1.0 | Noise reduction level. |
| halo | number | | null (model-dependent) | 0.0–1.0 | Halo reduction level. |
| grain | number | | null (model-dependent) | 0.0–0.1 (step 0.01) | Film grain amount. |
| recover_detail | number | | null | 0.0–1.0 | Recover original detail; higher preserves more original detail. |
| H264_output | boolean | | false | true / false | Use H264 codec for output. Default (false) = H265. |

Our wrapper params (not part of the model input schema): out (required — workdir-relative output path), mock (optional — test placeholder). This model has no format mapping (format_field is empty).

Limits — accepted input formats: mp4, mov, webm, m4v, gif. Max upscale_factor 4x; target_fps capped at 60. Pricing scales with duration and resolution: 2 cr/sec up to 720p, 4 cr/sec for 720p–1080p, 16 cr/sec above 1080p; price doubles for 60fps output; Gaia 2 costs half. (No published max duration / resolution / file-size limit.)
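The pricing tiers above can be combined into a rough estimator. Several points are assumptions, since the page does not spell them out: the tier is picked from the output height in pixels, tier boundaries are inclusive at 720 and 1080, and "60fps output" is read as target_fps equal to 60. The function name is hypothetical.

```python
def estimate_topaz_credits(seconds: float, output_height: int,
                           target_fps=None, model: str = "Proteus") -> float:
    """Estimate credits for one topaz_upscale_video call from the published tiers."""
    # ASSUMPTION: tiers are keyed on output height, with inclusive upper bounds.
    if output_height <= 720:
        rate = 2
    elif output_height <= 1080:
        rate = 4
    else:
        rate = 16
    credits = rate * seconds
    if target_fps == 60:   # price doubles for 60fps output
        credits *= 2
    if model == "Gaia 2":  # Gaia 2 costs half
        credits /= 2
    return credits


print(estimate_topaz_credits(10, 1080))
print(estimate_topaz_credits(10, 2160, target_fps=60))
print(estimate_topaz_credits(10, 720, model="Gaia 2"))
```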

Framehood