I’m seeing a severe time gap between AI Render for still images vs. video.
Still image (AI Render): ~1 minute per image (finishes reliably).
Video: 1024 resolution, 15 fps, 1-second clip (15 frames) — still not finished after 6 hours.
If I generate 15 still images manually, I can complete them in ~15 minutes, and with careful prompting I can keep frame-to-frame consistency—similar to how Iray users render frame sequences. In contrast, using AI Render’s “video” workflow for the same 15 frames takes 6+ hours and may not complete.
Questions / concerns:
Why does AI Render video take orders of magnitude longer than rendering the exact same number of frames as individual stills?
Is the video mode doing additional per-frame processing (e.g., temporal coherence passes) that cannot be optionally reduced or skipped?
Can you provide guidance or settings to achieve parity with the “15 stills” approach without losing temporal consistency?
Are there planned optimizations (e.g., frame caching/reuse, motion-vector guidance, adjustable temporal strength) to bring video times closer to stills?
AI Render is a valuable feature with real potential to move the industry forward. Please prioritize performance and workflow options for video generation so short clips (e.g., 5–15 seconds at 15 fps, 1024) are practical within typical production timelines.
@yoshihi Video AI models store all frames of your video in VRAM at once, not just a single frame. This can easily exceed your available VRAM.
If this happens, the model falls back to shared system RAM and the CPU, which increases rendering times drastically. So instead of 5 minutes per frame you might need to wait 50 minutes per frame, and therefore over 12.5 hours for your 15-frame video.
You need to reduce the video resolution. I recommend starting at 480p.
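To get a feel for why video is so much heavier than stills, here is a quick Python sketch using a crude proxy (total pixels held at once); the helper is purely illustrative and ignores model weights, latents, and attention buffers, which make real VRAM use much larger:

```python
# Crude proxy, not an official formula: a video model processes every frame of
# the clip together, so its working set scales with width * height * frames,
# while a still-image render only ever holds one frame at a time.
# Real VRAM use also includes model weights and attention buffers.

def clip_pixels(width: int, height: int, frames: int) -> int:
    return width * height * frames

still = clip_pixels(1024, 1024, 1)
configs = [
    ("one 1024x1024 still",       1024, 1024, 1),
    ("1024x1024, 15-frame clip",  1024, 1024, 15),
    ("480p (480x832), 15 frames",  832,  480, 15),
]
for label, w, h, frames in configs:
    n = clip_pixels(w, h, frames)
    print(f"{label}: {n / still:.1f}x the working set of a single 1024 still")
```

The numbers are only relative, but they show the two levers you have: fewer frames per clip and a lower resolution both shrink what has to sit in VRAM at once.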
Check your GPU's VRAM usage in the Windows Task Manager while rendering.
The ComfyUI backend is usually smart enough to free up VRAM after each step.
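If you prefer watching this from a terminal instead of the Task Manager, a minimal polling sketch like the one below works on NVIDIA cards (it assumes nvidia-smi is on your PATH; run it in a separate console while AI Render is busy):

```python
# Minimal VRAM-polling sketch (assumes an NVIDIA GPU and nvidia-smi on PATH).
# If "used" climbs to the card's total and the render suddenly slows to a
# crawl, the model has most likely spilled into shared system RAM.

import subprocess
import time

def vram_usage_mib() -> tuple[int, int]:
    """Return (used, total) VRAM in MiB for GPU 0, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = out.strip().splitlines()[0].split(", ")
    return int(used), int(total)

while True:
    used, total = vram_usage_mib()
    print(f"VRAM: {used} / {total} MiB ({100 * used / total:.0f}%)")
    time.sleep(5)
```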
The resolution, frame rate, and maximum number of frames are baked in by the creators of the video models during training (in Wan's case, that's Alibaba). They cannot be changed easily without degrading the result.
For example, Wan2.1-VACE-1.3B is trained for 480p (480x832) and 81 frames of total video length at 16 fps.
Wan2.1-VACE-14B supports both 480p and 720p (720x1280) at 81 frames of total video length at 16 fps, but it needs a lot more VRAM.
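If it helps, here is a tiny sanity-check sketch built only from the preset numbers quoted above (treat them as this thread's summary and verify against the official Wan model cards); it flags settings that stray from what the model was trained on:

```python
# Trained presets as stated in this thread -- verify against the official
# Wan model cards before relying on them.
TRAINED_PRESETS = {
    "Wan2.1-VACE-1.3B": {"resolutions": [(480, 832)], "max_frames": 81, "fps": 16},
    "Wan2.1-VACE-14B":  {"resolutions": [(480, 832), (720, 1280)], "max_frames": 81, "fps": 16},
}

def check_settings(model: str, height: int, width: int, frames: int) -> list[str]:
    """Return warnings if the requested settings stray from the trained preset."""
    preset = TRAINED_PRESETS[model]
    warnings = []
    if (height, width) not in preset["resolutions"]:
        warnings.append(f"{height}x{width} is outside the trained resolutions "
                        f"{preset['resolutions']}; expect degraded results.")
    if frames > preset["max_frames"]:
        warnings.append(f"{frames} frames exceeds the trained maximum "
                        f"of {preset['max_frames']}.")
    return warnings

# Example: the 1024-resolution, 15-frame clip from the original post.
for warning in check_settings("Wan2.1-VACE-1.3B", 1024, 1024, 15):
    print("WARNING:", warning)
```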
Generate a simple base animation in iClone, then ingest that sequence/video into a service like Runway for conversion. If iClone could automate this handoff, the feature would become practically usable. Providing it as a cloud subscription service—instead of local rendering—would also be fine.