Alibaba Releases Open-Source Wan 2.1 Suite of AI Video Generation Models, Claimed to Outperform OpenAI’s Sora

Alibaba released a suite of artificial intelligence (AI) video generation models on Wednesday. Dubbed Wan 2.1, these are open-source models that can be used for both academic and commercial purposes. The Chinese e-commerce giant released the models in several parameter-based variants. Developed by the company’s Wan team, the models were first announced in January, and the company claimed that Wan 2.1 can generate highly realistic videos. Currently, the models are hosted on the AI and machine learning (ML) hub Hugging Face.

Alibaba Introduces Wan 2.1 Video Generation Models

The new Alibaba video AI models are hosted on the Wan team’s Hugging Face page. The model pages also describe the Wan 2.1 suite of video generation models. There are four models in total: T2V-1.3B, T2V-14B, I2V-14B-720P, and I2V-14B-480P. T2V is short for text-to-video, while I2V stands for image-to-video.

The researchers claim that the smallest variant, Wan 2.1 T2V-1.3B, can run on a consumer-grade GPU with as little as 8.19GB of VRAM. As per the post, the AI model can generate a five-second video at 480p resolution on an Nvidia RTX 4090 in about four minutes.
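To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only restates the numbers quoted above; the helper names and the example GPU sizes are hypothetical, and the VRAM check is a simple comparison that ignores memory used by the operating system or other programs.

```python
# Figures quoted in the article for Wan 2.1 T2V-1.3B.
MIN_VRAM_GB = 8.19            # minimum VRAM claimed by the researchers
CLIP_SECONDS = 5              # length of the generated 480p clip
GENERATION_SECONDS = 4 * 60   # "about four minutes" on an Nvidia RTX 4090


def fits_in_vram(gpu_vram_gb: float, required_gb: float = MIN_VRAM_GB) -> bool:
    """Return True if the card has at least the claimed minimum VRAM."""
    return gpu_vram_gb >= required_gb


def compute_seconds_per_video_second(generation_s: float = GENERATION_SECONDS,
                                     clip_s: float = CLIP_SECONDS) -> float:
    """Seconds of generation time needed per second of finished video."""
    return generation_s / clip_s


if __name__ == "__main__":
    # The GPU sizes below are illustrative inputs, not taken from the article.
    for vram_gb in (8, 12, 24):
        print(f"{vram_gb} GB card meets the claimed minimum: {fits_in_vram(vram_gb)}")
    print(f"Generation time per second of video: "
          f"{compute_seconds_per_video_second():.0f} s")
```

On the quoted numbers, the model needs roughly 48 seconds of generation time for every second of video, and a nominal 8GB card falls just short of the claimed minimum.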

While the Wan 2.1 suite is aimed at video generation, the models can also perform other functions such as image generation, video-to-audio generation, and video editing. However, the currently open-sourced models are not capable of these advanced tasks. For video generation, they accept text prompts in Chinese and English as well as image inputs.

Coming to the architecture, the researchers revealed that the Wan 2.1 models are built on a diffusion transformer architecture. However, the company extended the base architecture with new variational autoencoders (VAE), training strategies, and more.

Most notably, the AI models use a new 3D causal VAE architecture dubbed Wan-VAE. It improves spatiotemporal compression and reduces memory usage. The autoencoder can encode and decode unlimited-length 1080p resolution videos without losing historical temporal information. This enables consistent video generation.
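For readers unfamiliar with the term, "causal" here means that the representation of a frame depends only on that frame and the ones before it. The toy Python sketch below illustrates the idea on a one-dimensional signal with one number per frame. It is not Alibaba's implementation: Wan-VAE works on full video tensors with learned weights, and the function names, the fixed kernel, and the chunk-with-history scheme are all assumptions made for illustration.

```python
from typing import List, Sequence


def causal_temporal_filter(frames: Sequence[float],
                           kernel: Sequence[float]) -> List[float]:
    """Filter a per-frame signal so that output[t] uses only frames[0..t].

    kernel[0] weights the current frame, kernel[1] the previous one, and so on.
    Frames before the start of the clip are treated as zeros.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # never look ahead, never index before frame 0
                acc += weight * frames[t - k]
        out.append(acc)
    return out


def filter_in_chunks(frames: Sequence[float],
                     kernel: Sequence[float],
                     chunk_size: int) -> List[float]:
    """Process the clip chunk by chunk, carrying a short history between chunks."""
    keep = len(kernel) - 1        # number of past frames the kernel can reach
    history: List[float] = []     # trailing frames of everything seen so far
    out: List[float] = []
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        padded = history + chunk
        filtered = causal_temporal_filter(padded, kernel)
        out.extend(filtered[len(history):])   # keep outputs for the new frames only
        history = padded[-keep:] if keep > 0 else []
    return out


if __name__ == "__main__":
    clip = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.5, 0.3, 0.2]
    # Chunked processing matches a single pass over the whole clip.
    print(filter_in_chunks(clip, kernel, chunk_size=3)
          == causal_temporal_filter(clip, kernel))
```

Because each output looks only backwards, a long clip can be processed in fixed-size pieces while keeping a small amount of history. In this toy setting, memory stays bounded regardless of clip length and the chunked result matches a single pass over the whole clip.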

Based on internal testing, the company claimed that the Wan 2.1 models outperform OpenAI’s Sora AI model in consistency, scene generation quality, single object accuracy, and spatial positioning.

These models are available under the Apache 2.0 licence. While it allows unrestricted usage for academic and research purposes, commercial usage comes with multiple restrictions.
