Releases: PKU-YuanGroup/Open-Sora-Plan

Release v1.2.0

25 Jul 06:28
adb2a20

v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D, and we are releasing a true 3D video diffusion model trained on 4s 720p video.

  • Shift the architecture from a 2+1D model to 3D full attention; 2+1D is no longer supported.
  • Instead of joint image-video training, train the image weights first and use them as the initialization for the video model.
  • Release all data annotations; the data are filtered by aesthetics and motion.
  • Improve CausalVideoVAE performance and report results on the validation sets of WebVid and Panda70M.

Although the 3D attention architecture excels in spatio-temporal consistency, it is so expensive to train that it is difficult to scale up. We hope to collaborate with the open-source community to optimize the 3D DiT architecture. For further details, please refer to our report.
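To make the cost claim concrete, here is a rough sketch that counts query-key pairs per attention layer for the two designs. It assumes a 2+1D block runs one spatial attention within each frame plus one temporal attention at each spatial position, and that full 3D attention attends over all tokens at once. The function names and the example sizes are illustrative assumptions, not values taken from the Open-Sora-Plan code or report.

```python
def attention_pairs_2plus1d(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for factorized (2+1D) attention.

    Spatial attention: each frame attends within itself.
    Temporal attention: each spatial position attends across frames.
    """
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


def attention_pairs_full_3d(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    total_tokens = frames * tokens_per_frame
    return total_tokens ** 2


def cost_ratio(frames: int, tokens_per_frame: int) -> float:
    """How many times more pairs full 3D attention computes than 2+1D."""
    return attention_pairs_full_3d(frames, tokens_per_frame) / attention_pairs_2plus1d(
        frames, tokens_per_frame
    )


if __name__ == "__main__":
    # Hypothetical latent sizes, chosen only to show the trend.
    for frames, tokens in [(4, 10), (16, 1024), (32, 4096)]:
        print(frames, tokens, round(cost_ratio(frames, tokens), 2))
```

Under these assumptions the ratio simplifies to frames * tokens_per_frame / (frames + tokens_per_frame), so the gap widens as clips get longer or resolution goes up, which is consistent with the scaling difficulty described above.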

Release v1.1.0

27 May 10:02
2a8b232
  • Support longer videos and dynamic-resolution training and inference.
  • Support training and inference on Ascend.
  • Release all training data and annotations.
  • Improve CausalVideoVAE performance.

In this version, we use ShareGPT4Video for video annotation and train the model on 3k hours of video data. The resulting model improves in both video quality and duration. For further details, please refer to our report.

Release v1.0.0

09 Apr 06:43
  • Add text-conditional control for video generation.
  • Support HUAWEI NPU in the hw branch.
  • Release all training data and annotations.
  • Add training and sampling scripts.
  • Add CausalVideoVAE training details.

We trained all models on 40K videos crawled from the web, most of which are landscape-related content. The complete training process takes about 2048 GPU hours. More detailed changes can be found in our report.

We hope this release benefits the community and makes text-to-video models more accessible.