👋 join us on WeChat
The MiniSora open-source community is a community-driven initiative organized spontaneously by its members. It aims to explore the implementation path and future development direction of Sora.
- Regular round-table discussions will be held with the Sora team and the community to explore possibilities.
- We will delve into existing technological pathways for video generation.
- We will lead the replication of papers or research results related to Sora, such as DiT (MiniSora-DiT).
- We will conduct a comprehensive review of Sora-related technologies and their implementations, i.e., "From DDPM to Sora: A Review of Video Generation Models Based on Diffusion Models".
- Movie Gen: A Cast of Media Foundation Models
- Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- MiniSora-DiT: Reproducing the DiT Paper with XTuner
- Introduction of MiniSora and Latest Progress in Replicating Sora
- GPU-Friendly: Ideally, it should have low requirements for GPU memory size and the number of GPUs; for example, training and inference should be feasible on 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G.
- Training-Efficiency: It should achieve good results without requiring extensive training time.
- Inference-Efficiency: Generated videos do not need to be long or high-resolution; 3-10 seconds in length at 480p resolution is acceptable.
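To give a rough sense of what the 3-10 second, 480p target means for a DiT-style model, the sketch below estimates how many latent tokens such a clip would produce. Every number other than the duration and the 480p height is an assumption made for illustration only: 24 fps, an 854x480 frame, a VAE with 8x spatial and 4x temporal compression, and a 2x2 spatial patch. The function name `video_token_count` is hypothetical and is not part of any MiniSora code.

```python
import math


def video_token_count(seconds, fps=24, height=480, width=854,
                      spatial_down=8, temporal_down=4, patch=2):
    """Estimate the transformer sequence length for one video clip.

    All defaults except the 480p height are illustrative assumptions,
    not values taken from the MiniSora project.
    """
    frames = int(seconds * fps)
    # Size of the latent video after the (assumed) VAE compression.
    latent_t = math.ceil(frames / temporal_down)
    latent_h = math.ceil(height / spatial_down)
    latent_w = math.ceil(width / spatial_down)
    # Each latent frame is cut into patch x patch tiles, one token per tile.
    tokens_per_latent_frame = math.ceil(latent_h / patch) * math.ceil(latent_w / patch)
    return latent_t * tokens_per_latent_frame


if __name__ == "__main__":
    for seconds in (3, 10):
        print(f"{seconds:>2} s clip -> {video_token_count(seconds)} tokens")
```

Under these assumptions the sequence length grows linearly with clip duration, so the 10-second case is the one that determines memory requirements on the GPUs listed above.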
MiniSora-DiT: Reproducing the DiT Paper with XTuner
https://github.com/mini-sora/minisora-DiT
We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.
We hope community members have the following characteristics:
- Familiarity with the OpenMMLab MMEngine mechanism.
- Familiarity with DiT.

- The author of DiT is the same as the author of Sora.
- XTuner has the core technology to efficiently train sequences of length 1000K.
Speaker: MMagic Core Contributors
Live Streaming Time: 03/12 20:00
Highlights: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.
PPT: FeiShu Link
ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models
- Technical Report: Video generation models as world simulators
- Latte: Latte: Latent Diffusion Transformer for Video Generation
- Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
- Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Updating...
- 01 Diffusion Model
- 02 Diffusion Transformer
- 03 Baseline Video Generation Models
- 04 Diffusion UNet
- 05 Video Generation
- 06 Dataset
- 6.1 Public Datasets
- 6.2 Video Augmentation Methods
- 6.2.1 Basic Transformations
- 6.2.2 Feature Space
- 6.2.3 GAN-based Augmentation
- 6.2.4 Encoder/Decoder Based
- 6.2.5 Simulation
- 07 Patchifying Methods
- 08 Long-context
- 09 Audio Related Resource
- 10 Consistency
- 11 Prompt Engineering
- 12 Security
- 13 World Model
- 14 Video Compression
- 15 Mamba
- 16 Existing high-quality resources
- 17 Efficient Training
- 17.1 Parallelism based Approach
- 17.1.1 Data Parallelism (DP)
- 17.1.2 Model Parallelism (MP)
- 17.1.3 Pipeline Parallelism (PP)
- 17.1.4 Generalized Parallelism (GP)
- 17.1.5 ZeRO Parallelism (ZP)
- 17.2 Non-parallelism based Approach
- 17.2.1 Reducing Activation Memory
- 17.2.2 CPU-Offloading
- 17.2.3 Memory Efficient Optimizer
- 17.3 Novel Structure
- 18 Efficient Inference
- 18.1 Reduce Sampling Steps
- 18.1.1 Continuous Steps
- 18.1.2 Fast Sampling
- 18.1.3 Step distillation
- 18.2 Optimizing Inference
- 18.2.1 Low-bit Quantization
- 18.2.2 Parallel/Sparse inference
Paper | Link |
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models | ICML 21 Paper, Github |
10) Classifier-free diffusion guidance | NeurIPS 21 Paper |
11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models | Paper, Github |
12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | CVPR 22 Paper, Github |
13) Diffusion Models for Medical Anomaly Detection | Paper, Github |
14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | Paper |
15) DiffusionDet: Diffusion Model for Object Detection | ICCV 23 Paper, Github |
16) Label-efficient semantic segmentation with diffusion models | ICLR 22 Paper, Github, Project |
Paper | Link |
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | ArXiv 23, GitHub, ModelScope |
4) FiT: Flexible Vision Transformer for Diffusion Model | ArXiv 24, GitHub |
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | ArXiv 24, GitHub |
6) Large-DiT: Large Diffusion Transformer | GitHub |
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | ArXiv 24, GitHub |
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | ArXiv 24, Project |
10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | ArXiv 23, GitHub, ModelScope |
11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model | ArXiv 24 |
12) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | ArXiv 24, GitHub |
13) DDM: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | ArXiv 24 |
14) Autoregressive Image Generation without Vector Quantization | ArXiv 24, GitHub |
15) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ArXiv 24 |
Paper | Link |
1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
3) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, Github, Project, ModelScope |
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | ArXiv 23, GitHub |
5) Latte: Latent Diffusion Transformer for Video Generation | ArXiv 24, GitHub, Project, ModelScope |
Paper | Link |
1) Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, GitHub, Project |
2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
Paper | Link |
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | ArXiv 23, GitHub, ModelScope |
3) Imagen Video: High Definition Video Generation with Diffusion Models | ArXiv 22 |
4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
5) Adversarial Video Generation on Complex Datasets | Paper |
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | ArXiv 23, Project |
7) VideoGPT: Video Generation using VQ-VAE and Transformers | ArXiv 21, GitHub |
8) Video Diffusion Models | ArXiv 22, GitHub, Project |
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | ArXiv 23, Project, Blog |
11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | ArXiv 24, GitHub, Project |
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | ArXiv 23, GitHub |
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | ArXiv 23, GitHub |
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | ArXiv 24, GitHub |
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | ArXiv 22, GitHub |
24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub, Project |
25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | ICCV 23 Paper, Project |
26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | CVPR 23 Paper |
27) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
Dataset Name - Paper | Link |
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (70M Clips, 720P, Downloadable) | CVPR 24 Paper, Github, Project, ModelScope |
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation (10M Clips, 720P, Downloadable) | ArXiv 24, Github |
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset (70K Clips, 720P, Downloadable) | CVPR 23 Paper, Github, Project |
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation (130M Clips, 720P, Downloadable) | ArXiv 23, Github, Tool |
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (100M Clips, 720P, Downloadable) | CVPR 22 Paper, Github |
6) VideoCC - Learning Audio-Video Modalities from Image Captions (10.3M Clips, 720P, Downloadable) | ECCV 22 Paper, Github |
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models (180M Clips, 480P, Downloadable) | NeurIPS 21 Paper, Github, Project |
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (136M Clips, 240P, Downloadable) | ICCV 19 Paper, Github, Project |
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (13K Clips, 240P, Downloadable) | CVPR 12 Paper, Project |
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation (122K Clips, 240P, Downloadable) | ACL 11 Paper, Project |
11) Fashion-Text2Video - A human video dataset with rich label and text annotations (600 Videos, 480P, Downloadable) | ArXiv 23, Project |
12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M (5B Clips, Downloadable) | NeurIPS 22 Paper, Project |
13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time (20k videos, Downloadable) | Arxiv 17 Paper, Project |
14) MSR-VTT - A large-scale video benchmark for video understanding (10k Clips, Downloadable) | CVPR 16 Paper, Project |
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling (Downloadable) | Arxiv 16 Paper, Project |
16) Youku-mPLUG - First open-source large-scale Chinese video text dataset (Downloadable) | ArXiv 23, Project, ModelScope |
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (6.69M, Downloadable) | ArXiv 24, Github |
18) Pixabay100 - A video dataset collected from Pixabay (Downloadable) | Github |
19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites (Long Durations and Structured Captions) | ArXiv 21, Project, ModelScope |
20) MiraData(Mini-Sora Data): A Large-Scale Video Dataset with Long Durations and Structured Captions (10M video-text pairs) | Github, Project |
21) IDForge: A video dataset featuring scenes of people speaking. (300k Clips, Downloadable) | ArXiv 24, Github |
Three-stream CNNs for action recognition | PRL 17 Paper |
Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | EL 19 Paper |
Intra-clip Aggregation for Video Person Re-identification | ICIP 20 Paper |
VideoMix: Rethinking Data Augmentation for Video Classification | CVPR 20 Paper |
mixup: Beyond Empirical Risk Minimization | ICLR 17 Paper |
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | ICCV 19 Paper |
Video Salient Object Detection via Fully Convolutional Networks | ICIP 18 Paper |
Illumination-Based Data Augmentation for Robust Background Subtraction | SKIMA 19 Paper |
Image editing-based data augmentation for illumination-insensitive background subtraction | EIM 20 Paper |
Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | ACM 18 Paper |
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 21 Paper |
Deep Video-Based Performance Cloning | CVPR 18 Paper |
Adversarial Action Data Augmentation for Similar Gesture Action Recognition | IJCNN 19 Paper |
Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | MM 20 Paper |
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 20 Paper |
Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | TPAMI 20 Paper |
CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | TPAMI 22 Paper |
Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | ECCV 20 Paper |
Autoencoder-based Data Augmentation for Deepfake Detection | ACM 23 Paper |
A data augmentation methodology for training machine/deep learning gait recognition algorithms | CVPR 16 Paper |
ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications | IEEE 21 Paper |
Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | CVPR 19 Paper |
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | IJCV 19 Paper |
Using synthetic data for person tracking under adverse weather conditions | IVC 21 Paper |
Unlimited Road-scene Synthetic Annotation (URSA) Dataset | ITSC 18 Paper |
SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | CVPR 21 Paper |
Universal Semantic Segmentation for Fisheye Urban Driving Images | SMC 20 Paper |
Paper | Link |
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ICLR 21 Paper, Github |
2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, Github |
3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
6) FlexiViT: One Model for All Patch Sizes | Paper, Github |
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | ArXiv 23, Github |
8) VQ-VAE: Neural Discrete Representation Learning | Paper, Github |
9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, Github |
10) LVT: Latent Video Transformer | Paper, Github |
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | ArXiv 21, GitHub |
12) Predicting Video with VQVAE | ArXiv 21 |
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, Github |
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, Github |
15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, Github |
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | ICML 21 Paper, Github |
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ArXiv 22, Github |
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ArXiv 23, Github |
Paper | Link |
1) World Model on Million-Length Video And Language With RingAttention | ArXiv 24, GitHub |
2) Ring Attention with Blockwise Transformers for Near-Infinite Context | ArXiv 23, GitHub |
3) Extending LLMs' Context Window with 100 Samples | ArXiv 24, GitHub |
4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |
Paper | Link |
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | ArXiv 24, Github, Blog |
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset | NeurIPS 23 Paper, GitHub |
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | ArXiv 23, GitHub |
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | TPAMI 24 Paper, GitHub |
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | ICLR 24 Paper, GitHub |
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | ArXiv 23, GitHub |
9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation | TASLP 22 Paper |
10) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
11) AudioLDM: Text-to-audio generation with latent diffusion models | ICML 23 Paper, GitHub, Project, Huggingface |
12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining | ArXiv 23, GitHub, Project, Huggingface |
13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | ArXiv 23 |
15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | ArXiv 23, GitHub, Project, Huggingface |
16) AudioLM: a Language Modeling Approach to Audio Generation | ArXiv 22 |
17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | ArXiv 23, GitHub |
18) MusicGen: Simple and Controllable Music Generation | NeurIPS 23 Paper, GitHub |
19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | ArXiv 23 |
20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 24 Paper |
21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | EMNLP 23 Paper |
22) Audio-Visual LLM for Video Understanding | ArXiv 23 |
23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
24) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
Paper | Link |
1) Consistency Models | Paper, GitHub |
2) Improved Techniques for Training Consistency Models | ArXiv 23 |
3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) | ICLR 21 Paper, GitHub, Blog |
4) Improved Techniques for Training Score-Based Generative Models | NeurIPS 20 Paper, GitHub |
5) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper, GitHub |
6) Maximum Likelihood Training of Score-Based Diffusion Models | NeurIPS 21 Paper, GitHub |
7) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
10) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
12) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | ArXiv 21 |
16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | ArXiv 23 |
19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | ArXiv 24 |
20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask | ArXiv 23 |
Paper | Link |
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | ArXiv 24, GitHub, Project |
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | ArXiv 24, GitHub |
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
5) Progressive Text-to-Image Diffusion with Soft Latent Direction | ArXiv 23 |
6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | ArXiv 24, GitHub |
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | ArXiv 23, GitHub |
11) Controllable Text-to-Image Generation with GPT-4 | ArXiv 23 |
12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | ArXiv 23 |
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | ArXiv 23, Github, Project |
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | ArXiv 24 |
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | ArXiv 23 |
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | ArXiv 23 |
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | ArXiv 23 |
20) Multimodal Procedural Planning via Dual Text-Image Prompting | ArXiv 23, Github |
21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | ICLR 24 Paper, Github |
22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | ArXiv 23 |
23) TaleCrafter: Interactive Story Visualization with Multiple Characters | SIGGRAPH Asia 23 Paper |
24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | ArXiv 23, Github |
25) COLE: A Hierarchical Generation Framework for Graphic Design | ArXiv 23 |
26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | ArXiv 23 |
27) Vlogger: Make Your Dream A Vlog | CVPR 24 Paper, Github |
28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | Paper |
29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | ArXiv 24 |
Paper | Link |
1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub |
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation | ArXiv 23, GitHub |
3) CoCa: Contrastive Captioners are Image-Text Foundation Models | ArXiv 22, Github |
4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion | ArXiv 24 |
5) VideoChat: Chat-Centric Video Understanding | CVPR 24 Paper, Github |
6) De-Diffusion Makes Text a Strong Cross-Modal Interface | ArXiv 23 |
7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ArXiv 23 |
8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | ArXiv 24 |
9) LLMGA: Multimodal Large Language Model based Generation Assistant | ArXiv 23, Github |
10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
11) MyVLM: Personalizing VLMs for User-Specific Queries | ArXiv 24 |
12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation | ArXiv 23, Github |
13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (-) | ArXiv 24, Github |
14) FlexCap: Generating Rich, Localized, and Flexible Captions in Images | ArXiv 24 |
15) Video ReCap: Recursive Captioning of Hour-Long Videos | ArXiv 24, Github |
16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ICML 22, Github |
17) PromptCap: Prompt-Guided Task-Aware Image Captioning | ICCV 23, Github |
18) CIC: A framework for Culturally-aware Image Captioning | ArXiv 24 |
19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion | ArXiv 24 |
20) FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | WACV 24, Github |
Paper | Link |
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Github |
2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
6) Ablating concepts in text-to-image diffusion models | ICCV 23 Paper |
7) Diffusion art or digital forgery? investigating data replication in diffusion models | ICCV 23 Paper, Project |
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | ICCV 20 Paper |
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
13) Ablating Concepts in Text-to-Image Diffusion Models | ICCV 23 Paper, Project |
14) BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Project |
15) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, Github |
18) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, Github |
20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, Github |
21) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
22) Diffusion Model Alignment Using Direct Preference Optimization | ArXiv 23 |
23) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | TMLR 23 Paper, Github |
24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | Paper, Github, Project |
Paper | Link |
1) NExT-GPT: Any-to-Any Multimodal LLM | ArXiv 23, GitHub |
Paper | Link |
1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
3) H.263: Video coding for low bit rate communication | Paper |
4) H.264: Overview of the H.264/AVC video coding standard | Paper |
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, Github |
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, Github |
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, Github |
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, Github |
13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, Github |
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, Github |
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, Github |
16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, Github |
17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, Github |
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, Github |
Paper | Link |
1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ArXiv 23, Github |
2) Efficiently Modeling Long Sequences with Structured State Spaces | ICLR 22 Paper, Github |
3) Modeling Sequences with Structured State Spaces | Paper |
4) Long Range Language Modeling via Gated State Spaces | ArXiv 22, GitHub |
Paper | Link |
1) Diffusion Models Without Attention | ArXiv 23 |
2) Pan-Mamba: Effective Pan-Sharpening with State Space Model | ArXiv 24, GitHub |
3) Pretraining Without Attention | ArXiv 22, GitHub |
4) Block-State Transformers | NeurIPS 23 Paper |
5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | ArXiv 24, GitHub |
6) VMamba: Visual State Space Model | ArXiv 24, GitHub |
7) ZigMa: Zigzag Mamba Diffusion Model | ArXiv 24, GitHub |
8) MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ArXiv 24, GitHub |
Paper | Link |
1) Long Movie Clip Classification with State-Space Video Models | ECCV 22 Paper, GitHub |
2) Selective Structured State-Spaces for Long-Form Video Understanding | CVPR 23 Paper |
3) Efficient Movie Scene Detection Using State-Space Transformers | CVPR 23 Paper, GitHub |
4) VideoMamba: State Space Model for Efficient Video Understanding | Paper, GitHub |
Paper | Link |
1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining | ArXiv 24, GitHub |
2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model | ArXiv 24, GitHub |
3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | ArXiv 24, GitHub |
Resources | Link |
1) Datawhale - AI Video Generation Learning | Feishu doc |
2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | ArXiv 23, GitHub |
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
5) video-generation-survey: A reading list of video generation | GitHub |
6) Awesome-Video-Diffusion | GitHub |
7) Video Generation Task in Papers With Code | Task |
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | ArXiv 24, GitHub |
9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
10) State of the Art on Diffusion Models for Visual Computing | Paper |
11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models | Paper |
13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
14) Efficient Diffusion Models for Vision: A Survey | Paper |
15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
16) Awesome-Diffusion-Transformers | GitHub, Project |
17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
20) Awesome-Long-Context | GitHub1, GitHub2 |
21) Lite-Sora | GitHub |
22) Mira: A Mini-step Towards Sora-like Long Video Generation | GitHub, Project |
1) A bridging model for parallel computation | Paper |
2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | VLDB 20 Paper |
1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ArXiv 19 Paper |
2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | PMLR 21 Paper |
1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 Paper |
2) PipeDream: generalized pipeline parallelism for DNN training | SOSP 19 Paper |
1) Mesh-TensorFlow: Deep Learning for Supercomputers | ArXiv 18 Paper |
2) Beyond Data and Model Parallelism for Deep Neural Networks | MLSys 19 Paper |
1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ArXiv 20 |
2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | ACM 20 Paper |
3) ZeRO-Offload: Democratizing Billion-Scale Model Training | ArXiv 21 |
4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | ArXiv 23 |
1) Gist: Efficient Data Encoding for Deep Neural Network Training | IEEE 18 Paper |
2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | MLSys 20 Paper |
3) Training Deep Nets with Sublinear Memory Cost | ArXiv 16 Paper |
4) Superneurons: dynamic GPU memory management for training deep neural networks | ACM 18 Paper |
1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | ArXiv 20 Paper |
2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | IEEE 16 Paper |
1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | PMLR 18 Paper |
2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | Paper |
1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, GitHub
1) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper |
2) WaveGrad: Estimating Gradients for Waveform Generation | ArXiv 20 |
3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders | ICASSP 21 Paper |
4) Noise Estimation for Generative Diffusion Models | ArXiv 21 |
1) Denoising Diffusion Implicit Models | ICLR 21 Paper |
2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 21 Paper |
3) On Fast Sampling of Diffusion Probabilistic Models | ArXiv 21 |
4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | NeurIPS 22 Paper |
5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | ArXiv 22 |
6) Fast Sampling of Diffusion Models with Exponential Integrator | ICLR 22 Paper |
1) On Distillation of Guided Diffusion Models | CVPR 23 Paper |
2) Progressive Distillation for Fast Sampling of Diffusion Models | ICLR 22 Paper |
3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | NeurIPS 23 Paper |
4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | ICLR 22 Paper |
1) Q-Diffusion: Quantizing Diffusion Models | CVPR 23 Paper |
2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | NeurIPS 23 Paper |
3) Temporal Dynamic Quantization for Diffusion Models | NeurIPS 23 Paper |
1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | CVPR 24 Paper |
2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | NeurIPS 22 Paper |
3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | ArXiv 24 |
If this project is helpful to your work, please cite it using the following format:
@misc{minisora,
title={MiniSora},
author={MiniSora Community},
url={https://github.com/mini-sora/minisora},
year={2024}
}
@misc{minisora_survey,
title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
author={Survey Paper Group of MiniSora Community},
url={https://github.com/mini-sora/minisora},
year={2024}
}
We greatly appreciate your contributions to the MiniSora open-source community and your help in making it even better!
For more details, please refer to the Contribution Guidelines.