Mini Sora Community

Contributors Forks Issues MIT License Stargazers

 

English | 简体中文

👋 Join our WeChat community

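The MiniSora open-source community is organized spontaneously by its members. MiniSora aims to explore how Sora can be implemented and where it may develop next:

  • Hold regular round-table discussions on Sora to explore possibilities together with the community
  • Discuss existing technical approaches to video generation
  • Lead the reproduction of papers and research results related to Sora, such as DiT (MiniSora-DiT)
  • Conduct a survey of the core technologies and implementations related to Sora, titled "From DDPM to Sora: A Survey of Diffusion Model-based Video Generation Models"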

Recent Updates

empty

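MiniSora's Goals for Reproducing Sora

  1. GPU-Friendly: Ideally, requirements on GPU memory size and GPU count are low, so that training and inference can run on hardware such as 8x A100 80G, 8x A6000 48G, or RTX 4090 24G
  2. Training-Efficiency: Good results should be achievable without long training
  3. Inference-Efficiency: Generated videos do not need to be long or high-resolution at inference time; for example, 3-10 s at 480p is acceptable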

MiniSora-DiT: Reproducing the DiT Paper with XTuner

https://github.com/mini-sora/minisora-DiT

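Requirements

We are recruiting MiniSora community members to reproduce DiT with XTuner. We hope those who take on the task have the following background:

  1. Familiarity with the OpenMMLab MMEngine mechanism
  2. Familiarity with DiT

Background

  1. DiT and Sora share the same author
  2. XTuner has core technology for efficiently training on sequences of length 1000K

Support

  1. Compute: 2x A100 provided
  2. XTuner core developer @pppppM will provide strong support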

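Recent Round-table Discussions

Paper Interpretation of Stable Diffusion 3 (MM-DiT)

Speaker: MMagic core contributors

Live stream time: 03/12 20:00

Highlights: MMagic core contributors will walk us through the Stable Diffusion 3 paper, covering its architecture details and design principles.

PPT: Feishu link

Highlights from Previous Discussions

Night Talk with Sora: Video Diffusion Overview

Zhihu Notes: A Survey on Generative Diffusion Model

Call for Paper Reading Presenters

Related Work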

01 Diffusion Model

Paper Link
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis NeurIPS 21 Paper, Github
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models CVPR 22 Paper, Github
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models NeurIPS 22 Paper, Github
4) DDPM: Denoising Diffusion Probabilistic Models NeurIPS 20 Paper, Github
5) DDIM: Denoising Diffusion Implicit Models ICLR 21 Paper, Github
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations ICLR 21 Paper, Github, Blog
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models ICLR 24 Paper, Github, Blog
8) Diffusion Models in Vision: A Survey TPAMI 23 Paper, GitHub
9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models ICML 21 Paper, Github
10) Classifier-Free Diffusion Guidance NeurIPS 21 Paper
11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models Paper, Github
12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation CVPR 22 Paper, Github
13) Diffusion Models for Medical Anomaly Detection Paper, Github
14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems Paper
15) DiffusionDet: Diffusion Model for Object Detection ICCV 23 Paper, Github
16) Label-efficient semantic segmentation with diffusion models ICLR 22 Paper, Github, Project

02 Diffusion Transformer

Paper Link
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models CVPR 23 Paper, Github, ModelScope
2) DiT: Scalable Diffusion Models with Transformers ICCV 23 Paper, Github, Project, ModelScope
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers ArXiv 23, Github, ModelScope
4) FiT: Flexible Vision Transformer for Diffusion Model ArXiv 24, Github
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers ArXiv 24, Github
6) Large-DiT: Large Diffusion Transformer Github
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks ArXiv 24, Github
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Paper, Blog
9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation ArXiv 24, Project
10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis ArXiv 23, GitHub, ModelScope
11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model ArXiv 24
12) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers ArXiv 24, GitHub
13) DDM: Deconstructing Denoising Diffusion Models for Self-Supervised Learning ArXiv 24
14) Autoregressive Image Generation without Vector Quantization ArXiv 24, GitHub
15) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model ArXiv 24

03 Baseline Video Generation Models

Paper Link
1) ViViT: A Video Vision Transformer ICCV 21 Paper, Github
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models CVPR 23 Paper
3) DiT: Scalable Diffusion Models with Transformers ICCV 23 Paper, Github, Project, ModelScope
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators ArXiv 23, Github
5) Latte: Latent Diffusion Transformer for Video Generation ArXiv 24, GitHub, Project, ModelScope

04 Diffusion UNet

Paper Link
1) Taming Transformers for High-Resolution Image Synthesis CVPR 21 Paper, GitHub, Project
2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment ArXiv 24, Github

05 Video Generation

Paper Link
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning ICLR 24 Paper, Github, ModelScope
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models ArXiv 23, Github, ModelScope
3) Imagen Video: High Definition Video Generation with Diffusion Models ArXiv 22
4) MoCoGAN: Decomposing Motion and Content for Video Generation CVPR 18 Paper
5) Adversarial Video Generation on Complex Datasets Paper
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models ArXiv 23, Project
7) VideoGPT: Video Generation using VQ-VAE and Transformers ArXiv 21, Github
8) Video Diffusion Models ArXiv 22, Github, Project
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation NeurIPS 22 Paper, Github, Project, Blog
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation ArXiv 23, Project, Blog
11) MAGVIT: Masked Generative Video Transformer CVPR 23 Paper, Github, Project, Colab
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions ArXiv 24, Github, Project
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation Paper, Github, Project
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing ICCV 23 Paper, Github, Project
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets Paper, Github
16) ADD: Adversarial Diffusion Distillation Paper, Github
17) GenTron: Diffusion Transformers for Image and Video Generation CVPR 24 Paper, Project
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models CVPR 23 Paper, Github
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models ArXiv 23, Github
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation Paper, Github
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation ArXiv 23, Github
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models ArXiv 24, Github
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation ArXiv 22, GitHub
24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models ArXiv 23, GitHub, Project
25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models ICCV 23 Paper, Project
26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation CVPR 23 Paper
27) Movie Gen: A Cast of Media Foundation Models Paper, Project

06 Dataset

6.1 Dataset Resources

Dataset Name - Paper Link
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
70M Clips, 720P, Downloadable
CVPR 24 Paper, Github, Project, ModelScope
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
10M Clips, 720P, Downloadable
ArXiv 24, Github
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset
70K Clips, 720P, Downloadable
CVPR 23 Paper, Github, Project
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
130M Clips, 720P, Downloadable
ArXiv 23, Github, Tool
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
100M Clips, 720P, Downloadable
CVPR 22 Paper, Github
6) VideoCC - Learning Audio-Video Modalities from Image Captions
10.3M Clips, 720P, Downloadable
ECCV 22 Paper, Github
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models
180M Clips, 480P, Downloadable
NeurIPS 21 Paper, Github, Project
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
136M Clips, 240P, Downloadable
ICCV 19 Paper, Github, Project
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
13K Clips, 240P, Downloadable
CVPR 12 Paper, Project
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation
122K Clips, 240P, Downloadable
ACL 11 Paper, Project
11) Fashion-Text2Video - A human video dataset with rich label and text annotations
600 Videos, 480P, Downloadable
ArXiv 23, Project
12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M
5B Clips, Downloadable
NeurIPS 22 Paper, Project
13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time
20k videos, Downloadable
Arxiv 17 Paper, Project
14) MSR-VTT - A large-scale video benchmark for video understanding
10k Clips, Downloadable
CVPR 16 Paper, Project
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling
Downloadable
Arxiv 16 Paper, Project
16) Youku-mPLUG - First open-source large-scale Chinese video text dataset
Downloadable
ArXiv 23, Project, ModelScope
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
6.69M, Downloadable
ArXiv 24, Github
18) Pixabay100 - A video dataset collected from Pixabay
Downloadable
Github
19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites
10M video-text pairs
ArXiv 21, Project, ModelScope
20) MiraData(Mini-Sora Data): A Large-Scale Video Dataset with Long Durations and Structured Captions
Long Durations and Structured Captions
Github, Project
21) IDForge: A video dataset featuring scenes of people speaking.
300k Clips, Downloadable
ArXiv 24, Github

6.2 Dataset Augmentation Methods

6.2.1 Basic Transformations
Three-stream CNNs for action recognition PRL 17 Paper
Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks EL 19 Paper
Intra-clip Aggregation for Video Person Re-identification ICIP 20 Paper
VideoMix: Rethinking Data Augmentation for Video Classification CVPR 20 Paper
mixup: Beyond Empirical Risk Minimization ICLR 17 Paper
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features ICCV 19 Paper
Video Salient Object Detection via Fully Convolutional Networks ICIP 18 Paper
Illumination-Based Data Augmentation for Robust Background Subtraction SKIMA 19 Paper
Image editing-based data augmentation for illumination-insensitive background subtraction EIM 20 Paper
6.2.2 Feature Space Augmentation
Feature Re-Learning with Data Augmentation for Content-based Video Recommendation ACM 18 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer Trans 21 Paper
6.2.3 GAN-based Augmentation
Deep Video-Based Performance Cloning CVPR 18 Paper
Adversarial Action Data Augmentation for Similar Gesture Action Recognition IJCNN 19 Paper
Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples MM 20 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer Trans 20 Paper
Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets TPAMI 20 Paper
CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond TPAMI 22 Paper
6.2.4 Encoder/Decoder-based Augmentation
Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video ECCV 20 Paper
Autoencoder-based Data Augmentation for Deepfake Detection ACM 23 Paper
6.2.5 Simulation
A data augmentation methodology for training machine/deep learning gait recognition algorithms CVPR 16 Paper
ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications IEEE 21 Paper
Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights CVPR 19 Paper
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models IJCV 19 Paper
Using synthetic data for person tracking under adverse weather conditions IVC 21 Paper
Unlimited Road-scene Synthetic Annotation (URSA) Dataset ITSC 18 Paper
SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data CVPR 21 Paper
Universal Semantic Segmentation for Fisheye Urban Driving Images SMC 20 Paper

07 Patchifying Methods

Paper Link
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale ICLR 21 Paper, Github
2) MAE: Masked Autoencoders Are Scalable Vision Learners CVPR 22 Paper, Github
3) ViViT: A Video Vision Transformer (-) ICCV 21 Paper, GitHub
4) DiT: Scalable Diffusion Models with Transformers (-) ICCV 23 Paper, GitHub, ModelScope
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) CVPR 23 Paper, GitHub, ModelScope
6) FlexiViT: One Model for All Patch Sizes Paper, Github
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution ArXiv 23, Github
8) VQ-VAE: Neural Discrete Representation Learning Paper, Github
9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis CVPR 21 Paper, Github
10) LVT: Latent Video Transformer Paper, Github
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) ArXiv 21, GitHub
12) Predicting Video with VQVAE ArXiv 21
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers ICLR 23 Paper, Github
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer ECCV 22 Paper, Github
15) MAGVIT: Masked Generative Video Transformer (-) CVPR 23 Paper, GitHub, Project, Colab
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation ICLR 24 Paper, Github
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) ArXiv 23, Project, Blog
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision ICML 21 Paper, Github
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ArXiv 22, Github
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models ArXiv 23, Github

08 Long-context

Paper Link
1) World Model on Million-Length Video And Language With RingAttention ArXiv 24, Github
2) Ring Attention with Blockwise Transformers for Near-Infinite Context ArXiv 23, Github
3) Extending LLMs' Context Window with 100 Samples ArXiv 24, Github
4) Efficient Streaming Language Models with Attention Sinks ICLR 24 Paper, Github
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey Paper
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding CVPR 24 Paper, Github, Project
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory Paper, GitHub

09 Audio Related Resource

Paper Link
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion ArXiv 24, Github, Blog
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation CVPR 23 Paper, GitHub
3) Pengi: An Audio Language Model for Audio Tasks NeurIPS 23 Paper, GitHub
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset NeurIPS 23 Paper, GitHub
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration ArXiv 23, GitHub
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality TPAMI 24 Paper, GitHub
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers ICLR 24 Paper, GitHub
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation ArXiv 23, GitHub
9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation TASLP 22 Paper
10) AudioGen: Textually Guided Audio Generation ICLR 23 Paper, Project
11) AudioLDM: Text-to-audio generation with latent diffusion models ICML 23 Paper, GitHub, Project, Huggingface
12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining ArXiv 23, GitHub, Project, Huggingface
13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models ICML 23 Paper, GitHub
14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation ArXiv 23
15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model ArXiv 23, GitHub, Project, Huggingface
16) AudioLM: a Language Modeling Approach to Audio Generation ArXiv 22
17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head ArXiv 23, GitHub
18) MusicGen: Simple and Controllable Music Generation NeurIPS 23 Paper, GitHub
19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT ArXiv 23
20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners CVPR 24 Paper
21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding EMNLP 23 Paper
22) Audio-Visual LLM for Video Understanding ArXiv 23
23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) ArXiv 23, Project, Blog
24) Movie Gen: A Cast of Media Foundation Models Paper, Project

10 Consistency

Paper Link
1) Consistency Models Paper, GitHub
2) Improved Techniques for Training Consistency Models ArXiv 23
3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) ICLR 21 Paper, GitHub, Blog
4) Improved Techniques for Training Score-Based Generative Models NeurIPS 20 Paper, GitHub
5) Generative Modeling by Estimating Gradients of the Data Distribution NeurIPS 19 Paper, GitHub
6) Maximum Likelihood Training of Score-Based Diffusion Models NeurIPS 21 Paper, GitHub
7) Layered Neural Atlases for Consistent Video Editing TOG 21 Paper, GitHub, Project
8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing ICCV 23 Paper, GitHub, Project
9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing Paper, GitHub, Project
10) Sora Generates Videos with Stunning Geometrical Consistency Paper, GitHub, Project
11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency ECCV 22 Paper, GitHub
12) Bootstrap Motion Forecasting With Self-Consistent Constraints ICCV 23 Paper
13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting Paper
14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment CVPRW 23 Paper, GitHub
15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing ArXiv 21
16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter TCSVT 23 Paper
17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking CVPRW 19 Paper
18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) ArXiv 23
19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) ArXiv 24
20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask ArXiv 23

11 Prompt Engineering

Paper Link
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models ArXiv 24, Github, Project
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs ArXiv 24, Github
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models TMLR 23 Paper, Github
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts ICLR 24 Paper, Github
5) Progressive Text-to-Image Diffusion with Soft Latent Direction ArXiv 23
6) Self-correcting LLM-controlled Diffusion Models CVPR 24 Paper, Github
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation MM 23 Paper
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models NeurIPS 23 Paper, Github
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition ArXiv 24, Github
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions ArXiv 23, Github
11) Controllable Text-to-Image Generation with GPT-4 ArXiv 23
12) LLM-grounded Video Diffusion Models ICLR 24 Paper
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning ArXiv 23
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax ArXiv 23, Github, Project
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM ArXiv 24
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator NeurIPS 23 Paper
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models ArXiv 23
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation ArXiv 23
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning ArXiv 23
20) Multimodal Procedural Planning via Dual Text-Image Prompting Paper, Github
21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists ICLR 24 Paper, Github
22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback Paper
23) TaleCrafter: Interactive Story Visualization with Multiple Characters SIGGRAPH Asia 23 Paper
24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis Paper, Github
25) COLE: A Hierarchical Generation Framework for Graphic Design Paper
26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision Paper
27) Vlogger: Make Your Dream A Vlog CVPR 24 Paper, Github
28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting Paper
29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion Paper

Recaption

Paper Link
1) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models ArXiv 23, GitHub
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation ArXiv 23, GitHub
3) CoCa: Contrastive Captioners are Image-Text Foundation Models ArXiv 22, Github
4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion ArXiv 24
5) VideoChat: Chat-Centric Video Understanding CVPR 24 Paper, Github
6) De-Diffusion Makes Text a Strong Cross-Modal Interface ArXiv 23
7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale ArXiv 23
8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data ArXiv 24
9) LLMGA: Multimodal Large Language Model based Generation Assistant ArXiv 23, Github
10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment ArXiv 24, Github
11) MyVLM: Personalizing VLMs for User-Specific Queries ArXiv 24
12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation ArXiv 23, Github
13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (-) ArXiv 24, Github
14) FlexCap: Generating Rich, Localized, and Flexible Captions in Images ArXiv 24
15) Video ReCap: Recursive Captioning of Hour-Long Videos ArXiv 24, Github
16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ICML 22, Github
17) PromptCap: Prompt-Guided Task-Aware Image Captioning ICCV 23, Github
18) CIC: A framework for Culturally-aware Image Captioning ArXiv 24
19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion ArXiv 24
20) FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions WACV 24, Github

12 Security

Paper Link
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset NeurIPS 23 Paper, Github
2) LIMA: Less Is More for Alignment NeurIPS 23 Paper
3) Jailbroken: How Does LLM Safety Training Fail? NeurIPS 23 Paper
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models CVPR 23 Paper
5) Stable Bias: Evaluating Societal Representations in Diffusion Models NeurIPS 23 Paper
6) Ablating concepts in text-to-image diffusion models ICCV 23 Paper
7) Diffusion art or digital forgery? investigating data replication in diffusion models ICCV 23 Paper, Project
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks ICCV 20 Paper
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks ICML 20 Paper
10) A pilot study of query-free adversarial attack against stable diffusion ICCV 23 Paper
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models ICCV 23 Paper
12) Erasing Concepts from Diffusion Models ICCV 23 Paper, Project
13) Ablating Concepts in Text-to-Image Diffusion Models ICCV 23 Paper, Project
14) BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset NeurIPS 23 Paper, Project
15) Stable Bias: Evaluating Societal Representations in Diffusion Models NeurIPS 23 Paper
16) Threat Model-Agnostic Adversarial Defense using Diffusion Models Paper
17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? Paper, Github
18) Differentially Private Diffusion Models Generate Useful Synthetic Images Paper
19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models SIGSAC 23 Paper, Github
20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models Paper, Github
21) Unified Concept Editing in Diffusion Models WACV 24 Paper, Project
22) Diffusion Model Alignment Using Direct Preference Optimization ArXiv 23
23) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment TMLR 23 Paper, Github
24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation Paper, Github, Project

13 World Model

Paper Link
1) NExT-GPT: Any-to-Any Multimodal LLM ArXiv 23, Github

14 Video Compression

Paper Link
1) H.261: Video codec for audiovisual services at p x 64 kbit/s Paper
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video Paper
3) H.263: Video coding for low bit rate communication Paper
4) H.264: Overview of the H.264/AVC video coding standard Paper
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard Paper
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications Paper
7) DVC: An End-to-end Deep Video Compression Framework CVPR 19 Paper, GitHub
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method Paper, GitHub
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement CVPR 20 Paper, Github
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model J-STSP 21 Paper, Github
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN IJCAI 22 Paper, Github
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction T-CSVT 22 Paper, Github
13) DCVC: Deep Contextual Video Compression NeurIPS 21 Paper, Github
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression TM 22 Paper, Github
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression MM 22 Paper, Github
16) DCVC-DC: Neural Video Compression with Diverse Contexts CVPR 23 Paper, Github
17) DCVC-FM: Neural Video Compression with Feature Modulation CVPR 24 Paper, Github
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression CVPR 20 Paper, Github

15 Mamba

15.1 Theoretical Foundations and Model Architecture

Paper Link
1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces ArXiv 23, Github
2) Efficiently Modeling Long Sequences with Structured State Spaces ICLR 22 Paper, Github
3) Modeling Sequences with Structured State Spaces Paper
4) Long Range Language Modeling via Gated State Spaces ArXiv 22, GitHub

15.2 Image Generation and Visual Applications

Paper Link
1) Diffusion Models Without Attention ArXiv 23
2) Pan-Mamba: Effective Pan-Sharpening with State Space Model ArXiv 24, Github
3) Pretraining Without Attention ArXiv 22, Github
4) Block-State Transformers NeurIPS 23 Paper
5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model ArXiv 24, Github
6) VMamba: Visual State Space Model ArXiv 24, Github
7) ZigMa: Zigzag Mamba Diffusion Model ArXiv 24, Github
8) MambaVision: A Hybrid Mamba-Transformer Vision Backbone ArXiv 24, GitHub

15.3 Video Processing and Understanding

Paper Link
1) Long Movie Clip Classification with State-Space Video Models ECCV 22 Paper, Github
2) Selective Structured State-Spaces for Long-Form Video Understanding CVPR 23 Paper
3) Efficient Movie Scene Detection Using State-Space Transformers CVPR 23 Paper, Github
4) VideoMamba: State Space Model for Efficient Video Understanding Paper, Github

15.4 Medical Image Processing

Paper Link
1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining ArXiv 24, Github
2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model ArXiv 24, Github
3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation ArXiv 24, Github

16 Existing High-Quality Resources

Resources Link
1) Datawhale - AI Video Generation Learning Feishu doc
2) A Survey on Generative Diffusion Model TKDE 24 Paper, Github
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models ArXiv 23, Github
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis Github
5) video-generation-survey: A reading list of video generation Github
6) Awesome-Video-Diffusion Github
7) Video Generation Task in Papers With Code Link
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models ArXiv 24, Github, Chinese translation
9) Open-Sora-Plan (PKU-YuanGroup) Github
10) State of the Art on Diffusion Models for Visual Computing Paper
11) Diffusion Models: A Comprehensive Survey of Methods and Applications CSUR 24 Paper, Github
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models Paper
13) On the Design Fundamentals of Diffusion Models: A Survey Paper
14) Efficient Diffusion Models for Vision: A Survey Paper
15) Text-to-Image Diffusion Models in Generative AI: A Survey Paper
16) Awesome-Diffusion-Transformers GitHub, Page
17) Open-Sora (HPC-AI Tech) GitHub, Blog
18) LAVIS - A Library for Language-Vision Intelligence ACL 23 Paper, GitHub, Page
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference Github
20) Awesome-Long-Context GitHub1, GitHub2
21) Lite-Sora GitHub
22) Mira: A Mini-step Towards Sora-like Long Video Generation GitHub, Project

17 Efficient Training

17.1 Parallel Training Methods

17.1.1 Data Parallelism (DP)
1) A bridging model for parallel computation Paper
2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training VLDB 20 Paper
17.1.2 Model Parallelism (MP)
1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism ArXiv 19 Paper
2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models PMLR 21 Paper
17.1.3 Pipeline Parallelism (PP)
1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism NeurIPS 19 Paper
2) PipeDream: generalized pipeline parallelism for DNN training SOSP 19 Paper
17.1.4 Generalized Parallelism (GP)
1) Mesh-TensorFlow: Deep Learning for Supercomputers ArXiv 18 Paper
2) Beyond Data and Model Parallelism for Deep Neural Networks MLSys 19 Paper
17.1.5 Zero Redundancy Parallelism (ZP)
1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models ArXiv 20
2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters ACM 20 Paper
3) ZeRO-Offload: Democratizing Billion-Scale Model Training ArXiv 21
4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel ArXiv 23

17.2 Non-parallel Training Methods

17.2.1 Reducing Activation Memory
1) Gist: Efficient Data Encoding for Deep Neural Network Training IEEE 18 Paper
2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization MLSys 20 Paper
3) Training Deep Nets with Sublinear Memory Cost ArXiv 16 Paper
4) Superneurons: dynamic GPU memory management for training deep neural networks ACM 18 Paper
17.2.2 CPU-Offloading
1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm ArXiv 20 Paper
2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design IEEE 16 Paper
17.2.3 Memory-Efficient Optimizers
1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost PMLR 18 Paper
2) Memory-Efficient Adaptive Optimization for Large-Scale Learning Paper

17.3 Novel Architectures

1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment ArXiv 24, Github

18 Efficient Inference

18.1 Reducing Sampling Steps

18.1.1 Continuous Steps
1) Generative Modeling by Estimating Gradients of the Data Distribution NeurIPS 19 Paper
2) WaveGrad: Estimating Gradients for Waveform Generation ArXiv 20
3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders ICASSP 21 Paper
4) Noise Estimation for Generative Diffusion Models ArXiv 21
18.1.2 Fast Sampling
1) Denoising Diffusion Implicit Models ICLR 21 Paper
2) DiffWave: A Versatile Diffusion Model for Audio Synthesis ICLR 21 Paper
3) On Fast Sampling of Diffusion Probabilistic Models ArXiv 21
4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps NeurIPS 22 Paper
5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models ArXiv 22
6) Fast Sampling of Diffusion Models with Exponential Integrator ICLR 22 Paper
18.1.3 Step Distillation
1) On Distillation of Guided Diffusion Models CVPR 23 Paper
2) Progressive Distillation for Fast Sampling of Diffusion Models ICLR 22 Paper
3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds NeurIPS 23 Paper
4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs ICLR 22 Paper

18.2 Optimizing Inference

18.2.1 Low-bit Quantization
1) Q-Diffusion: Quantizing Diffusion Models CVPR 23 Paper
2) Q-DM: An Efficient Low-bit Quantized Diffusion Model NeurIPS 23 Paper
3) Temporal Dynamic Quantization for Diffusion Models NeurIPS 23 Paper
18.2.2 Parallel/Sparse Inference
1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models CVPR 24 Paper
2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models NeurIPS 22 Paper
3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models ArXiv 24

Citation

If this project is helpful to your work, please cite it using the following format:

@misc{minisora,
    title={MiniSora},
    author={MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}
@misc{minisorasurvey,
    title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
    author={Survey Paper Group of MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}

Mini Sora WeChat Community Group

 

Star History

Star History Chart

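How to Contribute to the Mini Sora Community

We warmly welcome your contributions to the Mini Sora open-source community and your help in making it even better!

For details, please see the Contribution Guidelines

Community Contributors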