Welcome to Awesome-RL-for-Video-Generation! This repository is a curated collection of research, resources, and tools on Reinforcement Learning (RL) for video generation. It aims to give an up-to-date overview of RL techniques used in video generation, with a focus on recent advances, and to bridge the gap between RL theory and real-world video generation tasks. We hope it serves as a useful starting point for anyone exploring RL for video generation.
- [February 14, 2025] We have developed an agent that automatically collects and analyzes the latest papers in the RL-based Video Generation field. It will update the Related Papers daily at 1:00 AM UTC+8.
We are committed to offering researchers the latest advancements in the field. By regularly reviewing and evaluating recent research studies, we ensure that the list of papers stays up-to-date.
| Date | Paper | Contribution | Available Link |
| --- | --- | --- | --- |
| Feb 2026 | Unified Personalized Reward Model for Vision Generation | • Affiliation: Fudan University • Method Name: UnifiedReward-Flex, Base Model: Wan2.1-T2V-14B, Strategy: GRPO | |
| Feb 2026 | FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space | • Affiliation: ByteDance • Method Name: FSVideo, Base Model: Wan2.1-14B-I2V, Strategy: GRPO • Method Name: FSVideo, Base Model: Wan2.1-14B-I2V, Strategy: ReFL | |
| Feb 2026 | PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards | • Affiliation: Microsoft • Method Name: PISCES, Base Model: HunyuanVideo, Strategy: GRPO • Method Name: PISCES, Base Model: VideoCrafter2, Strategy: GRPO | |
| Jan 2026 | SketchDynamics: Exploring Free-Form Sketches for Dynamic Intent Expression in Animation Generation | • Affiliation: Zhejiang University • Method Name: RL-Video-Gen, Base Model: Qwen2-VL-7B, Strategy: GRPO • Benchmark Name: VideoGenBench, Data Number: 5000, Evaluation Metric: FID | |
| Jan 2026 | The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation | • Affiliation: Tencent Hunyuan Multimodal Department • Method Name: ScripterAgent, Base Model: Qwen-Omni-7B, Strategy: GRPO • Benchmark Name: ScriptBench, Data Number: 1750, Evaluation Metric: Visual-Script Alignment (VSA) | |
| Jan 2026 | SkyReels-V3 Technique Report | • Affiliation: Zhejiang University • Method Name: RL-Video-Gen, Base Model: Qwen2-VL-7B, Strategy: GRPO • Benchmark Name: VideoGenBench, Data Number: 5000, Evaluation Metric: FID | |
| Jan 2026 | A Mechanistic View on Video Generation as World Models: State and Dynamics | • Affiliation: Hong Kong University of Science and Technology (Guangzhou) • Paper Number: 188 | |
| Jan 2026 | From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models | • Affiliation: University of Oxford • Paper Number: 49 | |
| Jan 2026 | MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network | • Affiliation: Zhejiang University • Method Name: RL-Video-Gen, Base Model: Qwen2-VL-7B, Strategy: GRPO • Benchmark Name: VideoGenBench, Data Number: 5000, Evaluation Metric: FID | |
| Jan 2026 | CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation | • Affiliation: Zhejiang University • Method Name: RL-Video-Gen, Base Model: Qwen2-VL-7B, Strategy: GRPO • Benchmark Name: VideoGenBench, Data Number: 5000, Evaluation Metric: FID | |
| Jan 2026 | PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models | • Affiliation: Zhejiang University • Method Name: PhysRVG, Base Model: Wan2.2 5B, Strategy: GRPO • Benchmark Name: PhysRVGBench, Data Number: 700, Evaluation Metric: Intersection over Union (IoU), Trajectory Offset (TO) | |
| Jan 2026 | TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment | • Affiliation: The University of Hong Kong • Method Name: TAGRPO, Base Model: Wan 2.2, Strategy: GRPO • Method Name: TAGRPO, Base Model: HunyuanVideo-1.5, Strategy: GRPO • Benchmark Name: TAGRPO-Bench, Data Number: 200, Evaluation Metric: Q-Save (Visual Quality, Dynamic Quality, Image Alignment), HPSv3 | |
| Jan 2026 | Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning | • Affiliation: Northeastern University • Method Name: Diffusion-DRF, Base Model: Wan2.1-1.3B-T2V, Strategy: Differentiable Reward Fine-tuning | |
| Jan 2026 | Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models | • Affiliation: Harbin Institute of Technology • Method Name: LocalDPO, Base Model: Wan2.1-1.3B, Strategy: DPO • Method Name: LocalDPO, Base Model: CogVideoX-2B, Strategy: DPO • Method Name: LocalDPO, Base Model: CogVideoX-5B, Strategy: DPO | |
| Jan 2026 | Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model | • Affiliation: University of Science and Technology of China • Method Name: REACT, Base Model: Qwen2.5-VL-7B, Strategy: GRPO • Benchmark Name: REACT-Bench, Data Number: 2600, Evaluation Metric: Accuracy, F1-score, Precision, Recall | |
| Jan 2026 | A Versatile Multimodal Agent for Multimedia Content Generation | • Affiliation: University of Rochester • Method Name: MultiMedia-Agent, Base Model: MiniCPM-V2, Strategy: DPO • Benchmark Name: 18 real world task types, Data Number: 1260, Evaluation Metric: Dover Score, Pick Score, Human Alignment, Aesthetic Score, Psychological Appealing, Audio Video Alignment | |
| Dec 2025 | PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation | • Affiliation: Meta Superintelligence Labs • Method Name: PhyGDPO, Base Model: Wan2.1-T2V-14B, Strategy: DPO • Benchmark Name: PhyVidGen-135K, Data Number: 135K, Evaluation Metric: | |
| Dec 2025 | SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models | • Affiliation: Huazhong University of Science and Technology • Method Name: SoliReward, Base Model: HunyuanVideo, Strategy: GRPO • Method Name: SoliReward, Base Model: HunyuanVideo, Strategy: DPO • Benchmark Name: subject deformity and physical plausibility benchmark, Data Number: 50000, Evaluation Metric: RM Accuracy | |
| Dec 2025 | DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation | • Affiliation: ByteDance • Method Name: DreaMontage, Base Model: Seedance 1.0, Strategy: DPO | |
| Dec 2025 | VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization | • Affiliation: Brown University • Method Name: VIVA, Base Model: HunyuanVideo-T2V-13B, Strategy: GRPO | |
| Dec 2025 | Kling-Omni Technical Report | • Affiliation: Kuaishou Technology • Method Name: Kling-Omni, Base Model: , Strategy: DPO | |
| Dec 2025 | What Happens Next? Next Scene Prediction with a Unified Video Model | • Affiliation: Pennsylvania State University • Method Name: unified video model, Base Model: Qwen-VL, LTX, Strategy: GRPO • Benchmark Name: NSP dataset, Data Number: 0.97M samples for SFT, 8K samples for RL, 1K samples for test, Evaluation Metric: causal consistency | |
| Dec 2025 | OmniPerson: Unified Identity-Preserving Pedestrian Generation | • Affiliation: Zhejiang University • Method Name: RL-Video-Gen, Base Model: Qwen2-VL-7B, Strategy: GRPO • Benchmark Name: VideoGenBench, Data Number: 5000, Evaluation Metric: FID | |
| Dec 2025 | PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models | • Affiliation: Sun Yat-sen University • Method Name: Physical-Aware DPO, Base Model: WanX2.1 1.3B, Strategy: DPO • Benchmark Name: PID (Physical Implausibility Detection) dataset, Data Number: 3088, Evaluation Metric: F1 Score | |
| Nov 2025 | McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning | • Affiliation: Tongyi Lab, Alibaba Group • Method Name: McSc, Base Model: Qwen2-VL-7B-Instruct, Strategy: GRPO • Method Name: McDPO, Base Model: Wan2.1-T2V-1.3B, Strategy: DPO | |
| Nov 2025 | Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization | • Affiliation: Virginia Tech • Method Name: DPP-GRPO, Base Model: Qwen2-7b-Instruct, Strategy: GRPO • Benchmark Name: diverse video-prompt dataset, Data Number: 30,000, Evaluation Metric: TIE, TCE, CLIP | |
| Nov 2025 | Growing with the Generator: Self-paced GRPO for Video Generation | • Affiliation: University of Science and Technology of China • Method Name: Self-Paced GRPO, Base Model: Wan2.1-T2V, Strategy: GRPO • Method Name: Self-Paced GRPO, Base Model: HunyuanVideo, Strategy: GRPO | |
| Nov 2025 | PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection | • Affiliation: Beijing Institute of Technology • Method Name: PhysCorr, Base Model: , Strategy: DPO • Method Name: PhysicsRM, Base Model: LLaVA-Video-Qwen2-7B, Strategy: supervised learning with Huber loss • Method Name: PhyDPO, Base Model: , Strategy: reweighted DPO | |
| Nov 2025 | Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation | • Affiliation: ByteDance • Method Name: Reg-DPO, Base Model: Wan2.1-I2V-14B-720P, Strategy: DPO | |
| Nov 2025 | CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World | • Affiliation: Northwestern Polytechnical University, Xi’an Shaanxi, 710129, China • Method Name: CUE-R1, Base Model: Qwen2.5-VL-3B, Strategy: GRPO • Benchmark Name: CUEBENCH, Data Number: 2950, Evaluation Metric: hierarchy score | |
| Nov 2025 | ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation | • Affiliation: Peking University • Method Name: ID-COMPOSER, Base Model: Wan-Video-1.3B, Strategy: Flow-GRPO • Benchmark Name: OpenS2V-Nexus, Data Number: 218230, Evaluation Metric: NexusScore | |
| Nov 2025 | World Simulation with Video Foundation Models for Physical AI | • Affiliation: NVIDIA • Method Name: Cosmos-Predict2.5, Base Model: Cosmos-Reason1, Strategy: GRPO | |
| Oct 2025 | Emu3.5: Native Multimodal Models are World Learners | • Affiliation: BAAI • Method Name: Discrete Diffusion Adaptation, Base Model: Qwen3, Strategy: GRPO | |
| Oct 2025 | Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences | • Affiliation: School of Artificial Intelligence, University of Chinese Academy of Sciences • Method Name: Omni-RewardModel-BT, Base Model: MiniCPM-o-2.6, Strategy: Bradley-Terry • Method Name: Omni-RewardModel-R1, Base Model: Qwen2.5-VL-7B-Instruct, Strategy: GRPO-based reinforcement learning • Benchmark Name: Omni-RewardBench, Data Number: 3725, Evaluation Metric: accuracy | |
| Oct 2025 | LongCat-Video Technical Report | • Affiliation: Meituan • Method Name: LongCat-Video, Base Model: WAN2.1 VAE, Strategy: GRPO | |
| Oct 2025 | Epipolar Geometry Improves Video Generation Models | • Affiliation: University of Oxford • Method Name: Epipolar-DPO, Base Model: Wan-2.1, Strategy: DPO • Benchmark Name: large dataset of over 162,000 generated videos annotated with 3D scene consistency metrics, Data Number: 162000, Evaluation Metric: Sampson epipolar error | |
| Oct 2025 | RealDPO: Real or Not Real, that is the Preference | • Affiliation: University of Electronic Science and Technology of China • Method Name: RealDPO, Base Model: CogVideoX-5B, Strategy: DPO • Benchmark Name: RealAction-5K, Data Number: 5000, Evaluation Metric: Visual Alignment, Text Alignment, Motion Quality, Human Quality | |
| Oct 2025 | ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints | • Affiliation: UCAS • Method Name: ImagerySearch, Base Model: Wan2.1, Strategy: adaptive test-time search strategy • Benchmark Name: LDT-Bench, Data Number: 2839, Evaluation Metric: ImageryQA | |
| Oct 2025 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | • Affiliation: Alibaba Group • Method Name: Identity-GRPO, Base Model: Qwen2.5-VL-3B, Strategy: GRPO • Benchmark Name: multi-human identity-preserving preference benchmark, Data Number: 500, Evaluation Metric: Accuracy | |
| Oct 2025 | Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization | • Affiliation: Taobao & Tmall Group of Alibaba • Method Name: IPRO, Base Model: Wan 2.2 I2V, Strategy: reward-guided optimization with KL-divergence regularization and facial scoring mechanism | |
| Oct 2025 | PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning | • Affiliation: The University of Hong Kong • Method Name: PhysMaster, Base Model: , Strategy: DPO | |
| Oct 2025 | VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator | • Affiliation: ETH Zurich • Method Name: VIST3A, Base Model: , Strategy: direct reward finetuning | |
| Oct 2025 | Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback | • Affiliation: Guangzhou Quwan Network Technology • Method Name: Mask-CFG, Base Model: Wan2.1, Strategy: DPO | |
| Oct 2025 | VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning | • Affiliation: CUHK MMLab • Method Name: VR-Thinker, Base Model: Qwen2.5-VL-7B, Strategy: GRPO | |
| Oct 2025 | AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration | • Affiliation: Kling Team, Kuaishou Technology • Method Name: AVoCaDO GRPO, Base Model: Qwen2.5-Omni-7B, Strategy: GRPO | |
| Oct 2025 | iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation | • Affiliation: Nanyang Technological University, Singapore • Method Name: iMoWM, Base Model: , Strategy: model-based RL with DrQ-v2 | |
| Oct 2025 | Real-Time Motion-Controllable Autoregressive Video Diffusion | • Affiliation: Nanyang Technological University • Method Name: AR-Drag, Base Model: Wan2.1-1.3B, Strategy: GRPO • Benchmark Name: motion controllability benchmark, Data Number: 206, Evaluation Metric: Motion Consistency | |
| Oct 2025 | Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations | • Affiliation: University of California, Santa Barbara • Method Name: PresAesth, Base Model: Qwen-2.5-VL-7B, Strategy: GRPO • Benchmark Name: EvoPresent Benchmark, Data Number: 650 papers, 2000 slide pairs, Evaluation Metric: Perplexity, ROUGE-L, Layout Balance, Aesthetic Scores, MAE, F1-score, Accuracy | |
| Oct 2025 | OpusAnimation: Code-Based Dynamic Chart Generation | • Affiliation: Opus AI Research, Brown University • Method Name: Joint-Code-Visual Reward based Group Relative Policy Optimization (JCVR-GRPO), Base Model: Qwen2.5-VL-3B, Strategy: GRPO • Benchmark Name: DCG-Bench, Data Number: 700, Evaluation Metric: Execution Pass Rate, QA-based Scores | |
| Oct 2025 | MultiModal Action Conditioned Video Generation | • Affiliation: MIT CSAIL • Method Name: MultiModal Action Conditioned Video Generation, Base Model: I2VGen, Strategy: Video diffusion model with multimodal action conditioning and feature regularization | |
| Oct 2025 | MultiModal Action Conditioned Video Generation | • Affiliation: MIT CSAIL • Method Name: MultiModal Action Conditioned Video Generation, Base Model: , Strategy: Latent space projection and regularization with diffusion-based video generation | |
| Oct 2025 | Self-Forcing++: Towards Minute-Scale High-Quality Video Generation | • Affiliation: UCLA • Method Name: Self-Forcing++, Base Model: Wan2.1-T2V-1.3B, Strategy: GRPO (Group Relative Policy Optimization) | |
| Oct 2025 | VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL | • Affiliation: Department of Computer Science, University of Texas at Austin, Austin, TX, USA • Method Name: VidGuard-R1, Base Model: Qwen2.5-VL-7B, Strategy: GRPO • Benchmark Name: VidGuard-R1-CoT-30k, Data Number: 30000, Evaluation Metric: Top-1 accuracy • Benchmark Name: VidGuard-R1-RL-100k, Data Number: 100000, Evaluation Metric: Top-1 accuracy | |
| Oct 2025 | InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents | • Affiliation: Shanghai Jiao Tong University • Benchmark Name: InfoMosaic-Bench, Data Number: 621, Evaluation Metric: Accuracy, Pass Rate | |
| Oct 2025 | Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling | • Method Name: Poolformer, Base Model: , Strategy: Recurrent neural networks with pooling operations for long-sequence modeling | |
| Oct 2025 | EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels | • Affiliation: University of Bristol • Benchmark Name: EvoStruggle, Data Number: 2793 videos, 5385 annotated temporal struggle segments, Evaluation Metric: mAP at different IoU thresholds (0.3, 0.5, 0.7) | |
| Oct 2025 | LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration | • Affiliation: Laboratoire MAP5, UMR 8145, Université Paris Cité, CNRS • Method Name: LATINO, Base Model: , Strategy: Bayesian Langevin posterior sampling with Video Consistency Models (VCMs) and Image Consistency Models (ICMs) | |
| Oct 2025 | LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration | • Affiliation: Laboratoire MAP5, UMR 8145, Université Paris Cité, CNRS • Method Name: LATINO, Base Model: , Strategy: Langevin posterior sampling with stochastic auto-encoder steps | |
| Oct 2025 | EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory | • Affiliation: Johns Hopkins University • Benchmark Name: Spatial360, Data Number: 58000+, Evaluation Metric: FVD, LMSE, LPIPS, PSNR, SSIM, MEt3R, AUC@30 | |
| Sep 2025 | Stable Cinemetrics : Structured Taxonomy and Evaluation for Professional Video Generation | • Affiliation: Stability AI • Benchmark Name: StableCinemetrics, Data Number: 20K videos, Evaluation Metric: human evaluation (1-5 scale) | |
| Sep 2025 | How Far Do Time Series Foundation Models Paint the Landscape of Real-World Benchmarks ? | • Affiliation: University of Luxembourg • Benchmark Name: REAL-V-TSFM, Data Number: 6130, Evaluation Metric: MAPE, sMAPE, Agg. Relative WQL, Agg. Relative MASE | |
| Sep 2025 | V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs | • Affiliation: Shanghai Jiao Tong University • Benchmark Name: v-HUB, Data Number: 960, Evaluation Metric: BERTScore, SentBERT, METEOR | |
| Sep 2025 | Visual Jigsaw Post-Training Improves MLLMs | • Affiliation: S-Lab, Nanyang Technological University • Method Name: Visual Jigsaw, Base Model: Qwen2.5-VL-7B-Instruct, Strategy: GRPO | |
| Sep 2025 | FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation | • Affiliation: Peking University, Shenzhen Graduate School • Method Name: FlashI2V, Base Model: , Strategy: Flow Matching (FM) with Fourier-Guided Latent Shifting | |
| Sep 2025 | World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | • Affiliation: School of Computer Science and Engineering, Sun Yat-sen University, China • Method Name: World-Env, Base Model: OpenVLA-OFT, Strategy: PPO | |
| Sep 2025 | Fidelity-Aware Data Composition for Robust Robot Generalization | • Affiliation: UCAS-Terminus AI Lab, University of Chinese Academy of Sciences • Method Name: Coherent Information Fidelity Tuning (CIFT), Base Model: Cosmos-Predict2-2B-Video2World, Strategy: Feature-Space Signal-to-Noise Ratio optimization for data composition • Method Name: Multi-View Video Augmentation (MV Aug), Base Model: Cosmos-Predict2-2B-Video2World, Strategy: Latent diffusion transformer with periodic cross-view attention for video-to-video synthesis | |
| Sep 2025 | IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? | • Affiliation: Shanghai AI Lab, Zhejiang University • Benchmark Name: IWR-Bench, Data Number: 113, Evaluation Metric: Interactive Functionality Score (IFS) and Visual Fidelity Score (VFS) | |
| Sep 2025 | Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs | • Affiliation: Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany • Benchmark Name: SPLICE, Data Number: 3381, Evaluation Metric: Binary Accuracy, Hamming Accuracy, Longest Common Subsequence, Edit Distance | |
| Sep 2025 | PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control | • Affiliation: The University of Manchester • Method Name: PoseDiff, Base Model: , Strategy: DDPM (Denoising Diffusion Probabilistic Model) | |
| Sep 2025 | NeMo: Needle in a Montage for Video-Language Understanding | • Affiliation: The Chinese University of Hong Kong • Benchmark Name: NeMoBench, Data Number: 31,378, Evaluation Metric: Recall@1x, tIoU=0.7, Recall@1x, tIoU=0.5, Average mAP | |
| Sep 2025 | Training Agents Inside of Scalable World Models | • Affiliation: Google DeepMind • Method Name: Dreamer 4, Base Model: , Strategy: PMPO (Preference optimization as probabilistic inference) with task-conditioned policy and reward modeling | |
| Sep 2025 | Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers | • Affiliation: Apple • Method Name: SALT (Static-teacher Asymmetric Latent Training), Base Model: , Strategy: Two-stage self-supervised learning with frozen teacher for video representation learning | |
| Sep 2025 | Reinforcement Learning with Inverse Rewards for World Model Post-training | • Affiliation: Microsoft Research • Method Name: Reinforcement Learning with Inverse Rewards (RLIR), Base Model: , Strategy: Group Relative Policy Optimization (GRPO) | |
| Sep 2025 | AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities | • Affiliation: The University of Tokyo, Tokyo, Japan • Benchmark Name: AssemblyHands-X, Data Number: , Evaluation Metric: | |
| Sep 2025 | ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis | • Affiliation: Alibaba Group • Method Name: ReWatch-R1, Base Model: Qwen2.5-VL-7B, Strategy: GRPO (Group Relative Policy Optimization) | |
| Sep 2025 | WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving | • Affiliation: Nankai University | |
| Sep 2025 | VideoScore2: Think before You Score in Generative Video Evaluation | • Affiliation: University of Illinois Urbana-Champaign • Method Name: VIDEOSCORE2, Base Model: Qwen2.5-VL-7B-Instruct, Strategy: Group Relative Policy Optimization (GRPO) • Benchmark Name: VIDEOSCORE-BENCH-V2, Data Number: 500, Evaluation Metric: Accuracy, Relaxed Accuracy, PLCC | |
| Sep 2025 | Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs | • Affiliation: Princeton University • Benchmark Name: DEEPTRACEREWARD, Data Number: 4334, Evaluation Metric: Accuracy, Explanation score, BBox IoU, BBox Distance, Time Distance | |
| Sep 2025 | WoW: Towards a World omniscient World model Through Embodied Interaction | • Affiliation: Beijing Innovation Center of Humanoid Robotics • Method Name: WoW, Base Model: Cosmos2, Strategy: GRPO • Method Name: SOPHIA, Base Model: , Strategy: Self-optimizing framework with critic-refiner loop • Benchmark Name: WoWBench, Data Number: 606, Evaluation Metric: FVD, SSIM, PSNR, DINO, Dreamsim, Mask-guided Regional Consistency, Instruction Understanding, Physical common sense, Planning and Task Decomposition | |
| Sep 2025 | Drag4D: Align Your Motion with Text-Driven 3D Scene Generation | • Affiliation: KAIST • Method Name: Local-Global DragAnything, Base Model: , Strategy: Motion-conditioned video diffusion with part-augmented trajectory guidance • Benchmark Name: Drag4D-30, Data Number: 30, Evaluation Metric: CLIP-Score, Sharp, Colorful, Quality, PSNR, SSIM | |
| Sep 2025 | StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing | • Method Name: StableDub, Base Model: , Strategy: Diffusion-based visual dubbing with lip-habit-modulated mechanism and occlusion-aware training strategy | |
| Sep 2025 | DiTraj: training-free trajectory control for video diffusion transformer | • Affiliation: Beijing University of Posts and Telecommunications • Method Name: DiTraj, Base Model: Wan2.1, CogVideoX, Strategy: Foreground-background separation guidance and STD-RoPE position embedding modification | |
| Sep 2025 | Can AI Perceive Physical Danger and Intervene? | • Affiliation: Google DeepMind Robotics • Benchmark Name: ASIMOV-2.0, Data Number: 319, Evaluation Metric: Latent risk accuracy, Latent risk severity accuracy, Action effect accuracy, Activated risk accuracy • Benchmark Name: ASIMOV-2.0-Video, Data Number: 287, Evaluation Metric: Injury risk accuracy, Latent risk and severity accuracy, Last intervention timestamp MAE, Intervention rate • Benchmark Name: ASIMOV-2.0-Constraints, Data Number: 164, Evaluation Metric: Constraint violation rate | |
| Sep 2025 | VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding | • Affiliation: Carnegie Mellon University • Method Name: VideoJudge, Base Model: Qwen2.5-VL, Strategy: Generator-evaluator bootstrapping with iterative refinement and feedback • Benchmark Name: VideoJudgeLLaVA-MetaEval, Data Number: , Evaluation Metric: RMSE, MAE, Spearman, Pearson, ECE, PSup, Delta(C-D) • Benchmark Name: VideoJudgeVCG-MetaEval, Data Number: , Evaluation Metric: RMSE, MAE, Spearman, Pearson, ECE, PSup, Delta(C-D) • Benchmark Name: VideoJudge-Pairwise, Data Number: , Evaluation Metric: Accuracy • Benchmark Name: VideoJudge-Pairwise-H, Data Number: 200, Evaluation Metric: Accuracy | |
| Sep 2025 | MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | • Affiliation: HKUST (GZ) • Method Name: MOSS-ChatV, Base Model: Qwen2.5-7B, Strategy: GRPO • Benchmark Name: MOSS-Video, Data Number: 11654, Evaluation Metric: accuracy | |
| Sep 2025 | VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | • Affiliation: Zhejiang University • Method Name: VTTS, Base Model: Qwen2.5-VL-7B, Strategy: GRPO • Benchmark Name: VTTS-80K, Data Number: 80000, Evaluation Metric: | |
| Sep 2025 | KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models | • Affiliation: Department of Electronic Engineering, BNRist, Tsinghua University • Method Name: KeyWorld, Base Model: CogVideoX1.5-5B-I2V, Strategy: Diffusion Transformer fine-tuning with motion-aware key frame generation and interpolation | |
| Sep 2025 | LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs | • Affiliation: Carnegie Mellon University • Method Name: LLM Trainer, Base Model: , Strategy: Thompson Sampling for multi-armed bandit optimization of demonstration annotations | |
| Sep 2025 | SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding | • Affiliation: IIT Ropar, India • Method Name: SynchroRaMa, Base Model: Stable Diffusion 1.5, Strategy: Diffusion-based generation with multi-modal emotion embedding and audio-to-motion alignment | |
| Sep 2025 | When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset |
|
|
|
• Affiliation: Indian Institute of Technology Patna • Benchmark Name: ComVID, Data Number: 1175, Evaluation Metric: CR score, BLEU, ROUGE, BERTScore, MoverScore, METEOR, Perplexity, Flesch Reading Ease, Coleman-Liau Index |
|||
| Sep 2025 | Talking Head Generation via AU-Guided Landmark Prediction |
|
|
|
• Affiliation: Stony Brook University • Method Name: Variational Motion Generator (VMG), Base Model: , Strategy: Conditional Variational Autoencoder with flow-based prior and dilated convolutional architecture |
|||
| Sep 2025 | From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition |
|
|
|
• Affiliation: National Yang Ming Chiao Tung University • Benchmark Name: Controlled-Attribute-Transition Benchmark (CAT-Bench), Data Number: 120, Evaluation Metric: Wholistic Transition Score, Frame-wise Transition Score |
|||
| Sep 2025 | EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data |
|
|
|
• Affiliation: Georgia Institute of Technology • Method Name: EgoBridge, Base Model: , Strategy: Optimal Transport (OT) with Dynamic Time Warping (DTW) cost function for domain adaptation between human and robot data |
|||
| Sep 2025 | VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction |
|
|
|
• Affiliation: Waseda University • Benchmark Name: VIR-Bench, Data Number: 200, Evaluation Metric: F1 score |
|||
| Sep 2025 | VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation |
|
|
|
• Affiliation: University of Texas at Austin • Method Name: VLN-Zero, Base Model: , Strategy: vision-language model guided exploration with neurosymbolic navigation, hierarchical caching, and constraint-satisfying action generation |
|||
| Sep 2025 | ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion |
|
|
|
• Affiliation: Department of Computer Science, The University of Texas at Austin • Method Name: ComposableNav, Base Model: , Strategy: Denoising Diffusion Policy Optimization (DDPO) and PPO |
|||
| Sep 2025 |
|
|
|
|
• Affiliation: Santa Clara University • Benchmark Name: M3VIR, Data Number: 43200, Evaluation Metric: PSNR, SSIM, LPIPS, FID, DISTS |
|||
| Sep 2025 | Video-to-BT: Generating Reactive Behavior Trees from Human Demonstration Videos for Robotic Assembly |
|
|
|
• Affiliation: Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich, Germany • Method Name: Video-to-BT, Base Model: , Strategy: Behavior Tree-based execution with recovery mechanism |
|||
| Sep 2025 | Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization |
|
|
|
• Affiliation: Korea University • Method Name: CaRe-DPO, Base Model: VideoChat-Flash-7B, Strategy: DG-DPO (Dual-Group Direct Preference Optimization) |
|||
| Sep 2025 | RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation |
|
|
|
• Affiliation: SKL-IOTSC, Computer and Information Science, University of Macau • Method Name: RLGF, Base Model: , Strategy: Reinforcement Learning with Geometric Feedback (specifically using LoRA-based optimization with latent-space windowing and hierarchical geometric rewards) |
|||
| Sep 2025 | Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech |
|
|
|
• Affiliation: Australian National University |
|||
| Sep 2025 | PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models |
|
|
|
• Affiliation: Intelligent Robotics Laboratory, Skolkovo Institute of Science and Technology (Skoltech), Bolshoy Boulevard 30, bld. 1, Moscow 121205, Russia |
|||
| Sep 2025 | RewardDance: Reward Scaling in Visual Generation |
|
|
|
• Affiliation: ByteDance Seed • Method Name: RewardDance, Base Model: InternVL, Strategy: ReFL |
|||
| Sep 2025 | GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts |
|
|
|
• Affiliation: New York University • Benchmark Name: GeneVA, Data Number: 16356, Evaluation Metric: Average Precision (AP) scores at various IoU thresholds |
|||
| Sep 2025 | BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models |
|
|
|
• Affiliation: Peking University • Method Name: BranchGRPO, Base Model: Wan2.1-1.3B, Strategy: GRPO |
|||
| Sep 2025 | Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching |
|
|
|
• Affiliation: CreateAI (https://www.iamcreate.ai/) • Method Name: Coefficients-Preserving Sampling (CPS), Base Model: SD3.5-M, FLUX.1-schnell, FLUX.1-dev, Strategy: GRPO |
|||
| Sep 2025 | ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory |
|
|
|
• Affiliation: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University • Method Name: ManipDreamer3D, Base Model: , Strategy: |
|||
| Sep 2025 | PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? |
|
|
|
• Benchmark Name: MoCentric-Bench, Data Number: , Evaluation Metric: J (Region similarity), F (Contour accuracy), J&F (Average) |
|||
| Sep 2025 | FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework |
|
|
|
• Affiliation: AMAP, Alibaba Group; Tsinghua University • Method Name: FantasyHSI, Base Model: Wan2.1-I2V-14B, Strategy: DPO • Benchmark Name: SceneBench, Data Number: 120, Evaluation Metric: Penetration Obstacle Score (POS), Reaction Divergence Score (RDS) |
|||
| Sep 2025 | InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos |
|
|
|
• Affiliation: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) • Benchmark Name: InterPose, Data Number: 73,814, Evaluation Metric: |
|||
| Aug 2025 | EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control |
|
|
|
• Affiliation: Shanghai AI Laboratory • Method Name: EO-1, Base Model: Qwen2.5-VL, Strategy: flow matching denoising with auto-regressive decoding • Benchmark Name: EO-Bench, Data Number: 648, Evaluation Metric: completion score, accuracy |
|||
| Aug 2025 | Dress&Dance: Dress up and Dance as You Like It - Technical Preview |
|
|
|
• Affiliation: University of Illinois Urbana-Champaign • Method Name: Dress&Dance, Base Model: , Strategy: Diffusion-based video generation with CondNet conditioning network, multi-stage progressive training, and curriculum learning • Benchmark Name: Internet video dataset, Data Number: 80000, Evaluation Metric: PSNR, SSIM, LPIPS VGG, LPIPS AlexNet • Benchmark Name: Captured video dataset, Data Number: 18300, Evaluation Metric: PSNR, SSIM, LPIPS VGG, LPIPS AlexNet |
|||
| Aug 2025 | Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation |
|
|
|
• Affiliation: IEIT System Co., Ltd. • Method Name: GRPO, Base Model: , Strategy: Group Relative Policy Optimization (GRPO) • Benchmark Name: Droplet3D-4M, Data Number: 4 million, Evaluation Metric: PSNR, SSIM, LPIPS, MSE, CLIP-S |
|||
| Aug 2025 | InfinityHuman: Towards Long-Term Audio-Driven Human |
|
|
|
• Affiliation: ByteDance • Method Name: InfinityHuman, Base Model: , Strategy: reward feedback learning |
|||
| Aug 2025 | Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling |
|
|
|
• Affiliation: Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh • Method Name: Context-Aware Zero-Shot Anomaly Detection, Base Model: , Strategy: Contrastive and Predictive Spatiotemporal Modeling with InfoNCE and CPC losses |
|||
| Aug 2025 | Text-Driven 3D Hand Motion Generation from Sign Language Data |
|
|
|
• Affiliation: LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS • Method Name: HandMDM, Base Model: , Strategy: Diffusion models (not RL-based) • Benchmark Name: BOBSL3DT, Data Number: 1312339, Evaluation Metric: R@1, R@3, FID |
|||
| Aug 2025 | Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors |
|
|
|
• Affiliation: Beihang University • Method Name: GroupSketch, Base Model: , Strategy: Score Distillation Sampling (SDS) |
|||
| Aug 2025 | TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification |
|
|
|
• Affiliation: Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) • Method Name: Temporal Prompt Alignment (TPA), Base Model: , Strategy: Contrastive Learning with Margin-Hinge Loss • Method Name: Conditional Variational Autoencoder Style Modulation (CVAESM), Base Model: , Strategy: KL Divergence Regularization |
|||
| Aug 2025 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications |
|
|
|
• Affiliation: Mohamed bin Zayed University of AI • Benchmark Name: Dense-WebVid-CoVR, Data Number: 1.6 million, Evaluation Metric: Recall@K |
|||
| Aug 2025 | PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis |
|
|
|
• Affiliation: Beijing Institute of Technology • Method Name: PhysGM, Base Model: , Strategy: DPO • Benchmark Name: PhysAssets Dataset, Data Number: 24000+, Evaluation Metric: |
|||
| Aug 2025 | MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents |
|
|
|
• Affiliation: ByteDance • Benchmark Name: MM-BrowseComp, Data Number: 224, Evaluation Metric: Overall Accuracy (OA), Strict Accuracy (SA), Average Checklist Score (AVG CS) |
|||
| Aug 2025 | Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey |
|
|
|
• Affiliation: Harbin Institute of Technology (Shenzhen) • Paper Number: 244 |
|||
| Aug 2025 | Express4D: Expressive, Friendly, and Extensible 4D Facial Motion Generation Benchmark |
|
|
|
• Affiliation: Tel Aviv University • Benchmark Name: Express4D, Data Number: 1205, Evaluation Metric: FID, R-precision, Diversity, Multimodal Distance |
|||
| Aug 2025 | VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models |
|
|
|
• Affiliation: Harbin Institute of Technology (Shenzhen) • Method Name: McDPO, Base Model: Phi3-3.8B, Strategy: DPO |
|||
| Aug 2025 | CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models |
|
|
|
• Affiliation: Fudan University • Method Name: CineTrans, Base Model: , Strategy: Masked Diffusion with Attention Mechanism • Benchmark Name: Cine250K, Data Number: 250K, Evaluation Metric: Transition Control Score, Inter-shot Consistency, Intra-shot Consistency, Aesthetic Quality, Semantic Consistency |
|||
| Aug 2025 | FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation |
|
|
|
• Affiliation: AMAP, Alibaba Group • Method Name: Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), Base Model: Wan2.1, Strategy: DPO • Benchmark Name: Talking-NSQ, Data Number: 410K, Evaluation Metric: Preference Accuracy |
|||
| Aug 2025 | Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation |
|
|
|
• Affiliation: The Hong Kong University of Science and Technology • Method Name: PhysHPO, Base Model: CogVideoX-2B, CogVideoX-5B, HunyuanVideo-540p, Strategy: DPO (Direct Preference Optimization) |
|||
| Aug 2025 | Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances |
|
|
|
• Affiliation: Institute of Artificial Intelligence (TeleAI), China Telecom. • Paper Number: 164 |
|||
| Aug 2025 | ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video |
|
|
|
• Affiliation: Department of Computer Science, AIUB, Dhaka, Bangladesh • Benchmark Name: ViMoNet-Bench, Data Number: , Evaluation Metric: GPT-3.5-turbo scoring (0-5) |
|||
| Aug 2025 | Animate-X++: Universal Character Image Animation with Dynamic Backgrounds |
|
|
|
• Affiliation: School of Computing and Data Science, The University of Hong Kong • Method Name: Animate-X++, Base Model: WanX2.1, Strategy: Multi-task training with partial parameter training and pose transformation simulation • Benchmark Name: A2Bench, Data Number: 500, Evaluation Metric: PSNR, SSIM, L1, LPIPS, FID, FID-VID, FVD, CLIP Score, Background Consistency, Motion Smoothness, Aesthetic Quality, Image Quality |
|||
| Aug 2025 | Yan: Foundational Interactive Video Generation |
|
|
|
• Affiliation: Tencent • Method Name: Yan-Sim, Base Model: , Strategy: PPO |
|||
| Aug 2025 | Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization |
|
|
|
• Affiliation: Alibaba Digital Media and Entertainment Group • Method Name: SSPO (Segment Supervised Preference Optimization), Base Model: Llama3.1-8B-Chinese-Chat, GLM-4-9B-Chat, Qwen2.5-14B-Instruct, Strategy: DPO (Direct Preference Optimization) |
|||
| Aug 2025 | BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok |
|
|
|
• Affiliation: USC Information Sciences Institute • Benchmark Name: BigTok, Data Number: 2210, Evaluation Metric: Accuracy, Precision, Recall, F1-score |
|||
| Aug 2025 | SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment |
|
|
|
• Affiliation: Fudan University • Method Name: SwiftVideo, Base Model: Wan2.1-FUN-inp-480p-1.3B, Strategy: DPO |
|||
| Aug 2025 | V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models |
|
|
|
• Affiliation: Yonsei University • Method Name: ReDPO, Base Model: , Strategy: DPO • Method Name: V.I.P., Base Model: , Strategy: DPO |
|||
| Jul 2025 | Controllable Video Generation: A Survey |
|
|
|
• Affiliation: The Hong Kong University of Science and Technology • Paper Number: 416 |
|||
| Jul 2025 | Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration |
|
|
|
• Affiliation: Zhejiang University • Method Name: IP-FVR, Base Model: , Strategy: identity-preserving feedback learning • Benchmark Name: YouRef, Data Number: , Evaluation Metric: PSNR, SSIM, LPIPS, CLIP-IQA, MUSIQ, LIQE, IDS, $E_{warp}$, $\sigma_{IDS}$ |
|||
| Jul 2025 | EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation |
|
|
|
• Affiliation: Terminal Technology Department, Alipay, Ant Group • Method Name: EchoMimicV3, Base Model: Wan2.1-FUN-inp-480p-1.3B, Strategy: DPO |
|||
| Jul 2025 | LongAnimation: Long Animation Generation with Dynamic Global-Local Memory |
|
|
|
• Affiliation: University of Science and Technology of China • Method Name: LongAnimation, Base Model: CogVideoX-1.5-5B, Strategy: NGR |
|||
| Jun 2025 | Video Perception Models for 3D Scene Synthesis |
|
|
|
• Affiliation: Tsinghua University |
|||
| Jun 2025 | RDPO: Real Data Preference Optimization for Physics Consistency Video Generation |
|
|
|
• Affiliation: Fudan University • Method Name: Real Data Preference Optimization (RDPO), Base Model: LTX-Video-2B, Strategy: DPO |
|||
| Jun 2025 | VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning |
|
|
|
• Affiliation: School of Electronic and Computer Engineering, Peking University • Method Name: VQ-Insight, Base Model: Qwen-2.5-VL-7B-Instruct, Strategy: GRPO |
|||
| Jun 2025 | Toward Rich Video Human-Motion2D Generation |
|
|
|
• Affiliation: Tongji University • Method Name: RVHM2D, Base Model: None, Strategy: Fine-tuning with an FID-based reward • Benchmark Name: Motion2D-Video-150K, Data Number: 150000, Evaluation Metric: R-Precision, FID, MM Dist, Diversity |
|||
| Jun 2025 | AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation |
|
|
|
• Affiliation: ByteDance • Method Name: AlignHuman, Base Model: , Strategy: Timestep-Segment Preference Optimization (TPO) |
|||
| Jun 2025 | Multimodal Large Language Models: A Survey |
|
|
|
• Affiliation: School of Architecture, Technology and Engineering, University of Brighton • Method Name: Video Diffusion Alignment via Reward Gradients, Base Model: , Strategy: Reward Gradients • Method Name: Diffusion Model Alignment Using Direct Preference Optimization, Base Model: , Strategy: Direct Preference Optimization • Method Name: VADER, Base Model: , Strategy: Backpropagating Reward Gradients • Benchmark Name: MJ-VIDEO, Data Number: , Evaluation Metric: Fine-Grained Benchmarking and Rewarding Video Preferences • Benchmark Name: VideoScore, Data Number: , Evaluation Metric: Simulating Fine-grained Human Feedback for Video Generation |
|||
| Jun 2025 | Seedance 1.0: Exploring the Boundaries of Video Generation Models |
|
|
|
• Affiliation: ByteDance • Method Name: Human Feedback Alignment (RLHF), Base Model: , Strategy: Reward feedback learning with multiple reward models (Foundational Reward Model, Motion Reward Model, Aesthetic Reward Model) |
|||
| Jun 2025 | ContentV: Efficient Training of Video Generation Models with Limited Compute |
|
|
|
• Affiliation: ByteDance Douyin Content Group • Method Name: Reinforcement Learning from Human Feedback (RLHF), Base Model: Stable Diffusion 3.5 Large (SD3.5L), Strategy: Optimizing conditional distribution pθ(x1|c) with reward model r(c, x1) and KL-divergence regularization |
|||
| May 2025 | Photography Perspective Composition: Towards Aesthetic Perspective Recommendation |
|
|
|
• Affiliation: East China University of Science and Technology • Method Name: Photography Perspective Composition (PPC), Base Model: , Strategy: DPO |
|||
| May 2025 | Scaling Image and Video Generation via Test-Time Evolutionary Search |
|
|
|
• Affiliation: Hong Kong University of Science and Technology • Method Name: EvoSearch, Base Model: , Strategy: |
|||
| May 2025 | InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO |
|
|
|
• Affiliation: MAPLE Lab, Westlake University • Method Name: InfLVG, Base Model: , Strategy: GRPO • Benchmark Name: CsVBench, Data Number: 1000, Evaluation Metric: HPSv2, Aesthetic Score, CLIP-Flan, ViCLIP, ArcFace-42M, ArcFace-360K, QWen |
|||
| May 2025 | AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection |
|
|
|
• Affiliation: School of Electronic and Computer Engineering, Peking University • Method Name: AvatarShield, Base Model: Qwen2.5-VL-7B, Strategy: GRPO • Benchmark Name: FakeHumanVid, Data Number: 15000, Evaluation Metric: AUC |
|||
| May 2025 | RLVR-World: Training World Models with Reinforcement Learning |
|
|
|
• Affiliation: School of Software, BNRist, Tsinghua University • Method Name: RLVR-World, Base Model: , Strategy: GRPO |
|||
| May 2025 | Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models |
|
|
|
• Affiliation: University of Michigan • Benchmark Name: Temporally-Grounded Language Generation (TGLG), Data Number: 16487, Evaluation Metric: TRACE |
|||
| May 2025 | Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models |
|
|
|
• Affiliation: MMLab, CUHK, Hong Kong • Method Name: Negative Preference Optimization (NPO), Base Model: , Strategy: Diffusion-NPO |
|||
| May 2025 | DanceGRPO: Unleashing GRPO on Visual Generation |
|
|
|
• Affiliation: ByteDance Seed • Method Name: DanceGRPO, Base Model: , Strategy: GRPO |
|||
| May 2025 | VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding |
|
|
|
• Affiliation: University of Maryland, College Park • Benchmark Name: VideoHallu, Data Number: 3000, Evaluation Metric: |
|||
| Apr 2025 | TesserAct: Learning 4D Embodied World Models |
|
|
|
• Affiliation: UMass Amherst • Method Name: TesserAct, Base Model: , Strategy: |
|||
| Apr 2025 | Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning |
|
|
|
• Affiliation: Zhejiang University • Method Name: Phys-AR, Base Model: Llama3.1-8B, Strategy: GRPO |
|||
| Apr 2025 | SkyReels-V2: Infinite-length Film Generative Model |
|
|
|
• Affiliation: Skywork AI • Method Name: SkyReels-V2, Base Model: Qwen2-VL-7B, Strategy: DPO |
|||
| Apr 2025 | FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos |
|
|
|
• Affiliation: AMAP, Alibaba Group • Method Name: FingER, Base Model: Qwen2.5-VL, Strategy: GRPO • Benchmark Name: FingER-Instruct-60k, Data Number: 60000, Evaluation Metric: |
|||
| Apr 2025 | Aligning Anime Video Generation with Human Feedback |
|
|
|
• Affiliation: Fudan University • Method Name: Gap-Aware Preference Optimization (GAPO), Base Model: , Strategy: Direct Preference Optimization (DPO) • Benchmark Name: AnimeReward, Data Number: 30000, Evaluation Metric: multi-dimensional reward scores |
|||
| Apr 2025 | Discriminator-Free Direct Preference Optimization for Video Diffusion |
|
|
|
• Affiliation: Zhejiang University • Method Name: Discriminator-Free Video Preference Optimization (DF-VPO), Base Model: , Strategy: DPO |
|||
| Apr 2025 | Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments |
|
|
|
• Affiliation: University of Trento, Italy • Benchmark Name: Morpheus, Data Number: 80, Evaluation Metric: Dynamical Score |
|||
| Apr 2025 | OmniCam: Unified Multimodal Video Generation via Camera Control |
|
|
|
• Affiliation: Zhejiang University • Method Name: OmniCam, Base Model: Llama3.1, Strategy: PPO • Benchmark Name: OmniTr, Data Number: 1000 trajectories, 10,000 descriptions, 30,000 videos, Evaluation Metric: M_starttime, M_endtime, M_speed, M_rotate, M_direction |
|||
| Mar 2025 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization |
|
|
|
• Affiliation: The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University • Method Name: VPO, Base Model: LLaMA3-8B-Instruct, Strategy: DPO |
|||
| Mar 2025 | Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors |
|
|
|
• Affiliation: The University of Hong Kong • Method Name: Physics-based HOI Refinement, Base Model: , Strategy: Actor-Critic with Gaussian Policy |
|||
| Mar 2025 | Judge Anything: MLLM as a Judge Across Any Modality |
|
|
|
• Affiliation: Huazhong University of Science and Technology • Benchmark Name: TASKANYTHING, Data Number: 1500, Evaluation Metric: • Benchmark Name: JUDGE ANYTHING, Data Number: 9000, Evaluation Metric: Agreement, Pearson correlation, Spearman correlation, MAE, Accuracy |
|||
| Mar 2025 | MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization |
|
|
|
• Affiliation: Zhejiang University • Method Name: MagicID, Base Model: , Strategy: DPO |
|||
| Mar 2025 | Unified Reward Model for Multimodal Understanding and Generation |
|
|
|
• Affiliation: Fudan University • Method Name: UnifiedReward, Base Model: LLaVA-OneVision-7B, Strategy: DPO |
|||
| Feb 2025 | Pre-Trained Video Generative Models as World Simulators |
|
|
|
• Affiliation: Hong Kong University of Science and Technology • Method Name: Dynamic World Simulation (DWS), Base Model: , Strategy: PPO |
|||
| Feb 2025 | Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models |
|
|
|
• Affiliation: Carnegie Mellon University • Method Name: HALO, Base Model: , Strategy: DPO |
|||
| Feb 2025 | IPO: Iterative Preference Optimization for Text-to-Video Generation |
|
|
|
• Affiliation: Shanghai Academy of Artificial Intelligence for Science • Method Name: Iterative Preference Optimization (IPO), Base Model: , Strategy: DPO |
|||
| Feb 2025 | MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation |
|
|
|
• Affiliation: UNC-Chapel Hill • Benchmark Name: MJ-BENCH-VIDEO, Data Number: 5421, Evaluation Metric: |
|||
| Feb 2025 | HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment |
|
|
|
• Affiliation: Zhejiang University • Method Name: HuViDPO, Base Model: , Strategy: DPO |
|||
| Feb 2025 | Zeroth-order Informed Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer |
|
|
|
• Affiliation: Guanghua School of Management, Peking University • Method Name: Recursive Likelihood Ratio (RLR) optimizer, Base Model: , Strategy: |
|||
| Jan 2025 | Improving Video Generation with Human Feedback |
|
|
|
• Affiliation: The Chinese University of Hong Kong • Method Name: Flow-DPO, Base Model: , Strategy: DPO • Method Name: Flow-RWR, Base Model: , Strategy: RWR • Method Name: Flow-NRG, Base Model: , Strategy: Reward Guidance • Benchmark Name: VideoGen-RewardBench, Data Number: 26500, Evaluation Metric: |
|||
| Dec 2024 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation |
|
|
|
• Affiliation: Tsinghua University • Method Name: Multi-Objective Preference Optimization (MPO), Base Model: , Strategy: DPO |
|||
| Dec 2024 | OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization |
|
|
|
• Affiliation: The University of Hong Kong • Method Name: OnlineVPO, Base Model: , Strategy: DPO |
|||
| Dec 2024 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation |
|
|
|
• Affiliation: HKUST • Method Name: VideoDPO, Base Model: , Strategy: DPO |
|||
| Dec 2024 | FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks |
|
|
|
• Affiliation: National University of Singapore • Method Name: FLIP, Base Model: , Strategy: model-based planning |
|||
| Dec 2024 | The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control |
|
|
|
• Affiliation: Tongyi Lab • Method Name: The Matrix, Base Model: , Strategy: Shift-Window Denoising Process Model (Swin-DPM) |
|||
| Dec 2024 | Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback |
|
|
|
• Affiliation: The University of Tokyo • Method Name: RL-Finetuning for Text-to-Video Models, Base Model: , Strategy: RWR, DPO |
|||
| Nov 2024 | Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models |
|
|
|
• Affiliation: Kim Jaechul Graduate School of AI, KAIST • Method Name: Free2Guide, Base Model: , Strategy: Path Integral Control |
|||
| Nov 2024 | A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model |
|
|
|
• Affiliation: SSE, The Chinese University of Hong Kong, Shenzhen • Method Name: RL-based editing framework, Base Model: , Strategy: actor-critic |
|||
| Oct 2024 | Video to Video Generative Adversarial Network for Few-shot Learning Based on Policy Gradient |
|
|
|
• Affiliation: Northwestern University • Method Name: RL-V2V-GAN, Base Model: , Strategy: Policy Gradient |
|||
| Oct 2024 | WorldSimBench: Towards Video Generation Models as World Simulators |
|
|
|
• Affiliation: The Chinese University of Hong Kong, Shenzhen • Benchmark Name: WorldSimBench, Data Number: 35701, Evaluation Metric: Human Preference Evaluator |
|||
| Oct 2024 | Animating the Past: Reconstruct Trilobite via Video Generation |
|
|
|
• Affiliation: AI Lab, Yishi Inc. • Method Name: Automatic T2V Prompt Learning Method, Base Model: , Strategy: KTO |
|||
| Oct 2024 | VideoAgent: Self-Improving Video Generation |
|
|
|
• Affiliation: University of Waterloo • Method Name: VideoAgent, Base Model: , Strategy: self-improvement through online finetuning |
|||
| Oct 2024 | E-Motion: Future Motion Simulation via Event Sequence Diffusion |
|
|
|
• Affiliation: Xidian University • Method Name: Event-Sequence Diffusion Network, Base Model: , Strategy: PPO |
|||
| Oct 2024 | DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control |
|
|
|
• Affiliation: ETH Zürich • Method Name: DART, Base Model: , Strategy: PPO |
|||
| Oct 2024 | SePPO: Semi-Policy Preference Optimization for Diffusion Alignment |
|
|
|
• Affiliation: University of Rochester • Method Name: SePPO, Base Model: , Strategy: DPO |
|||
| Jul 2024 | Video Diffusion Alignment via Reward Gradients |
|
|
|
• Affiliation: Carnegie Mellon University • Method Name: VADER, Base Model: , Strategy: Reward Gradients |
|||
| Dec 2023 | InstructVideo: Instructing Video Diffusion Models with Human Feedback |
|
|
|
• Affiliation: Zhejiang University • Method Name: InstructVideo, Base Model: , Strategy: reward fine-tuning |
|||
| Nov 2023 | AdaDiff: Adaptive Step Selection for Fast Diffusion Models |
|
|
|
• Affiliation: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University • Method Name: AdaDiff, Base Model: , Strategy: policy gradient |
|||
If you have a paper or are aware of relevant research that should be incorporated, please contribute via pull requests, issues, email, or other suitable methods.