
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng¹†, Liang Hou², Xin Tao², Jing Liao¹
¹City University of Hong Kong  ²Kling Team, Kuaishou Technology
†This work was conducted during the author's internship at the Kling Team, Kuaishou Technology.

Website | arXiv | HF Dataset: Video-as-Answer

🔎 Introduction

Teaser Image

We pioneer Video-Next-Event Prediction (VNEP), extending text-based next-event prediction to dynamic video responses. This shift from telling to showing enables more intuitive and customized answers for procedural learning and creative exploration.

To tackle VNEP, we propose VANS, a model that aligns a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) through our Joint-GRPO post-training approach. Our method bridges the semantic-to-visual gap between the VLM and the VDM, enabling high-quality video event prediction and generation.

🏗️ Method

VANS Architecture
VANS Architecture: Dual-path processing with VLM for reasoning and VDM for generation
Joint-GRPO
Joint-GRPO: Two-stage co-steering optimization

Key Components

VANS Architecture: Processes input videos and questions through dual pathways (a minimal sketch follows the bullets below):

  • VLM Path: Performs instruction-grounded reasoning to generate textual captions
  • VDM Path: Synthesizes videos conditioned on semantic captions and visual context
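
As a rough mental model, the dual-path flow can be sketched as follows. This is a minimal, illustrative sketch; the object and method names here are hypothetical, not the repo's actual API:

def predict_next_event(vlm, vdm, input_video, question):
    # VLM path: instruction-grounded reasoning over the input video
    # produces a textual caption of the predicted next event.
    caption = vlm.generate_caption(video=input_video, prompt=question)

    # VDM path: synthesize the answer video conditioned on both the
    # semantic caption and the visual context of the input frames.
    next_video = vdm.generate(prompt=caption, context_frames=input_video)

    return caption, next_video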

Joint-GRPO: Our two-stage reinforcement learning approach (see the advantage sketch after this list):

  • Stage 1: Visualization-friendly VLM tuning - optimizes captions for visual plausibility
  • Stage 2: Context-faithful VDM adaptation - ensures semantic alignment and visual coherence
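
Both stages follow the GRPO recipe of sampling a group of outputs and normalizing each reward against the group's statistics. A minimal sketch of that group-relative advantage, with hypothetical scalar rewards standing in for the paper's visual-plausibility and alignment terms:

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: each sample's reward is normalized against
    # the mean and standard deviation of its sampled group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Stage 1 would score sampled captions for visual plausibility;
# Stage 2 would score generated videos for semantic alignment and coherence.
print(group_relative_advantages([0.8, 0.3, 0.6, 0.9]))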

🎬 Results

🍳 Procedural Teaching

| Case | Input Video | Question | VANS Output |
| --- | --- | --- | --- |
| 1 | Input Video 1 | "Show me the next step for baked chicken Parmesan." | Output Video 1 |
| 2 | Input Video 2 | "Hi, I want to make slime. What should I do next?" | Output Video 2 |
| 3 | Input Video 3 | "Hey AI assistant, I'm making a paper windmill and just uploaded a video. What should I do next?" | Output Video 3 |

🔮 Multi-Future Prediction

Same input video, different questions lead to diverse future predictions:

Input Video: Kitchen Input
  • Realistic Reaction: "What if she gets burned in her daily life?"
  • Dramatic Reaction: "What if she gets burned in an exaggerated movie?"
  • Comedic Reaction: "What if she eats something spicy in an exaggerated movie?"

Input Video: Emotional Input
  • Grandson Reaction: "Show her reaction if she sees her grandson."
  • Husband Reaction: "Show her reaction if she sees her husband."
  • Death Reaction: "Show her reaction if she sees the personification of death."

🚀 Quick Start

🎯 Environment Setup

To set up the environment for inference, run the following commands:

git clone https://github.com/KlingTeam/VANS.git
cd VANS

conda create -n VANS python=3.12 -y
conda activate VANS

pip install -r requirements.txt
cd vans/models_mllm/qwen-vl-utils
pip install -e ".[decord]"
cd ../../..
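
To sanity-check the install, a quick optional snippet (this assumes requirements.txt pulls in PyTorch; qwen-vl-utils was installed in editable mode above):

# Verify that the core dependencies import and CUDA is visible.
import torch
from qwen_vl_utils import process_vision_info  # noqa: F401

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())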

🌎 Download Models

To get started, download the VANS base models:

Then download the complete VANS model:
VANS Model Download (Coming Soon)

🧸 Demo

To run the local Gradio demo:

python app.py
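
For orientation, a demo of this kind typically wires a video and a question into the model and returns the generated clip. A minimal sketch, assuming a hypothetical run_vans wrapper (app.py's actual internals may differ):

import gradio as gr

def run_vans(video_path, question):
    # Placeholder: the real demo would invoke the VLM + VDM pipeline here.
    caption = f"Predicted next event for: {question}"
    return caption, video_path  # echo the input video as a stand-in

demo = gr.Interface(
    fn=run_vans,
    inputs=[gr.Video(label="Input video"), gr.Textbox(label="Question")],
    outputs=[gr.Textbox(label="Predicted caption"), gr.Video(label="Next-event video")],
    title="VANS: Video-as-Answer",
)
demo.launch()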

🚩 Plan

  • Release VANS-Data-100K dataset
  • Release VANS model
  • Release training code
  • Release inference code
  • Release paper

📜 Citation

If you find our work helpful, please consider giving it a star 🌟 and a citation 📝:

@article{cheng2025video,
  title={Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO},
  author={Cheng, Junhao and Hou, Liang and Tao, Xin and Liao, Jing},
  journal={arXiv preprint arXiv:2511.16669},
  year={2025}
}
