Nice work!
If I understand correctly, you pass the video to the image VL model in ImageGrid format.
For example:
Fast path:
Frame-1, Frame-2, Frame-3, Frame-4, Frame-5
Slow path:
Frame-2, Frame-4
Then you concatenate them and get:
Frame-1, Frame-2, Frame-3, Frame-4, Frame-5, Frame-2, Frame-4
Finally, you feed these tokens to the VL model to get the result.
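
Here is a minimal sketch of the sampling/concatenation I have in mind (the function name, strides, and dummy frame shapes are just my own assumptions, not taken from your code):

```python
import numpy as np

def sample_fast_slow(frames, fast_stride=1, slow_stride=2):
    """Toy illustration of the fast/slow sampling described above.

    `frames` is a list of decoded video frames (numpy arrays here).
    The fast path keeps densely sampled frames; the slow path keeps a
    sparser subset. The strides are made up for illustration only.
    """
    fast = frames[::fast_stride]   # e.g. Frame-1 .. Frame-5
    slow = frames[1::slow_stride]  # e.g. Frame-2, Frame-4
    # Concatenated sequence that would be fed to the VL model
    return fast + slow

# Example with 5 dummy frames:
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(5)]
combined = sample_fast_slow(frames)
print(len(combined))  # 5 fast frames + 2 slow frames = 7
```
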
Why can the model still understand the video when some of its frames appear twice in the input?