Nice work!
If I understand correctly, you pass the video to the image VL model in ImageGrid format.
For example:
Fast path:
Frame-1, Frame-2, Frame-3, Frame-4, Frame-5
Slow path:
Frame-2, Frame-4
Then you concatenate them and get:
Frame-1, Frame-2, Frame-3, Frame-4, Frame-5, Frame-2, Frame-4
Finally, you feed these tokens to the VL model to get the result.
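
Here is a minimal sketch of the sampling/concatenation I have in mind (the function name, strides, and dummy frame shapes are just my own assumptions, not taken from your code):

```python
import numpy as np

def sample_fast_slow(frames, fast_stride=1, slow_stride=2):
    """Toy illustration of the fast/slow sampling described above.

    `frames` is a list of decoded video frames (numpy arrays here).
    The fast path keeps densely sampled frames; the slow path keeps a
    sparser subset. The strides are made up for illustration only.
    """
    fast = frames[::fast_stride]   # e.g. Frame-1 .. Frame-5
    slow = frames[1::slow_stride]  # e.g. Frame-2, Frame-4
    # Concatenated sequence that would be fed to the VL model
    return fast + slow

# Example with 5 dummy frames:
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(5)]
combined = sample_fast_slow(frames)
print(len(combined))  # 5 fast frames + 2 slow frames = 7
```
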
Why can the model still understand the video when some of its frames appear twice in the input?