Open
Description
Thank you for open-sourcing this impressive work! I have a question about your time-grounding procedure:
When you prompt the model to predict a time interval, do you feed it explicit timestamps for every frame? If not, how does the model infer the video's temporal scale, such as the frame rate (frames per second), from a sequence of images alone?
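To make the question concrete, here is a minimal sketch of what I mean by "explicit timestamps for every frame". It assumes frames are sampled uniformly over a known video duration. The names `frame_timestamps` and `build_prompt` and the `<image>` placeholder are hypothetical, made up for illustration, and are not taken from this repository.

```python
def frame_timestamps(num_frames: int, duration_s: float) -> list[float]:
    """Return the midpoint time (in seconds) of each uniformly sampled frame.

    Assumption: the video is split into num_frames equal segments and one
    frame is taken from the middle of each segment.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = duration_s / num_frames
    return [round((i + 0.5) * step, 2) for i in range(num_frames)]


def build_prompt(num_frames: int, duration_s: float, query: str) -> str:
    """Interleave a textual timestamp with each (hypothetical) image placeholder."""
    stamps = frame_timestamps(num_frames, duration_s)
    lines = [f"Frame {i} at {t:.2f}s: <image>" for i, t in enumerate(stamps)]
    lines.append(f"Query: {query}")
    lines.append("Answer with the start and end time in seconds.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(4, 8.0, "When does the person open the door?"))
```

Without something like the per-frame `at ...s` text above, I do not see how the sampling rate could be recovered from the images alone, which is why I am asking how it is handled here.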