Question about time grounding: how does the model know each frame's timestamp?

Thank you for open-sourcing this impressive work! I have a question about your time-grounding procedure:

When you prompt the model to predict a time interval, do you feed it explicit timestamps for every frame? If not, how does the model infer the video’s temporal scale—such as the frames-per-second rate—from a sequence of images alone?
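
To make the question concrete, here is a minimal sketch of what I mean by "explicit timestamps for every frame". It is only my guess at one possible approach, not something taken from your code: the function names `frame_timestamps` and `build_prompt`, the uniform sampling, and the `<frame at X.Xs>` text format are all hypothetical.

```python
def frame_timestamps(num_frames_total, video_fps, num_sampled):
    """Uniformly sample frame indices and return their timestamps in seconds.

    Hypothetical helper: assumes uniform sampling over the whole video.
    """
    if num_frames_total <= 0 or video_fps <= 0 or num_sampled <= 0:
        raise ValueError("all arguments must be positive")
    if num_sampled == 1:
        indices = [0]
    else:
        # Spread the sampled indices evenly from the first to the last frame.
        step = (num_frames_total - 1) / (num_sampled - 1)
        indices = [round(i * step) for i in range(num_sampled)]
    # A frame's timestamp is its index divided by the source frame rate.
    return [idx / video_fps for idx in indices]


def build_prompt(timestamps):
    """Pair each sampled frame with a textual timestamp (format is an assumption)."""
    return " ".join(f"<frame at {t:.1f}s>" for t in timestamps)


if __name__ == "__main__":
    ts = frame_timestamps(num_frames_total=91, video_fps=30.0, num_sampled=4)
    print(build_prompt(ts))
```

If nothing like this is passed to the model, I would like to understand how a predicted interval in seconds gets tied back to the sampled frames.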