Open
Thank you for your interesting work. I noticed that the reported SQA3D results are based on GPT4Scene's data, which uses Mask3D priors to preprocess the video frames and annotate object identifiers on the images. However, I am not sure it is reasonable to rely on such strong 3D priors in a unified video-based MLLM framework.
Have you evaluated the performance on SQA3D without using object identifiers?