Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
When using opengvlab/internvl3-14b for a 2D grounding task, I have observed inaccurate bounding boxes that are offset from the objects. Is this expected? I am aware that InternVL uses a 1000x1000 pixel coordinate system, so I rescale the predicted coordinates to the image size as follows:
abs_x1 = int(min(max(x1 / 1000 * width, 0), width))
abs_y1 = int(min(max(y1 / 1000 * height, 0), height))
abs_x2 = int(min(max(x2 / 1000 * width, 0), width))
abs_y2 = int(min(max(y2 / 1000 * height, 0), height))
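For reference, the same scaling can be wrapped in a small helper so all four coordinates go through one code path. This is only a sketch: the function name `scale_bbox` and the `grid` parameter are hypothetical, and it assumes the model's coordinates are normalized to 0..1000 on both axes. It multiplies before dividing, which is algebraically the same as the lines above but avoids float truncation surprises for integer inputs.

```python
def scale_bbox(bbox, width, height, grid=1000):
    """Map [x1, y1, x2, y2] from a 0..grid system to pixel coordinates.

    Hypothetical helper; assumes both axes are normalized to `grid`.
    """
    x1, y1, x2, y2 = bbox

    def to_pixels(value, size):
        # Scale, then clamp to the image bounds, then truncate to int.
        return int(min(max(value * size / grid, 0), size))

    return [
        to_pixels(x1, width),
        to_pixels(y1, height),
        to_pixels(x2, width),
        to_pixels(y2, height),
    ]


# Example with an assumed 1920x1080 image (the real image size is not given here).
print(scale_bbox([10, 270, 200, 550], 1920, 1080))
```

If the boxes are still offset after this step, comparing `width` and `height` against the size of the image actually passed to the model would be a reasonable first check.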
Reproduction
Model: opengvlab/internvl3-14b
Prompt: "Outline the position of every vehicle in this image and output all the coordinates in JSON format with the structure: [{"bbox_2d": [x1, y1, x2, y2], "label": "vehicle_type"}]. Include cars, trucks, motorcycles, buses, and any other vehicles you can see."
Response:
To outline the position of every vehicle in the image, I'll identify each vehicle and provide their bounding boxes in the format [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box. Here's the JSON output:
[
{
"bbox_2d": [10, 270, 200, 550],
"label": "car"
},
{
"bbox_2d": [230, 360, 360, 460],
"label": "car"
},
{
"bbox_2d": [290, 390, 350, 440],
"label": "motorcycle"
},
{
"bbox_2d": [330, 260, 920, 830],
"label": "car"
},
]
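Note that the array in the response ends with a trailing comma before `]`, which a strict JSON parser rejects. A minimal sketch of how the boxes might be extracted from the free-text response is shown below; `parse_boxes` is a hypothetical helper, and the regular expressions assume the response contains a single array of objects and that no label contains a comma followed by a bracket.

```python
import json
import re


def parse_boxes(response):
    """Extract the JSON array of detections from a free-text model response."""
    # The prose may contain other brackets such as "[x1, y1, x2, y2]",
    # so require "[" followed by "{" to find the array of objects.
    match = re.search(r"\[\s*\{.*\}\s*,?\s*\]", response, re.DOTALL)
    if match is None:
        return []
    # Drop trailing commas before "]" or "}", which json.loads rejects.
    cleaned = re.sub(r",\s*([\]}])", r"\1", match.group(0))
    return json.loads(cleaned)


# Shortened copy of the response above, including the trailing comma.
response = """Boxes are in the format [x1, y1, x2, y2]. Here's the JSON output:
[
{
"bbox_2d": [10, 270, 200, 550],
"label": "car"
},
{
"bbox_2d": [330, 260, 920, 830],
"label": "car"
},
]"""

boxes = parse_boxes(response)
for box in boxes:
    print(box["label"], box["bbox_2d"])
```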
### Environment
```Shell
NA
```
### Error traceback
NA