Thank you for your excellent research and for sharing your work.
I am currently working on an approach where, after detecting objects captured by a fixed camera, I perform classification on the bounding boxes using CLIP. While the base model provides a reasonable level of accuracy, I am looking to fine-tune it to better adapt to the characteristics of the objects appearing in this specific camera feed.
Currently, I am applying a CNN classifier with two categories, "person" and "other." However, the "other" class contains a variety of miscellaneous elements such as shadows and structures. As a first step, I attempted to fine-tune the model by grouping all of those into a single "scenery" label.
However, perhaps because "scenery" is such a broad and ambiguous label, the fine-tuned model predicted "scenery" more often than the base model did. As a result, the precision for the "scenery" class decreased, and the recall for the "person" class also dropped.
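To make the symptom concrete, here is a minimal sketch of how per-class precision and recall could be computed when comparing the base and fine-tuned models. The function name and the tiny label lists are made up for illustration and are not my actual data.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative (invented) labels: one person crop is misread as scenery.
y_true = ["person", "person", "scenery", "scenery"]
y_pred = ["person", "scenery", "scenery", "scenery"]

scenery_pr = precision_recall(y_true, y_pred, "scenery")
person_pr = precision_recall(y_true, y_pred, "person")
```

In this toy case a single "person" crop predicted as "scenery" lowers both scenery precision and person recall, which is the same pattern I observed after fine-tuning.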
Based on this observation, I am now considering a different approach: instead of lumping everything into a single "scenery" class, I plan to assign more specific positive class labels like "shadow" and "pipe," and use these as textual labels for training.
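As a rough sketch of what I have in mind, the fine-grained text labels could be scored individually and then mapped back to the two categories I actually care about. The label mapping, the function names, and the example logits below are all hypothetical; in practice the logits would come from the image-text similarity scores of the model.

```python
import math

# Hypothetical mapping from fine-grained text labels to coarse classes.
LABEL_TO_COARSE = {
    "person": "person",
    "shadow": "other",
    "pipe": "other",
}


def coarse_probabilities(logits):
    """Softmax over fine-grained logits, then sum the mass per coarse class."""
    labels = list(logits)
    peak = max(logits[name] for name in labels)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(logits[name] - peak) for name in labels]
    total = sum(exps)
    coarse = {}
    for name, value in zip(labels, exps):
        key = LABEL_TO_COARSE[name]
        coarse[key] = coarse.get(key, 0.0) + value / total
    return coarse


def predict(logits):
    """Return the coarse class with the highest summed probability."""
    probs = coarse_probabilities(logits)
    return max(probs, key=probs.get)


# Invented similarity logits for one bounding-box crop.
example = {"person": 3.0, "shadow": 0.5, "pipe": 0.2}
probs = coarse_probabilities(example)
decision = predict(example)
```

One caveat I am aware of: summing probabilities gives "other" more total mass as more fine-grained labels are added, so taking the maximum per coarse class instead of the sum might be a safer choice.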
If you have any suggestions or alternative ideas, I would greatly appreciate your advice.
This discussion was converted from issue #1075 on May 21, 2025 16:08.