ICML 2025
Yu Zhou*, Bingxuan Li*, Mohan Tang*, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng
Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets, including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNaturalist).
Create a config.json file in your working directory with the following structure:
{
  "num_imgs_per_class": 100,
  "models": ["34b"],
  "prompts_types": [],
  "synthetic_images_path": "PATH",
  "real_images_path": "PATH",
  "val_images_path": "PATH",
  "pairs": [
    {
      "ground_truth": "class1",
      "ground_truth_full_name": "class1_full",
      "confusing_class": "class2",
      "confusing_class_full_name": "class2_full"
    }
  ]
}
(see examples/config.json for reference).
Make sure config.json is placed under the working directory before running the scripts below.
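For example, here is a minimal sketch (using only the Python standard library; the paths and class names are placeholders) that writes such a config.json into the working directory and checks that each pair has the required keys:

import json
from pathlib import Path

# Minimal sketch (not part of the repo): write a config.json with one class pair.
# All paths and class names below are placeholders.
config = {
    "num_imgs_per_class": 100,
    "models": ["34b"],
    "prompts_types": [],
    "synthetic_images_path": "/path/to/synthetic_images",
    "real_images_path": "/path/to/real_images",
    "val_images_path": "/path/to/val_images",
    "pairs": [
        {
            "ground_truth": "class1",
            "ground_truth_full_name": "class1_full",
            "confusing_class": "class2",
            "confusing_class_full_name": "class2_full",
        }
    ],
}

working_dir = Path("YOUR_WORKING_PATH")
working_dir.mkdir(parents=True, exist_ok=True)
with open(working_dir / "config.json", "w") as f:
    json.dump(config, f, indent=2)

# Quick sanity check: every pair needs the four keys used by the scripts.
required = {"ground_truth", "ground_truth_full_name", "confusing_class", "confusing_class_full_name"}
for pair in config["pairs"]:
    missing = required - pair.keys()
    assert not missing, f"pair is missing keys: {missing}"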
Run the generation script with the following arguments:
python generation.py --prompts contrastive_visual_text --num_images 50 --num_test 5 --working_dir YOUR_WORKING_PATH
--prompts: Type of prompts to generate (default: "contrastive_visual_text")
--num_images: Number of images to generate per class (default: 50)
--num_test: Number of test images for attribute evaluation (default: 5)
--working_dir: Working directory path
The extracted attributes and generated images are saved under the following structure in the working directory:
working_dir/
├── synthetic_improved/
│ └── 34b/
│ └── contrastive_visual_text/
│ └── class_name/
│ ├── attributes.json
│ ├── attributes_contrastiveness_statistics.json
│ └── images/
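As an illustration (not one of the released scripts), the output of generation.py can be inspected by counting images per class under this layout; the snippet below assumes the 34b model and the contrastive_visual_text prompt type shown above, and does not assume any particular schema for the JSON files:

import json
from pathlib import Path

# Count generated images per class for one model / prompt-type setting.
root = Path("YOUR_WORKING_PATH") / "synthetic_improved" / "34b" / "contrastive_visual_text"
for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    num_images = len(list((class_dir / "images").glob("*")))
    print(f"{class_dir.name}: {num_images} generated images")
    stats_path = class_dir / "attributes_contrastiveness_statistics.json"
    if stats_path.exists():
        with open(stats_path) as f:
            stats = json.load(f)  # schema is whatever generation.py writes
        print(f"  {len(stats)} entries in the contrastiveness statistics")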
Run the naive augmentation baseline with the following arguments:
python naive_augmentation.py --num_images 50 \
--output_path OUTPUT_PATH \
--working_dir WORKING_DIR \
--prompts flip,crop \
--SUN False
--num_images: Number of augmented images to generate (default: 50)
--output_path: Output directory path
--working_dir: Working directory containing source images
--prompts: Augmentation types to apply (default: "flip,crop")
--SUN: Flag for SUN dataset directory structure
The augmented images are saved under the following structure:
output_path/
├── flip/
│ └── class_name/
│ └── augmented_images
├── crop/
│ └── class_name/
│ └── augmented_images
└── failed_pairs.json
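For reference, the default flip and crop augmentations correspond to standard image transforms. The sketch below illustrates what these two types do using Pillow; it is not the naive_augmentation.py implementation, and the paths, class name, and 80% crop ratio are assumptions:

from pathlib import Path
from PIL import Image

src = Path("WORKING_DIR/class_name/images")   # source images for one class (placeholder path)
out = Path("OUTPUT_PATH")
for sub in ("flip", "crop"):
    (out / sub / "class_name").mkdir(parents=True, exist_ok=True)

for img_path in src.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    # "flip": mirror the image horizontally
    img.transpose(Image.FLIP_LEFT_RIGHT).save(out / "flip" / "class_name" / img_path.name)
    # "crop": keep the central 80% of the image, then resize back to the original size
    w, h = img.size
    box = (int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h))
    img.crop(box).resize((w, h)).save(out / "crop" / "class_name" / img_path.name)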
Run the verification script with the required parameters:
python verification.py --data_config path/to/config.json --output_path path/to/output.json --attributes_prompts text contrastive_text
--data_config: Path to the JSON file containing dataset configuration.
--output_path: Path to store the results in JSON format.
--attributes_prompts: List of attribute extraction methods (e.g., text, contrastive_text).
The data_config.json file should contain:
{
  "pairs": [
    {
      "ground_truth": "class1",
      "ground_truth_full_name": "Full Name 1",
      "confusing_class": "class2",
      "confusing_class_full_name": "Full Name 2"
    }
  ],
  "synthetic_images_path": "path/to/synthetic/images",
  "real_images_path": "path/to/real/images"
}
The output JSON file will contain verification scores and extracted attributes in the following format:
{
  "text": {
    "class1": [
      {
        "img": "image1.jpg",
        "target_attributes": ["attribute1", "attribute2"],
        "result": { "attribute1": 1, "attribute2": 0 },
        "score": 0.5
      }
    ]
  }
}
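For instance, a short sketch (assuming only the output format shown above) that averages the per-image verification scores for each attribute-extraction method and class:

import json

# Summarize verification results: mean score per (method, class).
with open("path/to/output.json") as f:
    results = json.load(f)

for method, classes in results.items():            # e.g. "text", "contrastive_text"
    for class_name, records in classes.items():
        scores = [r["score"] for r in records]
        mean_score = sum(scores) / len(scores) if scores else 0.0
        print(f"{method} / {class_name}: mean score {mean_score:.3f} over {len(records)} images")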
Run the fine-tuning script using the following command:
python finetune.py [arguments]
--data_path: Path to your dataset (default: "YOUR_DATA_PATH")
--working_dir: Working directory for outputs (default: "YOUR_WORKING_DIR")
--port: Ports for distributed training (default: "4,5,6")
--base_model: Base model to fine-tune (default: "liuhaotian/llava-v1.6-34b")
--number_of_images: Number of images per training instance (default: 5)
--number_of_epochs: Number of training epochs (default: 30)
--synthetic_imgs_num: Number of synthetic images (default: 5)
--real_imgs_num: Number of real images (default: 5)
--prompt_types: Comma-separated list of prompt types (default: "contrastive_visual,visual,text")
--seed: Random seed for reproducibility (default: 0)
Example usage:
python finetune.py \
--data_path /path/to/data \
--working_dir /path/to/working/dir \
--number_of_images 10 \
--number_of_epochs 50 \
--synthetic_imgs_num 8 \
--real_imgs_num 8 \
--prompt_types contrastive_visual,visual
The script creates the following directory structure for each experiment:
working_dir/
├── finetune_images/
│ └── {model}_{prompt_type}_{synthetic_num}_{real_num}_{seed}/
│ └── train_data.json
├── ckpts/
│ └── {model}_{prompt_type}_{synthetic_num}_{real_num}_{seed}/
└── logs/
└── {model}_{prompt_type}_{synthetic_num}_{real_num}_{seed}/
└── {num_images}_{num_epochs}.log
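As a hedged example, the experiment identifier in this layout can be reconstructed from the fine-tuning arguments to locate a run's training data, checkpoints, and log; the snippet below only follows the naming pattern in the tree above and is not an exported helper:

from pathlib import Path

# Values taken from the finetune.py defaults above; adjust to match your run.
model = "34b"
prompt_type = "contrastive_visual"
synthetic_num, real_num, seed = 5, 5, 0
num_images, num_epochs = 5, 30

tag = f"{model}_{prompt_type}_{synthetic_num}_{real_num}_{seed}"
working_dir = Path("/path/to/working/dir")

train_data = working_dir / "finetune_images" / tag / "train_data.json"
ckpt_dir = working_dir / "ckpts" / tag
log_file = working_dir / "logs" / tag / f"{num_images}_{num_epochs}.log"
print(train_data, ckpt_dir, log_file, sep="\n")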
Run the evaluation script with the following arguments:
python evaluation.py \
--data_path /path/to/data \
--workspace /path/to/workspace \
--feature_extraction_approachs contrastive_visual \
--model 34b \
--num_val 20 \
--num_epochs 30 \
--batch_size 5 \
--zeroshot True
Required config.json structure:
{
  "pairs": [
    {
      "ground_truth": "class1",
      "ground_truth_full_name": "class1_full",
      "confusing_class": "class2",
      "confusing_class_full_name": "class2_full"
    }
  ],
  "val_images_path": "path/to/val",
  "synthetic_images_path": "path/to/synthetic"
}
Results are saved as JSON files containing confusion matrices:
{
  "class1_VS_class2": {
    "class1": { "class1": 0.8, "class2": 0.2 },
    "class2": { "class1": 0.1, "class2": 0.9 }
  }
}
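For example, a small sketch (assuming the format above, where each row is a ground-truth class and each column a predicted class) that reports the mean per-class accuracy for every pair:

import json

# Summarize the saved confusion matrices into one accuracy number per pair.
with open("path/to/results.json") as f:
    confusion = json.load(f)

for pair_name, matrix in confusion.items():        # e.g. "class1_VS_class2"
    # Mean of the diagonal: fraction of each class predicted as itself.
    diag = [row[cls] for cls, row in matrix.items()]
    acc = sum(diag) / len(diag)
    print(f"{pair_name}: mean per-class accuracy {acc:.2f}")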
To download the NovelSpecies Dataset and our subsets of the iNaturalist and SUN datasets, please go to our Huggingface Dataset. The dataset structure is:
data/
├── train/
│   ├── iNaturalist/
│   │   └── {class_name}/
│   │       └── images/
│   ├── NovelSpecies/
│   │   └── {class_name}/
│   │       └── images/
│   └── SUN/
│       └── {class_name_first_letter}/
│           └── {class_name}/
│               └── images/
└── val/
    ├── iNaturalist/
    │   └── {class_name}/
    │       └── images/
    ├── NovelSpecies/
    │   └── {class_name}/
    │       └── images/
    └── SUN/
        └── {class_name_first_letter}/
            └── {class_name}/
                └── images/
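As an illustration of this layout (assuming the data has been downloaded into data/), note that SUN nests class folders one level deeper under a first-letter directory:

from pathlib import Path

# List class folders for each dataset in the train split; SUN has an extra first-letter level.
root = Path("data/train")
for dataset in ("iNaturalist", "NovelSpecies", "SUN"):
    base = root / dataset
    if dataset == "SUN":
        # data/train/SUN/{first_letter}/{class_name}/images/
        class_dirs = [d for letter in sorted(base.iterdir()) if letter.is_dir()
                      for d in sorted(letter.iterdir()) if d.is_dir()]
    else:
        # data/train/{dataset}/{class_name}/images/
        class_dirs = [d for d in sorted(base.iterdir()) if d.is_dir()]
    total_images = sum(len(list((d / "images").glob("*"))) for d in class_dirs)
    print(f"{dataset}: {len(class_dirs)} classes, {total_images} training images")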
If you find our work helpful, please kindly cite it :)
@misc{zhou2025contrastivevisualdataaugmentation,
  title={Contrastive Visual Data Augmentation},
  author={Yu Zhou and Bingxuan Li and Mohan Tang and Xiaomeng Jin and Te-Lin Wu and Kuan-Hao Huang and Heng Ji and Kai-Wei Chang and Nanyun Peng},
  year={2025},
  eprint={2502.17709},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.17709},
}