Qihang Ma · Shengyu Li · Jie Tang · Dingkang Yang · Shaodong Chen · Yingyi Zhang · Chao Feng+ · Jiao Ran
ByteDance Douyin Content Group
+corresponding authors
- 2025.10.09 arXiv preprint released.
- 2025.10.09 Code released.
- 2025.08.21 🎉 DynamicCoT is accepted by EMNLP 2025.
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple input modalities to produce a set of conclusive phrases. Traditional multi-modal approaches have significant limitations in the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot inference and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities at inference time. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code and datasets will be made publicly available upon acceptance of the paper.
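The benchmark concern raised in the abstract (train/test overlap inflating scores) can be checked with a few lines of Python. The sketch below is only one possible way to quantify it, not the measurement used in the paper: it assumes each sample is a plain list of keyphrase strings, and the function names and toy data are hypothetical.

```python
from typing import Iterable, List, Set


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(phrase.lower().split())


def unique_phrases(samples: Iterable[List[str]]) -> Set[str]:
    """Collect the set of normalized, non-empty keyphrases across all samples."""
    return {normalize(p) for phrases in samples for p in phrases if p.strip()}


def seen_phrase_ratio(train: Iterable[List[str]], test: Iterable[List[str]]) -> float:
    """Fraction of unique test keyphrases that also occur in the training split."""
    train_set = unique_phrases(train)
    test_set = unique_phrases(test)
    if not test_set:
        return 0.0
    return len(test_set & train_set) / len(test_set)


if __name__ == "__main__":
    # Toy data invented for illustration; real splits come from the prepared dataset.
    train_split = [["NBA", "basketball"], ["travel", "Paris"]]
    test_split = [["nba", "playoffs"], ["paris", "food"]]
    print(f"seen ratio: {seen_phrase_ratio(train_split, test_split):.2f}")
```

A ratio close to 1.0 would suggest that most test keyphrases were already seen during training, so the split says little about the unseen scenario.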
step 1. Prepare environment.
pip3 install -e ".[torch,metrics,deepspeed]"
# we use transformers==4.52.1 for InternVL3 and transformers==4.49.0 for other models
pip3 install transformers==4.49.0  # use transformers==4.52.1 instead for InternVL3
pip3 install vllm==0.7.3
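Because the required transformers version depends on the model family, it can help to verify the environment before launching a run. The following is a minimal sketch, assuming the model family can be detected by the substring "internvl3" in the model name or path (that matching rule and all function names are assumptions, not part of this repo).

```python
from importlib import metadata
from typing import Optional

# Version pins taken from the install comment in step 1.
INTERNVL3_PIN = "4.52.1"
DEFAULT_PIN = "4.49.0"


def expected_transformers_version(model_name: str) -> str:
    """Return the pinned version for a model (substring match is an assumption)."""
    return INTERNVL3_PIN if "internvl3" in model_name.lower() else DEFAULT_PIN


def installed_version(package: str = "transformers") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(model_name: str) -> bool:
    """Report whether the installed transformers matches the pin for this model."""
    want = expected_transformers_version(model_name)
    have = installed_version()
    if have != want:
        print(f"transformers {have} is installed, but {want} is expected for {model_name}")
        return False
    return True
```

Calling `check_environment("/path/to/model")` before training or evaluation gives an early warning instead of a failure inside the run.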
step 2. Prepare dataset. You need to download raw images from CMKP.
python3 data/preprocess_datasets.py /path/to/images data/
step 3. Train the model.
bash train_full_sft.sh {/path/to/model} {/path/to/output} --template {template} --run_name {wandb_run_name} --dataset {train_dataset} --per_device_train_batch_size 1 --num_train_epochs {epoch}
step 4. Evaluate the model.
# for InternVL3; the source_txt files are in data/mmkp_source/
bash eval_internvl.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
# for other models
bash eval_full_sft.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
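When evaluating several checkpoints, choosing between the two evaluation scripts can be automated. The helper below is a hypothetical sketch that only assembles the command shown above; it assumes InternVL3 checkpoints can be recognized by "internvl3" in their path, and it does not execute anything itself.

```python
import shlex
from typing import List


def build_eval_command(model_path: str, source_txt: str, template: str, dataset: str) -> List[str]:
    """Assemble the evaluation command, picking the script by model family."""
    # Assumption: the model family is recognizable from the checkpoint path.
    is_internvl3 = "internvl3" in model_path.lower()
    script = "eval_internvl.sh" if is_internvl3 else "eval_full_sft.sh"
    return [
        "bash", script, model_path, source_txt,
        "--template", template,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Placeholder arguments; substitute real paths, template, and dataset names.
    cmd = build_eval_command("/path/to/InternVL3", "/path/to/source_txt", "TEMPLATE", "TEST_DATASET")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the placeholders are replaced.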
This project would not be possible without several great open-source codebases. We list some notable examples below.
If this work is helpful for your research, please consider citing the following BibTeX entry.
@article{ma2025dynamiccot,
title={Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models},
author={Ma, Qihang and Li, Shengyu and Tang, Jie and Yang, Dingkang and Chen, Shaodong and Zhang, Yingyi and Feng, Chao and Ran, Jiao},
journal={},
year={2025}
}