GitHub - bytedance/DynamicCoT: 🔥 [EMNLP 2025] Official open-source repo for Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

If you like DynamicCoT, please give us a star ⭐ on GitHub for the latest update.

Qihang Ma · Shengyu Li · Jie Tang · Dingkang Yang · Shaodong Chen · Yingyi Zhang · Chao Feng⁺ · Jiao Ran

ByteDance Douyin Content Group

⁺Corresponding author

🚀 News

2025.10.09 arXiv preprint released.
2025.10.09 Code released.
2025.08.21 🎉 DynamicCoT is accepted by EMNLP 2025.

📝 Introduction

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have significant limitations in the challenging absence and unseen scenarios. We also identify a shortcoming in existing benchmarks: significant overlap between the training and test sets leads them to overestimate model capability. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we apply two widely used strategies, zero-shot inference and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which uses high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to draw on its reasoning capabilities flexibly at inference time. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate their effectiveness. The code and datasets will be made publicly available upon acceptance of the paper.
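To make the idea of injecting CoT data during training more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Sample` fields, the `<think>` tag format, and the `needs_reasoning` predicate are all assumptions chosen for illustration. The sketch only shows the general pattern of keeping the teacher's rationale in the training target for some samples and dropping it for others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One MMKP training example (field names are hypothetical)."""
    text: str
    image_path: str
    keyphrases: List[str]
    rationale: str  # CoT reasoning text produced by a teacher model


def build_target(sample: Sample, use_cot: bool) -> str:
    """Format the supervision string, with or without the rationale.

    The <think>...</think> wrapper is an assumed format, not one
    confirmed by the repository.
    """
    answer = "; ".join(sample.keyphrases)
    if use_cot:
        return f"<think>{sample.rationale}</think>\n{answer}"
    return answer


def inject_cot(
    samples: List[Sample],
    needs_reasoning: Callable[[Sample], bool],
) -> List[Dict[str, object]]:
    """Build SFT records, adding the rationale only where the
    caller-supplied predicate says reasoning is worthwhile."""
    records = []
    for sample in samples:
        use_cot = needs_reasoning(sample)
        records.append(
            {
                "text": sample.text,
                "image": sample.image_path,
                "target": build_target(sample, use_cot),
                "with_cot": use_cot,
            }
        )
    return records
```

The selection rule is left to the caller on purpose: `needs_reasoning` could be a difficulty heuristic, a check against a baseline model's errors, or a random draw, and the source does not say which criterion the paper uses.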

🔧 Get Started

Installation and Data Preparation

Step 1. Prepare the environment.

pip3 install -e ".[torch,metrics,deepspeed]"
# Pin transformers to the version that matches your model:
#   transformers==4.52.1 for InternVL3, transformers==4.49.0 for all other models
pip3 install transformers==4.49.0
pip3 install vllm==0.7.3
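Because the required `transformers` version depends on the model family, a quick check before training or evaluation can catch a mismatched environment. The sketch below uses only the two version pins stated above; the helper names and the rule of detecting InternVL3 by a substring of the model name are assumptions.

```python
from importlib import metadata

# Version pins taken from the installation notes above.
PINNED = {"internvl3": "4.52.1", "default": "4.49.0"}


def expected_transformers_version(model_name: str) -> str:
    """Return the pinned transformers version for a model name.

    Matching on the substring "internvl3" is an assumption about
    how model names or paths are written.
    """
    if "internvl3" in model_name.lower():
        return PINNED["internvl3"]
    return PINNED["default"]


def check_transformers(model_name: str) -> bool:
    """Compare the installed transformers version with the pin."""
    expected = expected_transformers_version(model_name)
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print(f"transformers is not installed; expected {expected}")
        return False
    if installed != expected:
        print(f"transformers {installed} found; expected {expected}")
        return False
    return True
```

Calling `check_transformers("InternVL3-8B")` at the top of a launch script returns `False` and prints a hint when the environment does not match the pin; the model name here is only an example.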

Step 2. Prepare the dataset. Download the raw images from CMKP first.

python3 data/preprocess_datasets.py /path/to/images data/

Train model

bash train_full_sft.sh {/path/to/model} {/path/to/output} --template {template} --run_name {wandb_run_name} --dataset {train_dataset} --per_device_train_batch_size 1 --num_train_epochs {epoch}

Test model

# for InternVL3, source_txt in data/mmkp_source/
bash eval_internvl.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
# for other models
bash eval_full_sft.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
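If evaluation is launched from Python, the choice between the two scripts can be wrapped in a small helper. This is a sketch under stated assumptions: the function name and the `is_internvl3` flag are hypothetical, and the arguments simply mirror the placeholders in the commands above.

```python
import shlex
from typing import List


def build_eval_command(
    model_path: str,
    source_txt: str,
    template: str,
    dataset: str,
    is_internvl3: bool = False,
) -> List[str]:
    """Assemble the evaluation command as an argument list.

    InternVL3 models use eval_internvl.sh; all other models use
    eval_full_sft.sh, as described in the commands above.
    """
    script = "eval_internvl.sh" if is_internvl3 else "eval_full_sft.sh"
    return [
        "bash", script, model_path, source_txt,
        "--template", template,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute real paths and names.
    cmd = build_eval_command(
        "/path/to/model", "/path/to/source_txt",
        "{template}", "{test_dataset}", is_internvl3=True,
    )
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.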

💡 Method

💡 Case Study

🙏 Acknowledgement

This project would not be possible without several great open-source codebases. We list some notable examples below.

📃 Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{ma2025dynamiccot,
  title={Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models},
  author={Ma, Qihang and Li, Shengyu and Tang, Jie and Yang, Dingkang and Chen, Shaodong and Zhang, Yingyi and Feng, Chao and Ran, Jiao},
  journal={},
  year={2025}
}

