Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
AVLM (Audio-Visual Language Model) is a research project on modality fusion: it integrates visual and speech representations into a pre-trained SpeechLM for expressive speech generation.
```
AVLM/
├── scripts/                     # main scripts
│   ├── avlm/                    # folder for AVLM pretraining (with different fusion strategies)
│   ├── avlm_avsr/               # folder for fine-tuning AVLM on the AVSR task
│   ├── avlm_emo/                # folder for fine-tuning AVLM for expressive speech generation
│   └── global.sh                # config script for paths
└── src/                         # core source code
    ├── data_utils/              # data loading and preprocessing
    ├── models/                  # customized SpiritLM model
    ├── exp/                     # SpiritLM source code
    ├── preprocess/              # video data preprocessing
    └── task/                    # Lightning trainer files
        ├── avlm_iemocap_tune.py # fine-tune AVLM for expressive dialogue generation
        └── train_avlm.py        # pre-train AVLM with different fusion strategies or fine-tune AVLM for the AVSR task
```
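A typical workflow is to set the paths in `scripts/global.sh` and then launch the stage-specific scripts under `scripts/`. The sketch below is illustrative only: the launch script names are hypothetical placeholders, not the repository's actual file names.

```bash
# Illustrative workflow sketch -- the script names below are hypothetical;
# see scripts/global.sh and the scripts/ subfolders for the actual entry points.

# 1. Edit scripts/global.sh so it points at your local data, checkpoint,
#    and output paths, then source it.
source scripts/global.sh

# 2. Pre-train AVLM with a chosen fusion strategy (hypothetical script name).
bash scripts/avlm/pretrain_fusion.sh

# 3. Fine-tune the pretrained AVLM (hypothetical script names).
bash scripts/avlm_avsr/finetune_avsr.sh   # audio-visual speech recognition
bash scripts/avlm_emo/finetune_emo.sh     # expressive speech generation
```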
If you find this work useful, please cite:

```bibtex
@misc{tan2025seeingbelievingemotionawareaudiovisual,
      title={Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation},
      author={Weiting Tan and Jiachen Lian and Hirofumi Inaguma and Paden Tomasello and Philipp Koehn and Xutai Ma},
      year={2025},
      eprint={2508.16188},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.16188},
}
```