
Small Models, Big Impact

This repository contains the code, data, and models (soon!) associated with the paper "Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages", which focuses on enhancing small multilingual language models (mLMs) for low-resource languages (LRLs). LRLs often face significant challenges in natural language processing (NLP) due to data scarcity.

We explore parameter-efficient adaptation of small mLMs for LRLs, comparing adapters, continued pre-training, and large-scale LM prompting. Our findings show that:

  • (1) limited adaptation data (≤1 GB text or a few MB of KG data) provides significant gains, with Sequential Bottleneck excelling in MLM and Invertible Bottleneck in downstream tasks;
  • (2) smaller mLMs like XLM-R outperform massive LLMs (e.g., GPT-3.5, LLaMA-3) for LRLs;
  • (3) pre-training data size strongly influences performance, with adaptation yielding diminishing returns for languages that are already well represented in a model's pre-training data.

Overview

This research develops and experiments with language adapters trained on both structured and unstructured data sources:

  • Structured Knowledge: ConceptNet, a multilingual knowledge graph providing relational knowledge across 304 languages. We convert ConceptNet triples into natural language sentences using predefined predicates (a conversion sketch follows this list).
  • Unstructured Data: GlotCC-V1, a large-scale multilingual corpus derived from CommonCrawl, emphasizing LRLs and providing high-quality text in 1,000 languages.
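
A minimal sketch of the triple-to-sentence conversion, with hypothetical predicate templates (the paper's actual templates may differ):

# Illustrative predicate templates; placeholders, not the paper's exact wording.
PREDICATE_TEMPLATES = {
    "IsA": "{head} is a kind of {tail}.",
    "PartOf": "{head} is part of {tail}.",
    "UsedFor": "{head} is used for {tail}.",
    "AtLocation": "{head} can be found at {tail}.",
}

def verbalize(head, relation, tail):
    """Turn one ConceptNet triple into a natural-language sentence."""
    template = PREDICATE_TEMPLATES.get(relation)
    return template.format(head=head, tail=tail) if template else None

print(verbalize("guitar", "UsedFor", "playing music"))
# -> "guitar is used for playing music."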

We systematically investigate parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation.
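
The three architectures can be instantiated with the AdapterHub adapters library, as in the minimal sketch below; the base model, adapter names, and hyperparameters (reduction factor, LoRA rank) are illustrative assumptions, not necessarily the paper's settings.

import adapters
from adapters import LoRAConfig, SeqBnConfig, SeqBnInvConfig
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
adapters.init(model)  # add adapter support to a vanilla Transformers model

configs = {
    "seq_bn": SeqBnConfig(reduction_factor=16),   # Sequential Bottleneck
    "seq_bn_inv": SeqBnInvConfig(),               # Invertible Bottleneck
    "lora": LoRAConfig(r=8, alpha=16),            # Low-Rank Adaptation (LoRA)
}

for name, config in configs.items():
    model.add_adapter(f"lang_{name}", config=config)

# Freeze the base model and train only the chosen adapter's parameters.
model.train_adapter("lang_seq_bn")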

Experimental Setup

  • Languages: 30 diverse LRLs were selected, including Thai, Romanian, Bulgarian, and others (see Table 5 in the paper's Appendix B for the full list and details).
  • Data Preprocessing: ConceptNet triples were converted into natural language sentences. GlotCC data was cleaned and capped at 1 GB per language.
  • Training Details:
    • Language adapters were trained on mBERT and XLM-R using MLM with GlotCC and ConceptNet data.
    • For LLaMA-3-8B, GlotCC data was used with the Seq_bn_inv architecture and CLM objective for a subset of 5 languages.
    • Training ran for up to 100,000 steps on GlotCC and 25,000 steps on ConceptNet, with a batch size of 16 and a learning rate of 1e-4 (a training sketch follows this list).
  • Evaluation Tasks:
    • Masked Language Modeling (MLM): Evaluated on the FLORES-200 devtest set (an evaluation sketch follows the training sketch below).
    • Topic Classification (TC): Evaluated using the 7-class SIB-200 dataset.
    • Sentiment Analysis (SA): Evaluated using binary-class datasets from multiple sources.
    • Named Entity Recognition (NER): Evaluated using the WikiANN dataset.
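
As a rough illustration of the training setup above (MLM objective, batch size 16, learning rate 1e-4), the sketch below continues from the adapter setup shown earlier and uses the AdapterHub adapters library with Hugging Face transformers and datasets; the corpus file name, sequence length, and masking probability are assumptions rather than the paper's exact configuration.

from adapters import AdapterTrainer
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Plain-text adaptation corpus (GlotCC or verbalized ConceptNet), one
# sentence per line; the file name is a placeholder.
dataset = load_dataset("text", data_files={"train": "glotcc_mt.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

trainer = AdapterTrainer(
    model=model,  # model with an active language adapter (see the setup sketch)
    args=TrainingArguments(
        output_dir="adapter_out",
        per_device_train_batch_size=16,
        learning_rate=1e-4,
        max_steps=100_000,  # 25,000 for the ConceptNet adapters
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
model.save_adapter("adapter_out", "lang_seq_bn")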

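For the MLM evaluation, one common way to score a masked LM on the FLORES-200 devtest sentences is pseudo-perplexity: mask each token in turn and measure how well the model restores it. The sketch below illustrates this; the Hugging Face dataset ID, language config, and sentence count are assumptions, not the paper's exact evaluation script.

import math
import torch
from datasets import load_dataset

def pseudo_perplexity(model, tokenizer, sentences, device="cpu"):
    """Mask one token at a time and accumulate its negative log-likelihood."""
    model.to(device).eval()
    nll, count = 0.0, 0
    for sent in sentences:
        input_ids = tokenizer(sent, return_tensors="pt",
                              truncation=True)["input_ids"][0].to(device)
        for i in range(1, len(input_ids) - 1):  # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            nll -= torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
            count += 1
    return math.exp(nll / count)

# Reuse the adapter-equipped model and tokenizer from the sketches above;
# the dataset ID and language config (Maltese) are assumed for illustration.
flores = load_dataset("facebook/flores", "mlt_Latn", split="devtest")
print(pseudo_perplexity(model, tokenizer, flores["sentence"][:100]))
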
Citation

@misc{gurgurov2025smallmodelsbigimpact,
      title={Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages}, 
      author={Daniil Gurgurov and Ivan Vykopal and Josef van Genabith and Simon Ostermann},
      year={2025},
      eprint={2502.10140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10140}, 
}

Acknowledgements

  • This work was supported by DisAI.
  • We thank the open-source community for providing valuable resources and tools.

Contact

For questions or issues, please contact [email protected].
