Langsfer is a library for language transfer methods and algorithms.
Language transfer refers to a few related techniques:
- initializing a Large Language Model (LLM) in a new, typically low-resource, target language (e.g. German, Arabic) from another LLM trained in a high-resource source language (e.g. English),
- extending the vocabulary of an LLM by adding new tokens and initializing their embeddings in a manner that allows them to be used with little to no extra training,
- specializing the vocabulary of a multilingual LLM to one of its supported languages.
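At their core, these techniques share one idea: relate each new target token to the existing source tokens (e.g. via an auxiliary embedding space), then initialize its embedding as a similarity-weighted combination of source embeddings. The sketch below illustrates that idea with toy NumPy data; all names and values are illustrative and are not part of the langsfer API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source embedding matrix: 5 source tokens, embedding dimension 4.
source_embeddings = rng.normal(size=(5, 4))

# Hypothetical similarity of one new target token to each source token,
# e.g. cosine similarities computed in an auxiliary embedding space.
similarities = np.array([0.9, 0.1, 0.0, 0.3, 0.05])

# Softmax-normalize the similarities into convex weights.
weights = np.exp(similarities) / np.exp(similarities).sum()

# Initialize the new token's embedding as the weighted average
# of the source embeddings.
target_embedding = weights @ source_embeddings

assert target_embedding.shape == (4,)
```

The methods below differ in how they compute those similarities and which tokens they consider, but this weighted-average initialization is the common backbone.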
The library currently implements the following methods:
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. Minixhofer, Benjamin, Fabian Paischer, and Navid Rekabsaz. arXiv preprint arXiv:2112.06598 (2021).
- CLP-Transfer: Efficient language model training through cross-lingual and progressive transfer learning. Ostendorff, Malte, and Georg Rehm. arXiv preprint arXiv:2301.09626 (2023).
- FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language. Dobler, Konstantin, and Gerard de Melo. arXiv preprint arXiv:2305.14481 (2023).
To install the latest stable version from PyPI use:
pip install langsfer
To install the latest development version from TestPyPI use:
pip install -i https://test.pypi.org/simple/ langsfer
To install the latest development version from the repository use:
git clone [email protected]:AnesBenmerzoug/langsfer.git
cd langsfer
pip install .
The following notebooks serve as tutorials for users of the package:
The package provides high-level interfaces for instantiating each of the methods, without requiring knowledge of the package's internals.
For example, for the WECHSEL method, you would use:
from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file
from transformers import AutoModel, AutoTokenizer
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")
source_model = AutoModel.from_pretrained("roberta-base")
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()
source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")
bilingual_dictionary_file = download_file(
"https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
"german.txt",
)
embedding_initializer = wechsel(
source_tokenizer=source_tokenizer,
source_embeddings_matrix=source_embeddings_matrix,
target_tokenizer=target_tokenizer,
target_auxiliary_embeddings=target_auxiliary_embeddings,
source_auxiliary_embeddings=source_auxiliary_embeddings,
bilingual_dictionary_file=bilingual_dictionary_file,
)
To initialize the target embeddings you would then use:
target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)
The result is an object of type TransformersEmbeddings that contains the initialized embeddings in its embeddings_matrix field and the target tokenizer in its tokenizer field.
We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:
import torch
from transformers import AutoModel
target_model = AutoModel.from_pretrained("roberta-base")
# Resize the embedding layer to match the target tokenizer's vocabulary size
target_model.resize_token_embeddings(len(target_tokenizer))
# Replace the source embeddings matrix with the newly initialized one
target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddings_matrix.embeddings_matrix)
# Save the new model
target_model.save_pretrained("path/to/target_model")
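The resize-and-replace pattern above can be checked on a toy embedding layer without downloading a model. The sketch below uses arbitrary sizes (5-token source vocabulary, 8-token target vocabulary, dimension 4) chosen only for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model's input embedding layer:
# 5-token source vocabulary, embedding dimension 4.
embedding = nn.Embedding(5, 4)

# Suppose the target tokenizer has 8 tokens and we have an
# initialized (8, 4) matrix, e.g. produced by langsfer.
target_matrix = torch.randn(8, 4)

# "Resize" by creating a layer with the new vocabulary size,
# then copy the initialized matrix into its weights.
resized = nn.Embedding(target_matrix.shape[0], target_matrix.shape[1])
resized.weight.data = target_matrix

# The layer's weights now match the initialized matrix exactly.
assert tuple(resized.weight.shape) == (8, 4)
```

Remember to also save the target tokenizer alongside the model, so that the new vocabulary and the new embeddings stay paired.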
Refer to the contributing guide for instructions on how to contribute to this repository.
The langsfer logo was created by my good friend Zakaria Taleb Hacine, a 3D artist with industry experience and a packed portfolio.
The logo contains the Latin letters A and I, an acronym for Artificial Intelligence, and the Arabic letters أ and ذ, an acronym for ذكاء اصطناعي (Artificial Intelligence in Arabic).
The fonts used are Ethnocentric Regular and Readex Pro.
This package is licensed under the LGPL-2.1 license.