Skip to content

A library for language transfer methods and algorithms.

License

Notifications You must be signed in to change notification settings

AnesBenmerzoug/langsfer

Repository files navigation

Langsfer Logo

Langsfer, a library for language transfer methods and algorithms.

CI TestPyPI Version License

Language transfer refers to a few related things:

  • initializing a Large Language Model (LLM) in a new, typically low-resource, target language (e.g. German, Arabic) from another LLM trained in high-resource source language (e.g. English),
  • extending the vocabulary of an LLM by adding new tokens and initializing their embeddings in a manner that allows them to be used with little to no extra training,
  • specializing the vocabulary of a multilingual LLM to one of its supported languages.

The library currently implements the following methods:

Quick Start

Installation

To install the latest stable version from PyPI use:

pip install langsfer

To install the latest development version from TestPyPI use:

pip install -i https://test.pypi.org/simple/ langsfer

To install the latest development version from the repository use:

git clone [email protected]:AnesBenmerzoug/langsfer.git
cd langsfer
pip install .

Tutorials

The following notebooks serve as tutorials for users of the package:

Example

The package provide high-level interfaces to instantiate each of the methods, without worrying too much about the package's internals.

For example, for the WECHSEL method, you would use:

from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file
from transformers import AutoTokenizer

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")

bilingual_dictionary_file = download_file(
    "https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
    "german.txt",
)

embedding_initializer = wechsel(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    bilingual_dictionary_file=bilingual_dictionary_file,
)

To initialize the target embeddings you would then use:

target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)

The result is an object of type TransformersEmbeddings that contain the initialized embeddings in its embeddings_matrix field and the target tokenizer in its tokenizer field.

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

import torch
from transformers import AutoModel

target_model = AutoModel.from_pretrained("roberta-base")
# Resize its embedding layer
target_model.resize_token_embeddings(len(target_tokenizer))
# Replace the source embeddings matrix with the target embeddings matrix
target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddings_matrix)
# Save the new model
target_model.save_pretrained("path/to/target_model")

Contributing

Refer to the contributing guide for instructions on you can make contributions to this repository.

Logo

The langsfer logo was created by my good friend Zakaria Taleb Hacine, a 3D artist with industry experience and a packed portfolio.

The logo contains the latin alphabet letters A and I which are an acronym for Artificial Intelligence and the arabic alphabet letters أ and ذ which are an acronym for ذكاء اصطناعي, which is Artificial Intelligence in arabic.

The fonts used are Ethnocentric Regular and Readex Pro.

License

This package is license under the LGPL-2.1 license.

About

A library for language transfer methods and algorithms.

Resources

License

Stars

Watchers

Forks

Releases

No releases published