This repository contains in-house code used in training and evaluating NorBERT-1 and NorBERT-2: large-scale Transformer-based language models for Norwegian. The models were trained by the Language Technology Group at the University of Oslo. The computations were performed on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway.
For most of the training, we used BERT For TensorFlow from NVIDIA, with minor changes to their code; see the `patches_for_NVIDIA_BERT` subdirectory.
Training of the NorBERT models was conducted as part of the NorLM project. See this paper for more details:
Andrey Kutuzov, Jeremy Barnes, Erik Velldal, Lilja Øvrelid, Stephan Oepen. Large-Scale Contextualised Language Modelling for Norwegian, NoDaLiDa'21 (2021)
- Read about NorBERT
- Download NorBERT-1 from our repository or from HuggingFace
- Download NorBERT-2 from our repository or from HuggingFace
In 2023, we released NorBERT-3, a new family of language models for Norwegian. In general, we now recommend using these models (a minimal loading sketch is given below):
- NorBERT 3 xs (15M parameters)
- NorBERT 3 small (40M parameters)
- NorBERT 3 base (123M parameters)
- NorBERT 3 large (323M parameters)
NorBERT-3 is described in detail in this paper: NorBench – A Benchmark for Norwegian Language Models (Samuel et al., NoDaLiDa 2023)
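The models can be loaded through the HuggingFace `transformers` library. The sketch below shows one way to use NorBERT-3 base for masked language modelling; the Hub identifier `ltg/norbert3-base` and the need for `trust_remote_code=True` (NorBERT-3 uses a custom model architecture) are assumptions, so substitute the actual model name from the download links above if it differs.

```python
# Minimal sketch: loading a NorBERT-3 model from the HuggingFace Hub.
# The identifier "ltg/norbert3-base" is an assumed Hub name; adjust if needed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "ltg/norbert3-base"  # assumed Hub identifier for NorBERT 3 base
tokenizer = AutoTokenizer.from_pretrained(model_name)
# NorBERT-3 ships a custom architecture, so remote code may need to be trusted.
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Fill a masked token in a Norwegian sentence.
text = f"Oslo er hovedstaden i {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the masked position and decode it.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))
```

The same pattern applies to NorBERT-1 and NorBERT-2 (under their respective Hub names), which use a standard BERT architecture and therefore should not require trusting remote code.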