Danish Foundation Models is a collaborative project for training foundational Danish language models, which seeks to:
- Develop and maintain state-of-the-art language models for Danish,
- validate the models across a wide range of tasks,
- ensure good documentation that allows users to critically assess the models for their use case,
- open-source both the models and the source code.
 
Note: This repository is intended for the data processing of DFM.
└── dfm-processing/
    ├── .github
    │   └── workflows
    ├── LICENSE
    ├── README.md
    ├── config
    │   └── example.yaml
    ├── pyproject.toml
    ├── src
    │   └── dfm_processing
    ├── tests
    │   ├── cli
    │   ├── data_pipeline
    │   └── document_processing
    └── uv.lock

This project requires the following dependencies:
- Programming Language: Python
- Package Manager: uv
 
Build dfm-processing from source and install the dependencies:

- Clone the repository:
❯ git clone https://github.com/danish-foundation-models/dfm-processing
- Navigate to the project directory:
❯ cd dfm-processing
- Install the dependencies using uv:
❯ uv sync --all-extras
 
The CLI is divided into two sections, "document" and "pipeline". Each section contains specific commands for different tasks.
Process Directory:
- Purpose: Extract text data from various file types in a directory.
- Usage:
uv run dfm-processing document process-directory path_to_dir output_dir dataset_name
- Example:
uv run dfm-processing document process-directory ./data ./output my_dataset
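As a minimal warm-up, you can create a small input directory and then point `process-directory` at it. The file name and contents below are invented for illustration; the file types the extractor actually supports are determined by dfm-processing itself, not by this sketch:

```shell
# Hypothetical input layout for a first run (names are illustrative only).
mkdir -p ./data ./output
printf 'Hej verden. Dette er en testfil.\n' > ./data/sample.txt

# Then, assuming uv and the project are set up:
#   uv run dfm-processing document process-directory ./data ./output my_dataset
```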
 
Process Web Crawl:
- Purpose: Extract text data from a web crawl.
- Usage:
uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name
- Example:
uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com
 
 
Filter:
- Purpose: Run a filtering pipeline on a dataset to filter out "poor" quality data.
- Usage:
uv run dfm-processing pipeline filter yaml_config
- Example:
uv run dfm-processing pipeline filter ./config/example.yaml
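The filter pipeline is driven entirely by its YAML config. The snippet below is a hypothetical sketch of what such a config might contain; the key names here are invented for illustration, so consult `config/example.yaml` in the repository for the actual schema:

```yaml
# Hypothetical config sketch -- keys are illustrative, not the real schema.
input_dir: ./output/my_dataset
output_dir: ./filtered
filters:
  min_doc_length: 50      # drop very short documents
  max_symbol_ratio: 0.1   # drop documents dominated by non-alphabetic symbols
num_workers: 4
```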
 
Sentence Deduplication (sent_dedup):
- Purpose: Perform sentence deduplication on a given dataset.
- Usage:
uv run dfm-processing pipeline sent_dedup yaml_config
- Example:
uv run dfm-processing pipeline sent_dedup ./config/example.yaml
 
MinHash Deduplication (minhash-dedup):
- Purpose: Perform MinHash deduplication on a given dataset.
- Usage:
uv run dfm-processing pipeline minhash-dedup yaml_config
- Example:
uv run dfm-processing pipeline minhash-dedup ./config/example.yaml
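The three pipeline stages are typically run in sequence: filtering first, then sentence-level deduplication, then MinHash deduplication. That ordering is a common convention assumed here, not something the CLI enforces, and whether one config file can serve all three stages is also an assumption. A hypothetical driver script might look like:

```shell
#!/bin/sh
# Hypothetical driver: run all three pipeline stages against one config.
# Check config/example.yaml for the real schema before adapting this.
set -e
CONFIG=./config/example.yaml
for stage in filter sent_dedup minhash-dedup; do
    # Echoed rather than executed so the sketch runs stand-alone;
    # drop the echo to actually invoke the CLI.
    echo "uv run dfm-processing pipeline $stage $CONFIG"
done
```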
 
 
For more information, please check out the following links:

| Link | Description |
| --- | --- |
| 📑 About | An overview of the DFM project |
| Research Paper | A paper introducing DFM and its rationale |
| 🚀 Models | An overview of the current models available through the DFM project |
| 💽 Datasets | Datasheets for the datasets, including preprocessing, reasons for construction, and more |
DFM is a collaborative project for training and maintaining Danish language models. If you wish to contribute, don't hesitate to reach out through one of the following channels:
| Channel | Description |
| --- | --- |
| 🗣 DDSC Slack | Join the discussion in the "danish-foundation-models" channel |
| 💬 GitHub Discussions | Ask questions or start a discussion |
| 🚨 GitHub Issues | Noticed a bug in the code? Please create an issue |
You can contribute:
- Developer time, the lifeblood of any open-source project
- Pre-training datasets you wish to include in the model training
- Validation tasks, which can even be private benchmarks where you only wish to share the performance metrics
- And probably in many other ways
