DFM-PROCESSING

Effortlessly Deduplicate and Process Data at Scale

Overview

Danish Foundation Models (DFM) is a collaborative project for training foundational Danish language models. It seeks to:

Develop and maintain state-of-the-art models for Danish, which are well-validated across a wide range of tasks
Ensure good documentation, which allows users to critically assess the models for their use case
Open-source both models and source code

Note: This repository is intended for the data processing of DFM.

Project Structure

└── dfm-processing/
    ├── .github
    │   └── workflows
    ├── LICENSE
    ├── README.md
    ├── config
    │   └── example.yaml
    ├── pyproject.toml
    ├── src
    │   └── dfm_processing
    ├── tests
    │   ├── cli
    │   ├── data_pipeline
    │   └── document_processing
    └── uv.lock

Getting Started

Prerequisites

This project requires the following dependencies:

Programming Language: Python
Package Manager: uv

Installation

Build dfm-processing from source and install the dependencies:

Clone the repository:
```
❯ git clone https://github.com/danish-foundation-models/dfm-processing
```

Navigate to the project directory:
```
❯ cd dfm-processing
```
Install the dependencies:

Using uv:
```
❯ uv sync --all-extras
```

CLI Usage

The CLI is divided into two sections, "document" and "pipeline". Each section contains specific commands for different tasks.

Document Processing (`document`)

Process Directory:

Purpose: Extract text data from various file types in a directory.

Usage:
```
uv run dfm-processing document process-directory path_to_dir output_dir dataset_name
```

Example:
```
uv run dfm-processing document process-directory ./data ./output my_dataset
```

Process Web Crawl:

Purpose: Extract text data from a web crawl.

Usage:
```
uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name
```

Example:
```
uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com
```

Data Pipeline (`pipeline`)

Filter:

Purpose: Run a filtering pipeline on a dataset to filter out "poor" quality data.

Usage:
```
uv run dfm-processing pipeline filter yaml_config
```

Example:
```
uv run dfm-processing pipeline filter ./config/example.yaml
```
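The filters that actually run are determined by the YAML configuration and are not described in this README. As a rough illustration of what a quality filter of this kind can look like, the sketch below applies two hypothetical heuristics: a minimum word count and a minimum share of alphabetic characters. The function name, the heuristics, and the thresholds are assumptions made for illustration, not the dfm-processing implementation.

```python
def passes_quality_filter(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Return True if `text` passes two simple, hypothetical quality heuristics."""
    words = text.split()
    # Heuristic 1: very short texts are rejected.
    if len(words) < min_words:
        return False
    # Heuristic 2: most non-whitespace characters should be letters.
    chars = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    return alpha_ratio >= min_alpha_ratio


documents = [
    "Dette er en almindelig dansk sætning med rigtige ord.",
    "$$$ 1234 #### 5678 !!!! 0000 ----",
    "For kort.",
]
# Only the first document survives: the second has no letters, the third is too short.
kept = [doc for doc in documents if passes_quality_filter(doc)]
```

Real pipelines typically combine many such rules and tune their thresholds per language and source.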

Sentence Deduplication (sent_dedup):

Purpose: Perform sentence deduplication on a given dataset.

Usage:
```
uv run dfm-processing pipeline sent_dedup yaml_config
```

Example:
```
uv run dfm-processing pipeline sent_dedup ./config/example.yaml
```
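To give an idea of what sentence deduplication does, here is a minimal sketch using only the Python standard library. It keeps the first occurrence of each sentence across a collection of documents and drops later repeats. The naive regex sentence splitter, the normalization, and the function names are assumptions made for illustration; the `sent_dedup` command may work differently.

```python
import hashlib
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def dedup_sentences(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized sentence across all documents."""
    seen: set[str] = set()
    result = []
    for doc in documents:
        kept = []
        for sentence in split_sentences(doc):
            # Normalize case and whitespace so trivial variants count as duplicates.
            normalized = " ".join(sentence.lower().split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result


docs = [
    "Velkommen til siden. Læs mere her.",
    "Dagens nyhed er god. Læs mere her.",
]
deduped = dedup_sentences(docs)
# The repeated "Læs mere her." is removed from the second document.
```

Storing hashes rather than the sentences themselves keeps memory use predictable when the same boilerplate sentence appears across many documents.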

MinHash Deduplication (minhash-dedup):

Purpose: Perform MinHash Deduplication on a given dataset.

Usage:
```
uv run dfm-processing pipeline minhash-dedup yaml_config
```

Example:
```
uv run dfm-processing pipeline minhash-dedup ./config/example.yaml
```
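MinHash finds near-duplicate documents by comparing short signatures instead of full texts: the fraction of signature positions on which two documents agree estimates the Jaccard similarity of their shingle sets. The sketch below is a from-scratch illustration using only the standard library. The shingle size, the number of hash functions, and all function names are assumptions, not the parameters or code used by `minhash-dedup`.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the lowercased text (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in encoded
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature("den hurtige brune ræv hopper over den dovne hund")
sig_b = minhash_signature("den hurtige brune ræv hopper over den dovne hund")
sig_c = minhash_signature("vejret i morgen bliver solrigt med let vind fra vest")

same = estimate_jaccard(sig_a, sig_b)       # identical texts give identical signatures
different = estimate_jaccard(sig_a, sig_c)  # texts with no shared shingles score near zero
```

A deduplication step would then drop one document from every pair whose estimated similarity exceeds a chosen threshold. At scale, signatures are usually grouped with locality-sensitive hashing so that not every pair has to be compared.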

More information:

For more information please check out the following links:


📑 About	An overview of the DFM project
Research Paper	A paper introducing DFM and its rationale
🚀 Models	An overview of the current models available through the DFM project
💽 Datasets	Datasheets for the datasets, covering preprocessing, the reasons for their construction, and more

Wish to contribute?

DFM is a collaborative project for training and maintaining Danish language models. If you wish to contribute, don't hesitate to reach out through one of the following channels:


🗣 DDSC Slack	Join the discussion in the "danish-foundation-models"-channel
💬 GitHub Discussion	Ask questions or start a discussion
🚨 GitHub Issues	Noticed a bug in the code? Please create an issue

You can contribute in several ways:

Developer time, the lifeblood of any open-source project
Pre-training datasets you wish to include in the model training
Validation tasks, which can even be private benchmarks where you only wish to share the performance metrics
And probably in many other ways
