DFM-PROCESSING

Effortlessly Deduplicate and Process Data at Scale



Overview

Danish Foundation Models is a collaborative project for training foundational Danish language models. It seeks to:

  • Develop and maintain state-of-the-art models for Danish that are well-validated across a wide range of tasks
  • Provide good documentation, so that users can critically assess the models for their use case
  • Open-source both the models and the source code

Note: This repository is intended for the data processing of DFM.


Project Structure

└── dfm-processing/
    ├── .github
    │   └── workflows
    ├── LICENSE
    ├── README.md
    ├── config
    │   └── example.yaml
    ├── pyproject.toml
    ├── src
    │   └── dfm_processing
    ├── tests
    │   ├── cli
    │   ├── data_pipeline
    │   └── document_processing
    └── uv.lock

Getting Started

Prerequisites

This project requires the following dependencies:

  • Programming Language: Python
  • Package Manager: uv
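
Before installing, it can help to confirm that both prerequisites are available. The following is a minimal sketch, not part of this repository: `check_prerequisites` is a hypothetical helper name, and because this README does not state a minimum Python version, the sketch only reports the running version rather than enforcing one.

```python
import shutil
import sys


def check_prerequisites():
    """Report the running Python version and where `uv` is found on PATH.

    `uv_path` is None when no `uv` executable can be found.
    """
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "uv_path": shutil.which("uv"),
    }


info = check_prerequisites()
print(f"Python {info['python_version']}")
print(f"uv: {info['uv_path'] or 'not found on PATH'}")
```

If `uv` is reported as not found, install it first and re-run the check before continuing with the installation steps.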

Installation

Build dfm-processing from source and install dependencies:

  1. Clone the repository:

    ❯ git clone https://github.com/danish-foundation-models/dfm-processing
  2. Navigate to the project directory:

    ❯ cd dfm-processing
  3. Install the dependencies:

    Using uv:

    ❯ uv sync --all-extras

CLI Usage

The CLI is divided into two sections, "document" and "pipeline". Each section contains specific commands for different tasks.

Document Processing (document)

  1. Process Directory:

    • Purpose: Extract text data from various file types in a directory.
    • Usage:
      uv run dfm-processing document process-directory path_to_dir output_dir dataset_name
    • Example:
      uv run dfm-processing document process-directory ./data ./output my_dataset
  2. Process Web Crawl:

    • Purpose: Extract text data from a web crawl.
    • Usage:
      uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name
    • Example:
      uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com
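
When several directories need processing, the `process-directory` command above can be driven from a small script. The sketch below is an illustration only and not part of this repository: the helper names `build_process_directory_cmd` and `process_all` are hypothetical, and it assumes each immediate subdirectory of a root folder is one dataset named after that folder. The command itself uses only the positional arguments documented above.

```python
import subprocess
from pathlib import Path


def build_process_directory_cmd(path_to_dir, output_dir, dataset_name):
    """Return the documented `process-directory` invocation as an argument list."""
    return [
        "uv", "run", "dfm-processing", "document", "process-directory",
        str(path_to_dir), str(output_dir), dataset_name,
    ]


def process_all(root, output_dir, dry_run=True):
    """Build (and optionally run) one command per immediate subdirectory of `root`.

    Assumption: each subdirectory holds one dataset, named after the folder.
    With dry_run=True the commands are only returned, never executed.
    """
    commands = []
    for sub in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        cmd = build_process_directory_cmd(sub, output_dir, sub.name)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the CLI exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `process_all("./data", "./output")` returns the commands without running them, which makes it easy to inspect them first; pass `dry_run=False` to execute them one after another.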

Data Pipeline (pipeline)

  1. Filter:

    • Purpose: Run a filtering pipeline on a dataset to remove low-quality data.
    • Usage:
      uv run dfm-processing pipeline filter yaml_config
    • Example:
      uv run dfm-processing pipeline filter ./config/example.yaml
  2. Sentence Deduplication (sent_dedup):

    • Purpose: Perform sentence deduplication on a given dataset.
    • Usage:
      uv run dfm-processing pipeline sent_dedup yaml_config
    • Example:
      uv run dfm-processing pipeline sent_dedup ./config/example.yaml
  3. MinHash Deduplication (minhash-dedup):

    • Purpose: Perform MinHash Deduplication on a given dataset.
    • Usage:
      uv run dfm-processing pipeline minhash-dedup yaml_config
    • Example:
      uv run dfm-processing pipeline minhash-dedup ./config/example.yaml
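
For readers unfamiliar with MinHash deduplication, the general idea can be sketched in a few lines of pure Python. This is a conceptual illustration of the technique, not the implementation used by `dfm-processing`: the function names, the word 3-gram shingles, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for the example.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one of many hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64, n=3):
    """For each hash function, keep the minimum hash value over all shingles."""
    shingle_set = shingles(text, n)
    return [
        min((_hash(s, seed) for s in shingle_set), default=0)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

This sketch compares every document against every kept document, which is quadratic in the number of documents. At scale, MinHash is typically combined with locality-sensitive hashing so that only documents sharing a signature band are compared.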

More information

For more information, please check out the following links:

📑 About: An overview of the DFM project
Research Paper: A paper introducing DFM and its rationale
🚀 Models: An overview of the models currently available through the DFM project
💽 Datasets: Datasheets for the datasets, covering preprocessing, the reasons for their construction, and more

Wish to contribute?

DFM is a collaborative project for training and maintaining Danish language models. If you wish to contribute, don't hesitate to reach out through one of the following channels:

🗣 DDSC Slack Join the discussion in the "danish-foundation-models"-channel
💬 GitHub Discussion Ask questions or start a discussion
🚨 GitHub Issues Noticed a bug in the code? Please create an issue

You can contribute in several ways:

  • Developer time, the lifeblood of any open-source project
  • Pre-training datasets you wish to include in the model training
  • Validation tasks, which can even be private benchmarks where you only wish to share the performance metrics
  • And probably in many other ways


About

Toolkit for processing data in the Danish Foundation Models project.
