| license | configs | annotations_creators | language_creators | language | multilinguality | source_datasets | task_categories | task_ids | pretty_name |
|---|---|---|---|---|---|---|---|---|---|
| other | | | | | | | | | DFM Datasheets |
  
This repository contains the datasheets for DFM (Danish Foundation Models).

| | |
|---|---|
| Version | 1.0.1 (Changelog) |
| License | Not publicly available |
| Models | Currently, no publicly available model is trained on this data |
| Contact | If you have questions about this project, please create an issue here |

- Number of samples: 230.07M
 - Number of tokens (Llama 3): 430.24B
 - Average document length in tokens (min, max): 1.87K (1, 51.77M)
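
The average document length above follows directly from the two totals. As a quick sanity check, here is a minimal sketch that recomputes it from the rounded figures reported in this datasheet; the variable and function names are illustrative and not part of any DFM tooling.

```python
# Headline figures as reported in this datasheet (rounded values).
num_samples = 230.07e6  # number of documents
num_tokens = 430.24e9   # number of Llama 3 tokens


def average_tokens_per_document(tokens: float, samples: float) -> float:
    """Return the mean document length in tokens."""
    return tokens / samples


avg = average_tokens_per_document(num_tokens, num_samples)
print(f"{avg / 1e3:.2f}K tokens per document")
```

Because the inputs are rounded, the result only agrees with the reported average to the precision shown in the list.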
 
DFM Datasheets is a collection of datasheets for the datasets used for Danish Foundation Models. This repository ensures that the data is documented in line with FAIR data practices.
These datasets were collected and curated with the intention of developing language models for Danish.
This dataset includes the following languages:
- Danish
 - English
 - Swedish
 - Norwegian Bokmål
 - Norwegian Nynorsk
 
Below is a visualisation of the main languages in each of the datasets.
This Dynaword consists of data from various domains (e.g., legal, books, social media). The following table and figure give an overview of the relative distributions of these domains. For a full overview of the sources, see the source data section.
The following gives an overview of the licensing in the Dynaword. To get the exact license of an individual dataset, check the overview table. These licenses apply to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under CC-0.
Below follows a brief overview of the sources in the corpus along with their individual licenses. To get more information about an individual dataset, click the hyperlink in the table.
Overview Table
You can learn more about each dataset by pressing the link in the first column.
| Source | Description | Domain | Language | N. Tokens | License | 
|---|---|---|---|---|---|
| uspto_filtered | In the United States, patent documents are released into the public domain as government works | Legal | English | 142.39B | Public Domain | 
| stackv2_edu_filtered | Stack V2 Edu is a dataset containing files in various programming and markup languages from openly licensed projects | Other | English | 62.39B | Various - MIT, BSD-3-Clause, Apache-2.0, etc. | 
| peS2o_filtered | A set of openly licensed scientific articles | Scientific | English | 39.51B | CC-BY-SA 4.0 | 
| pubmed_filtered | A set of permissively licensed papers collected from PubMed Central | Medical | English | 35.32B | CC-BY-SA 4.0 |
| stackexchange_filtered | StackExchange is a collection of Q&A communities spanning a wide variety of topics | Conversation | English | 21.83B | CC-BY-SA 4.0 | 
| caselaw_access_project_filtered | The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 365 years | Legal | English | 17.36B | CC-0 | 
| wikimedia_filtered | Official Wikimedia wikis are released under a CC BY-SA license | Encyclopedic | English | 14.08B | CC-BY-SA 4.0 | 
| cccc_filtered | A dataset consisting of permissively licensed web pages processed from common crawl | Web | English | 13.99B | CC-0 | 
| pre_1929_books_filtered | A set of books published in the US pre-1929 | Books | English | 10.56B | CC-0 | 
| github_archive_filtered | A large set of issues and pull request descriptions along with their comments | Conversation | English | 10.21B | Various - MIT, BSD-3-Clause, Apache-2.0, etc. | 
| biodiversity_heritage_library_filtered | A set of ~15 million public domain books and documents from the BHL collection | Books | English | 8.62B | CC-0 | 
| library_of_congress_filtered | A large set of public domain books from the "Selected Digitized Books" collection | Books | English | 8.06B | CC-0 | 
| usgpo_filtered | The United States Government Publishing Office (USGPO) is a federal agency responsible for disseminating official documents authored by the U.S. government | Governmental | English | 7.78B | Public Domain | 
| arxiv_papers_filtered | A set of public domain papers published on arXiv | Scientific | English | 6.11B | CC-0 |
| project_gutenberg_filtered | All books from Project Gutenberg marked as English and public domain | Books | English | 4.91B | CC-0 |
| youtube_filtered | YouTube is a large-scale video-sharing platform where users have the option of uploading content under a CC BY license | Speeches | English | 4.07B | CC-BY-SA 4.0 | 
| wikiteam_filtered | There are many wikis on the internet that are not managed by the Wikimedia Foundation, but do use their MediaWiki software to power their wiki | Encyclopedic | English | 2.94B | CC-BY-SA 4.0 | 
| doab_filtered | The Directory of Open Access Books (DOAB) is an online index of over 94,000 peer-reviewed books curated from trusted open-access publishers | Books | English | 2.80B | CC-BY-SA 4.0 | 
| cvr-reports | Annual reports from Danish companies in the period 2010-2025 | Financial | Danish | 2.32B | Verbal agreement |
| uk_hansard_filtered | Hansard represents the official record of parliamentary proceedings across the United Kingdom’s legislative bodies | Governmental | English | 2.01B | Open Parliament License | 
| ubuntu_irc_filtered | Ubuntu-hosted Internet Relay Chat (IRC) is an online chat service | Conversation | English | 1.76B | Public Domain | 
| regulations_filtered | This dataset includes all plain-text regulatory documents published by a variety of U.S. federal agencies on Regulations.Gov | Governmental | English | 1.28B | Public Domain | 
| cellar | The official digital repository for European Union legal documents and open data | Legal | Danish | 1.15B | CC-BY-SA 4.0 | 
| enevaeldens_nyheder | High quality OCR'd texts from Danish and Norwegian newspapers during the period of constitutional absolutism in Denmark (1660–1849) | News | Danish | 1.03B | CC-0 | 
| plandata | A comprehensive dataset consisting of municipal planning documents from across Denmark, including local development plans, municipal plans, planning strategies, and related document types | Governmental | Danish | 1.03B | Written agreement (public models, private data) | 
| retsinformationdk | retsinformation.dk (legal-information.dk) is the official legal information system of Denmark | Legal | Danish | 818.25M | Danish Copyright Law |
| data_provenance_initiative_filtered | The Data Provenance Initiative is a digital library of supervised datasets that have been manually annotated with their source and license information | Other | English | 817.36M | CC-0 | 
| dbc-abstracts | dbc-abstracts consists of more than 11.6 million abstracts of books and other materials collected and created by DBC D1G1TAL (former Dansk Bibliotekscenter) | Books | Danish | 694.42M | Written agreement (public models, private data) | 
| danish-pd | PleIAs - Danish Public Domain is a large collection aiming to aggregate all Danish monographies and periodicals in the public domain | Books | Danish | 532.43M | Public Domain | 
| ncc_books | Danish books extracted from the Norwegian Colossal Corpus derived from OCR | Books | Danish | 531.97M | CC-0 | 
| arxiv_abstracts_filtered | A set of public domain arXiv paper abstracts | Scientific | English | 524.45M | CC-0 |
| hest | Samples from the Danish debate forum www.heste-nettet.dk | Social Media | Danish | 389.32M | CC-0 | 
| ncc_parliament | Collections from the Norwegian parliament in Danish, extracted from the Norwegian Colossal Corpus and derived from OCR | Other | Danish | 338.87M | NLOD 2.0 |
| opensubtitles | Danish subsection of OpenSubtitles | Conversation | Danish | 271.60M | CC-0 | 
| wiki | The Danish subsection of Wikipedia | Encyclopedic | Danish | 172.43M | CC-0 |
| ai-aktindsigt | Multiple web scrapes from municipality websites collected as a part of the AI-aktindsigt project | Web | Danish | 139.23M | Apache 2.0 | 
| miljoeportalen | Data from Danmarks Miljøportalen (Denmark's Environment Portal) | Legal | Danish | 127.38M | CC-0 | 
| pressbooks_filtered | A set of openly licensed books | Books | English | 125.65M | CC-BY-SA 4.0 | 
| skat | Skat is the Danish tax authority. This dataset contains content from its website skat.dk | Legal | Danish | 122.11M | CC-0 | 
| ft | Records from all meetings of The Danish parliament (Folketinget) in the parliament hall | Conversation | Danish | 114.09M | CC-0 | 
| memo | The MeMo corpus comprising almost all Danish novels from the period 1870-1899, known as the Modern Breakthrough | Books | Danish | 113.74M | CC-BY-SA 4.0 | 
| ep | The Danish subsection of Europarl | Conversation | Danish | 100.84M | CC-0 | 
| domsdatabasen | Domsdatabasen.dk is a public database containing selected judgments from the Danish courts | Legal | Danish | 86.35M | Danish Copyright Law | 
| libretexts_filtered | A catalog of open-access text books | Books | English | 84.19M | CC-BY-SA 4.0 | 
| dsk-dkmedier | A collection of ~100K news articles from DK Medier, written in the period 2000-2024 | News | Danish | 63.64M | DSK-1 | 
| adl | Danish literature from 1700-2023 from the Archive for Danish Literature (ADL) | Books | Danish | 58.49M | CC-0 | 
| retspraksis | Case law or judicial practice in Denmark derived from Retspraksis | Legal | Danish | 56.26M | CC-0 |
| dbc-reviews | dbc-reviews consists of more than 214 thousand reviews of books and other materials collected and created by DBC D1G1TAL (former Dansk Bibliotekscenter) | Books | Danish | 53.96M | Written agreement (public models, private data) | 
| news_filtered | A set of news stories, scraped from opennewswire | News | English | 53.77M | CC-BY-SA 4.0 | 
| fm-udgivelser | The official publication series of the Danish Ministry of Finance containing economic analyses, budget proposals, and fiscal policy documents | Legal | Danish | 50.34M | CC-BY-SA 4.0 | 
| nordjyllandnews | Articles from the Danish Newspaper TV2 Nord | News | Danish | 37.90M | CC-0 | 
| eur-lex-sum-da | The Danish subsection of EUR-lex SUM consisting of EU legislation paired with professionally written summaries | Legal | Danish | 31.37M | CC-BY-SA 4.0 | 
| ncc_maalfrid | Danish content from Norwegian institutions websites | Web | Danish | 29.26M | NLOD 2.0 | 
| dsk-vejle | A collection of crawled webpages that is managed by Vejle Kommune. Contains various information, covering everything from tourists to garbage collection to historical knowledge of the area | Web | Danish | 27.99M | DSK-1 | 
| health_hovedstaden | Guidelines and informational documents for healthcare professionals from the Capital Region | Medical | Danish | 27.07M | CC-0 | 
| tv2r | Contemporary Danish newswire articles published between 2010 and 2019 | News | Danish | 21.67M | CC-BY-SA 4.0 | 
| oercommons_filtered | OERCommons is an online platform where educators share open-access instructional materials—such as textbooks, lesson plans, problem sets, course syllabi, and worksheets—with the goal of expanding access to affordable education | Books | English | 10.82M | CC-BY-SA 4.0 | 
| grundtvig | The complete collection of Grundtvig (1783-1872), one of Denmark’s most influential figures | Books | Danish | 10.53M | CC-0 |
| dsk-salling | A collection of crawled webpages that is managed by Salling Group. The dataset consists mainly of product pages from online stores such as bilka.dk and br.dk. The data consists of ~24K webpages | Web | Danish | 9.79M | DSK-1 |
| danske-taler | Danish Speeches from dansketaler.dk | Conversation | Danish | 8.72M | CC-0 | 
| wikibooks | The Danish Subsection of Wikibooks | Books | Danish | 7.63M | CC-0 | 
| nota | The text-only part of the Nota lyd- og tekstdata dataset | Readaloud | Danish | 7.30M | CC-0 |
| gutenberg | The Danish subsection from Project Gutenberg | Books | Danish | 6.76M | Gutenberg | 
| wikisource | The Danish subsection of Wikisource | Encyclopedic | Danish | 6.28M | CC-0 | 
| dsk-cbrain | A collection of Marketing material, product guides, and datasheets produced by cBrain for their products | Other | Danish | 4.19M | DSK-1 | 
| jvj | The works of the Danish author and poet, Johannes V. Jensen | Books | Danish | 3.55M | CC-BY-SA 4.0 | 
| dsk-atp | A collection of crawled webpages that is managed by ATP | Web | Danish | 2.86M | DSK-1 | 
| python_enhancement_proposals_filtered | This set consists of almost all PEPs created | Technical | English | 2.54M | Public Domain | 
| dbc-faktalink | dbc-faktalink consists of more than 500 articles created by DBC D1G1TAL (former Dansk Bibliotekscenter) | Encyclopedic | Danish | 1.99M | Written agreement (public models, private data) |
| spont | Conversational samples collected as a part of research projects at Aarhus University | Conversation | Danish | 1.56M | CC-0 | 
| public_domain_review_filtered | A set of articles describing works of art that are in the public domain | Other | English | 1.51M | CC-BY-SA 4.0 |
| dannet | DanNet is a Danish WordNet | Other | Danish | 1.48M | DanNet 1.0 | 
| dbc-forfatterweb | dbc-forfatterweb consists of more than 1 thousand articles created by DBC D1G1TAL (former Dansk Bibliotekscenter) | Encyclopedic | Danish | 1.42M | Written agreement (public models, private data) | 
| relig | Danish religious texts from 1700-2022 | Books | Danish | 1.24M | CC-0 |
| dsk-odense | A set of newsletter stories covering events in Odense Municipality, published on its website | News | Danish | 1.18M | DSK-1 |
| dsk-danskerhverv | A set of newsletters written by Dansk Erhverv, primarily focusing on financials and companies worldwide | News | Danish | 1.12M | DSK-1 |
| ncc_newspaper | OCR'd Newspapers derived from NCC | News | Danish | 1.05M | CC-0 | 
| dsk-plesner | A combination of crawled webpages from Plesner's own website and a series of internal documents outlining procedures | Other | Danish | 896.33K | DSK-1 |
| botxt | The Bornholmsk Ordbog Dictionary Project | Dialect | Danish | 847.97K | CC-0 | 
| dsk-alexandra | A collection of crawled webpages that is managed by Alexandra Institutet | Web | Danish | 584.35K | DSK-1 | 
| dsk-vitec | A collection of documents ranging from product descriptions to newsletters to internal documentation | Other | Danish | 537.07K | DSK-1 |
| dsk-ida | A collection of newsletters, articles and other texts produced by IDA | News | Danish | 417.32K | DSK-1 | 
| naat | Danish speeches from 1930-2022 | Conversation | Danish | 286.68K | CC-0 | 
| depbank | The Danish subsection of the Universal Dependencies Treebank | Other | Danish | 185.45K | CC-BY-SA 4.0 | 
| dsk-hofor | A collection of articles, guides and newsletters written by HOFOR for their customers | Other | Danish | 143.49K | DSK-1 | 
| synne | Dataset collected from synnejysk forening's website, covering the Danish dialect sønderjysk | Other | Danish | 52.02K | CC-0 | 
| Total | | | | 430.24B | |
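
The token counts in the overview table use K, M and B suffixes, so comparing or aggregating sources requires converting them to plain numbers first. The following is a minimal sketch, not the tooling used to build the table: it parses a handful of rows copied from the table above and sums tokens per language. The helper names (`parse_count`, `parse_row`, `tokens_by`) are hypothetical, and the sketch assumes no cell contains a `|` character.

```python
from collections import defaultdict

# Multipliers for the suffixes used in the "N. Tokens" column.
UNITS = {"K": 1e3, "M": 1e6, "B": 1e9}

# A few rows copied from the overview table (illustrative subset only).
SAMPLE_ROWS = [
    "| arxiv_abstracts_filtered | A set of public domain arxiv paper abstracts | Scientific | English | 524.45M | CC-0 |",
    "| hest | Samples from the Danish debate forum www.heste-nettet.dk | Social Media | Danish | 389.32M | CC-0 |",
    "| wiki | The Danish subsection of wikipedia | Encyclopedic | Danish | 172.43M | CC-0 |",
    "| naat | Danish speeches from 1930-2022 | Conversation | Danish | 286.68K | CC-0 |",
]


def parse_count(text: str) -> float:
    """Convert a count such as '389.32M' into a plain number."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def parse_row(line: str) -> dict:
    """Split one six-column table row into named fields."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    source, description, domain, language, tokens, license_name = cells
    return {
        "source": source,
        "description": description,
        "domain": domain,
        "language": language,
        "tokens": parse_count(tokens),
        "license": license_name,
    }


def tokens_by(rows: list, key: str) -> dict:
    """Sum token counts grouped by the given column (e.g. 'language')."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["tokens"]
    return dict(totals)


rows = [parse_row(line) for line in SAMPLE_ROWS]
totals = tokens_by(rows, "language")
for language, count in sorted(totals.items()):
    print(f"{language}: {count / 1e6:.2f}M tokens")
```

Passing `"domain"` or `"license"` as the key groups the same rows by those columns instead. Since the table values are rounded, sums computed this way are approximate.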
The following plot shows per-dataset histograms of document lengths.
Currently no citation information is provided.
We do not own any of the text from which the data has been extracted. If you believe that we are not allowed to train on any of the datasets listed, please contact us.
Notice: If you believe that our data contains material that you own and that should therefore not be included in the training of LLMs, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
 - Clearly identify the copyrighted work claimed to be infringed.
 - Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
 
You can contact us by opening an issue.
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
