danish-foundation-models/datasheets

license: other
configs:
config_name set
adl dfm
botxt dfm
dannet dfm
depbank dfm
ep dfm
ft dfm
gutenberg dfm
hest dfm
jvj dfm
naat dfm
relig dfm
retsinformationdk dfm
retspraksis dfm
skat dfm
spont dfm
synne dfm
tv2r dfm
wiki dfm
wikibooks dfm
wikisource dfm
dsk-alexandra dfm
dsk-atp dfm
dsk-cbrain dfm
dsk-danskerhverv dfm
dsk-dkmedier dfm
dsk-hofor dfm
dsk-ida dfm
dsk-odense dfm
dsk-plesner dfm
dsk-salling dfm
dsk-vejle dfm
dsk-vitec dfm
plandata dfm
ai-aktindsigt dfm
danske-taler dfm
fm-udgivelser dfm
eur-lex-sum-da dfm
memo dfm
miljoeportalen dfm
nordjyllandnews dfm
nota dfm
opensubtitles dfm
cellar dfm
ncc_books dfm
ncc_maalfrid dfm
ncc_newspaper dfm
ncc_parliament dfm
dbc-abstracts dfm
dbc-faktalink dfm
dbc-forfatterweb dfm
dbc-reviews dfm
danish-pd dfm
cvr-reports dfm
health_hovedstaden dfm
grundtvig dfm
domsdatabasen dfm
enevaeldens_nyheder dfm
arxiv_abstracts_filtered common-pile
arxiv_papers_filtered common-pile
biodiversity_heritage_library_filtered common-pile
caselaw_access_project_filtered common-pile
cccc_filtered common-pile
data_provenance_initiative_filtered common-pile
doab_filtered common-pile
github_archive_filtered common-pile
library_of_congress_filtered common-pile
libretexts_filtered common-pile
news_filtered common-pile
oercommons_filtered common-pile
peS2o_filtered common-pile
pre_1929_books_filtered common-pile
pressbooks_filtered common-pile
project_gutenberg_filtered common-pile
public_domain_review_filtered common-pile
pubmed_filtered common-pile
python_enhancement_proposals_filtered common-pile
regulations_filtered common-pile
stackexchange_filtered common-pile
stackv2_edu_filtered common-pile
ubuntu_irc_filtered common-pile
uk_hansard_filtered common-pile
usgpo_filtered common-pile
uspto_filtered common-pile
wikimedia_filtered common-pile
wikiteam_filtered common-pile
youtube_filtered common-pile
annotations_creators: no-annotation
language_creators: crowdsourced
language: da, en, se, nb, nn
multilinguality: multilingual
source_datasets: original
task_categories: text-generation
task_ids: language-modeling
pretty_name: DFM Datasheets

DFM Datasheets

This repository contains the datasheets for DFM. It documents the datasets used for Danish Foundation Models.

Version: 1.0.1 (Changelog)
License: Not publicly available
Models: Currently, no model trained on the data is publicly available
Contact: If you have questions about this project, please create an issue

Dataset Description

  • Number of samples: 230.07M
  • Number of tokens (Llama 3): 430.24B
  • Average document length in tokens (min, max): 1.87K (1, 51.77M)
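
As a quick sanity check, the average document length can be recomputed from the two totals above. This is a minimal sketch; the variable names are illustrative, and because the inputs are rounded figures the result only approximates the reported value.

```python
# Figures copied from the summary statistics above.
num_samples = 230.07e6   # 230.07M documents
num_tokens = 430.24e9    # 430.24B Llama 3 tokens

# Mean number of tokens per document, computed from the rounded totals.
avg_tokens_per_doc = num_tokens / num_samples
print(f"{avg_tokens_per_doc / 1e3:.2f}K tokens per document")  # prints "1.87K tokens per document"
```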

Summary

DFM Datasheets is a collection of datasheets for the datasets used for Danish Foundation Models. This repository ensures that the data is documented in line with FAIR data practices.

Curation Rationale

These datasets were collected and curated with the intention of developing language models for Danish.

Languages

This dataset includes the following languages:

  • Danish
  • English
  • Swedish
  • Norwegian Bokmål
  • Norwegian Nynorsk

Below is a visualisation of the main languages in each of the datasets.

Domains

This dynaword consists of data from various domains (e.g., legal, books, social media). The following table and figure give an overview of the relative distribution of these domains. For a full overview of the sources, see the Source Data section.

Domain Sources N. Tokens
Legal retsinformationdk, retspraksis, skat, fm-udgivelser, eur-lex-sum-da, miljoeportalen, cellar, domsdatabasen, caselaw_access_project_filtered, uspto_filtered 162.19B
Other dannet, depbank, synne, dsk-cbrain, dsk-hofor, dsk-plesner, dsk-vitec, ncc_parliament, data_provenance_initiative_filtered, public_domain_review_filtered, stackv2_edu_filtered 63.56B
Scientific arxiv_abstracts_filtered, arxiv_papers_filtered, peS2o_filtered 46.15B
Books adl, gutenberg, jvj, relig, wikibooks, memo, ncc_books, dbc-abstracts, dbc-reviews, danish-pd, grundtvig, biodiversity_heritage_library_filtered, doab_filtered, library_of_congress_filtered, libretexts_filtered, oercommons_filtered, pre_1929_books_filtered, pressbooks_filtered, project_gutenberg_filtered 37.19B
Medical health_hovedstaden, pubmed_filtered 35.35B
Conversation ep, ft, naat, spont, danske-taler, opensubtitles, github_archive_filtered, stackexchange_filtered, ubuntu_irc_filtered 34.30B
Encyclopedic wiki, wikisource, dbc-faktalink, dbc-forfatterweb, wikimedia_filtered, wikiteam_filtered 17.21B
Web dsk-alexandra, dsk-atp, dsk-salling, dsk-vejle, ai-aktindsigt, ncc_maalfrid, cccc_filtered 14.20B
Governmental plandata, regulations_filtered, uk_hansard_filtered, usgpo_filtered 12.10B
Speeches youtube_filtered 4.07B
Financial cvr-reports 2.32B
News tv2r, dsk-danskerhverv, dsk-dkmedier, dsk-ida, dsk-odense, nordjyllandnews, ncc_newspaper, enevaeldens_nyheder, news_filtered 1.22B
Social Media hest 389.32M
Readaloud nota 7.30M
Technical python_enhancement_proposals_filtered 2.54M
Dialect botxt 847.97K
Total 430.24B
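
The relative share of each domain can be derived from the token counts in the table. The sketch below is one way to do that; the helper `parse_tokens` and the dictionary names are hypothetical and not part of this repository, and the counts are copied by hand from the table.

```python
# Multipliers for the K/M/B suffixes used in the tables.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_tokens(text: str) -> float:
    """Convert a count such as '389.32M' into a plain number."""
    return float(text[:-1]) * SUFFIXES[text[-1]]


# Domain -> N. Tokens, copied from the table above.
domain_tokens = {
    "Legal": "162.19B",
    "Other": "63.56B",
    "Scientific": "46.15B",
    "Books": "37.19B",
    "Medical": "35.35B",
    "Conversation": "34.30B",
    "Encyclopedic": "17.21B",
    "Web": "14.20B",
    "Governmental": "12.10B",
    "Speeches": "4.07B",
    "Financial": "2.32B",
    "News": "1.22B",
    "Social Media": "389.32M",
    "Readaloud": "7.30M",
    "Technical": "2.54M",
    "Dialect": "847.97K",
}

counts = {domain: parse_tokens(text) for domain, text in domain_tokens.items()}
total = sum(counts.values())

# Percentage of all tokens per domain, largest first.
shares = {domain: 100 * count / total for domain, count in counts.items()}
for domain, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{domain:<13} {share:6.2f}%")
```

Because every row is rounded, the summed total differs slightly from the reported 430.24B; the shares are therefore approximate.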

Licensing

The following gives an overview of the licensing in the Dynaword. To get the exact license of each individual dataset, see the overview table. These licenses apply to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under CC-0.

License Sources N. Tokens
Public Domain danish-pd, python_enhancement_proposals_filtered, regulations_filtered, ubuntu_irc_filtered, usgpo_filtered, uspto_filtered 153.74B
CC-BY-SA 4.0 depbank, jvj, tv2r, fm-udgivelser, eur-lex-sum-da, memo, cellar, doab_filtered, libretexts_filtered, news_filtered, oercommons_filtered, peS2o_filtered, pressbooks_filtered, public_domain_review_filtered, pubmed_filtered, stackexchange_filtered, wikimedia_filtered, wikiteam_filtered, youtube_filtered 122.21B
CC-0 adl, botxt, ep, ft, hest, naat, relig, retspraksis, skat, spont, synne, wiki, wikibooks, wikisource, danske-taler, miljoeportalen, nordjyllandnews, nota, opensubtitles, ncc_books, ncc_newspaper, health_hovedstaden, grundtvig, enevaeldens_nyheder, arxiv_abstracts_filtered, arxiv_papers_filtered, biodiversity_heritage_library_filtered, caselaw_access_project_filtered, cccc_filtered, data_provenance_initiative_filtered, library_of_congress_filtered, pre_1929_books_filtered, project_gutenberg_filtered 74.04B
Various - MIT, BSD-3-Clause, Apache-2.0, etc. github_archive_filtered, stackv2_edu_filtered 72.60B
Verbal agreement cvr-reports 2.32B
Open Parliament License uk_hansard_filtered 2.01B
Written agreement (public models, private data) plandata, dbc-abstracts, dbc-faktalink, dbc-forfatterweb, dbc-reviews 1.78B
Other (No attribution required) retsinformationdk, domsdatabasen 904.61M
Other (Attribution required) dannet, gutenberg, ai-aktindsigt, ncc_maalfrid, ncc_parliament 515.61M
DSK-1 dsk-alexandra, dsk-atp, dsk-cbrain, dsk-danskerhverv, dsk-dkmedier, dsk-hofor, dsk-ida, dsk-odense, dsk-plesner, dsk-salling, dsk-vejle, dsk-vitec 113.35M
Total 430.24B
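
A simple consistency check is to confirm that the per-license counts add up to the reported total. The following sketch does this with the values from the table expressed in billions of tokens; the variable names and the rounding tolerance are assumptions chosen for illustration.

```python
# Token counts per license group in billions, copied from the table above.
license_tokens_b = {
    "Public Domain": 153.74,
    "CC-BY-SA 4.0": 122.21,
    "CC-0": 74.04,
    "Various (MIT, BSD-3-Clause, Apache-2.0, etc.)": 72.60,
    "Verbal agreement": 2.32,
    "Open Parliament License": 2.01,
    "Written agreement (public models, private data)": 1.78,
    "Other (No attribution required)": 0.90461,
    "Other (Attribution required)": 0.51561,
    "DSK-1": 0.11335,
}
reported_total_b = 430.24

row_sum_b = sum(license_tokens_b.values())

# Assumed tolerance: each row and the total may be off by up to 0.005B
# because the table reports rounded values.
tolerance_b = 0.005 * (len(license_tokens_b) + 1)
difference_b = abs(row_sum_b - reported_total_b)

print(f"rows sum to {row_sum_b:.2f}B, reported total is {reported_total_b:.2f}B")
assert difference_b <= tolerance_b, "license rows do not match the reported total"
```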

Source Data

Below follows a brief overview of the sources in the corpus along with their individual licenses. For more information about an individual dataset, follow the hyperlink in the table.

Overview Table

You can learn more about each dataset by pressing the link in the first column.

Source Description Domain Language N. Tokens License
uspto_filtered In the United States, patent documents are released into the public domain as government works Legal English 142.39B Public Domain
stackv2_edu_filtered Stack V2 Edu is a dataset containing files in various programming and markup languages from openly licensed projects Other English 62.39B Various - MIT, BSD-3-Clause, Apache-2.0, etc.
peS2o_filtered A set of openly licensed scientific articles Scientific English 39.51B CC-BY-SA 4.0
pubmed_filtered A set of permissively licensed papers collected from PubMed Central Medical English 35.32B CC-BY-SA 4.0
stackexchange_filtered StackExchange is a collection of Q&A communities spanning a wide variety of topics Conversation English 21.83B CC-BY-SA 4.0
caselaw_access_project_filtered The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 365 years Legal English 17.36B CC-0
wikimedia_filtered Official Wikimedia wikis are released under a CC BY-SA license Encyclopedic English 14.08B CC-BY-SA 4.0
cccc_filtered A dataset consisting of permissively licensed web pages processed from Common Crawl Web English 13.99B CC-0
pre_1929_books_filtered A set of books published in the US pre-1929 Books English 10.56B CC-0
github_archive_filtered A large set of issues and pull request descriptions along with their comments Conversation English 10.21B Various - MIT, BSD-3-Clause, Apache-2.0, etc.
biodiversity_heritage_library_filtered A set of ~15 million public domain books and documents from the BHL collection Books English 8.62B CC-0
library_of_congress_filtered A large set of public domain books from the "Selected Digitized Books" collection Books English 8.06B CC-0
usgpo_filtered The United States Government Publishing Office (USGPO) is a federal agency responsible for disseminating official documents authored by the U.S. government Governmental English 7.78B Public Domain
arxiv_papers_filtered A set of public domain papers published on arXiv Scientific English 6.11B CC-0
project_gutenberg_filtered All books from Project Gutenberg marked as English and public domain Books English 4.91B CC-0
youtube_filtered YouTube is a large-scale video-sharing platform where users have the option of uploading content under a CC BY license Speeches English 4.07B CC-BY-SA 4.0
wikiteam_filtered There are many wikis on the internet that are not managed by the Wikimedia Foundation, but do use their MediaWiki software to power their wiki Encyclopedic English 2.94B CC-BY-SA 4.0
doab_filtered The Directory of Open Access Books (DOAB) is an online index of over 94,000 peer-reviewed books curated from trusted open-access publishers Books English 2.80B CC-BY-SA 4.0
cvr-reports Annual reports from Danish companies in the period 2010-2025 Financial Danish 2.32B Verbal agreement
uk_hansard_filtered Hansard represents the official record of parliamentary proceedings across the United Kingdom’s legislative bodies Governmental English 2.01B Open Parliament License
ubuntu_irc_filtered Ubuntu-hosted Internet Relay Chat (IRC) is an online chat service Conversation English 1.76B Public Domain
regulations_filtered This dataset includes all plain-text regulatory documents published by a variety of U.S. federal agencies on Regulations.Gov Governmental English 1.28B Public Domain
cellar The official digital repository for European Union legal documents and open data Legal Danish 1.15B CC-BY-SA 4.0
enevaeldens_nyheder High quality OCR'd texts from Danish and Norwegian newspapers during the period of constitutional absolutism in Denmark (1660–1849) News Danish 1.03B CC-0
plandata A comprehensive dataset consisting of municipal planning documents from across Denmark, including local development plans, municipal plans, planning strategies, and related document types Governmental Danish 1.03B Written agreement (public models, private data)
retsinformationdk retsinformation.dk (legal-information.dk), the official legal information system of Denmark Legal Danish 818.25M Danish Copyright Law
data_provenance_initiative_filtered The Data Provenance Initiative is a digital library of supervised datasets that have been manually annotated with their source and license information Other English 817.36M CC-0
dbc-abstracts dbc-abstracts consists of more than 11.6 million abstracts of books and other materials collected and created by DBC D1G1TAL (former Dansk Bibliotekscenter) Books Danish 694.42M Written agreement (public models, private data)
danish-pd PleIAs - Danish Public Domain is a large collection aiming to aggregate all Danish monographies and periodicals in the public domain Books Danish 532.43M Public Domain
ncc_books Danish books extracted from the Norwegian Colossal Corpus derived from OCR Books Danish 531.97M CC-0
arxiv_abstracts_filtered A set of public domain arXiv paper abstracts Scientific English 524.45M CC-0
hest Samples from the Danish debate forum www.heste-nettet.dk Social Media Danish 389.32M CC-0
ncc_parliament Collections from the Norwegian parliament in Danish. Extracted from the Norwegian Colossal Corpus derived from OCR Other Danish 338.87M NLOD 2.0
opensubtitles Danish subsection of OpenSubtitles Conversation Danish 271.60M CC-0
wiki The Danish subsection of Wikipedia Encyclopedic Danish 172.43M CC-0
ai-aktindsigt Multiple web scrapes from municipality websites collected as a part of the AI-aktindsigt project Web Danish 139.23M Apache 2.0
miljoeportalen Data from Danmarks Miljøportalen (Denmark's Environment Portal) Legal Danish 127.38M CC-0
pressbooks_filtered A set of openly licensed books Books English 125.65M CC-BY-SA 4.0
skat Skat is the Danish tax authority. This dataset contains content from its website skat.dk Legal Danish 122.11M CC-0
ft Records from all meetings of The Danish parliament (Folketinget) in the parliament hall Conversation Danish 114.09M CC-0
memo The MeMo corpus comprising almost all Danish novels from the period 1870-1899, known as the Modern Breakthrough Books Danish 113.74M CC-BY-SA 4.0
ep The Danish subsection of Europarl Conversation Danish 100.84M CC-0
domsdatabasen Domsdatabasen.dk is a public database containing selected judgments from the Danish courts Legal Danish 86.35M Danish Copyright Law
libretexts_filtered A catalog of open-access text books Books English 84.19M CC-BY-SA 4.0
dsk-dkmedier A collection of ~100K news articles from DK Medier, written in the period 2000-2024 News Danish 63.64M DSK-1
adl Danish literature from 1700-2023 from the Archive for Danish Literature (ADL) Books Danish 58.49M CC-0
retspraksis Case law or judicial practice in Denmark derived from Retspraksis Legal Danish 56.26M CC-0
dbc-reviews dbc-reviews consists of more than 214 thousand reviews of books and other materials collected and created by DBC D1G1TAL (former Dansk Bibliotekscenter) Books Danish 53.96M Written agreement (public models, private data)
news_filtered A set of news stories, scraped from opennewswire News English 53.77M CC-BY-SA 4.0
fm-udgivelser The official publication series of the Danish Ministry of Finance containing economic analyses, budget proposals, and fiscal policy documents Legal Danish 50.34M CC-BY-SA 4.0
nordjyllandnews Articles from the Danish Newspaper TV2 Nord News Danish 37.90M CC-0
eur-lex-sum-da The Danish subsection of EUR-lex SUM consisting of EU legislation paired with professionally written summaries Legal Danish 31.37M CC-BY-SA 4.0
ncc_maalfrid Danish content from Norwegian institutions' websites Web Danish 29.26M NLOD 2.0
dsk-vejle A collection of crawled webpages managed by Vejle Kommune. Contains various information, covering everything from tourism to garbage collection to historical knowledge of the area Web Danish 27.99M DSK-1
health_hovedstaden Guidelines and informational documents for healthcare professionals from the Capital Region Medical Danish 27.07M CC-0
tv2r Contemporary Danish newswire articles published between 2010 and 2019 News Danish 21.67M CC-BY-SA 4.0
oercommons_filtered OERCommons is an online platform where educators share open-access instructional materials—such as textbooks, lesson plans, problem sets, course syllabi, and worksheets—with the goal of expanding access to affordable education Books English 10.82M CC-BY-SA 4.0
grundtvig The complete collection of Grundtvig (1783-1872), one of Denmark’s most influential figures Books Danish 10.53M CC-0
dsk-salling A collection of crawled webpages managed by Salling Group. The dataset consists mainly of product pages from online stores such as bilka.dk and br.dk. The data consists of ~24K webpages Web Danish 9.79M DSK-1
danske-taler Danish Speeches from dansketaler.dk Conversation Danish 8.72M CC-0
wikibooks The Danish Subsection of Wikibooks Books Danish 7.63M CC-0
nota The text only part of the Nota lyd- og tekstdata dataset Readaloud Danish 7.30M CC-0
gutenberg The Danish subsection from Project Gutenberg Books Danish 6.76M Gutenberg
wikisource The Danish subsection of Wikisource Encyclopedic Danish 6.28M CC-0
dsk-cbrain A collection of Marketing material, product guides, and datasheets produced by cBrain for their products Other Danish 4.19M DSK-1
jvj The works of the Danish author and poet, Johannes V. Jensen Books Danish 3.55M CC-BY-SA 4.0
dsk-atp A collection of crawled webpages managed by ATP Web Danish 2.86M DSK-1
python_enhancement_proposals_filtered This set consists of almost all PEPs created Technical English 2.54M Public Domain
dbc-faktalink dbc-faktalink consists of more than 500 articles created by DBC D1G1TAL (former Dansk Bibliotekscenter) Encyclopedic Danish 1.99M Written agreement (public models, private data)
spont Conversational samples collected as a part of research projects at Aarhus University Conversation Danish 1.56M CC-0
public_domain_review_filtered A set of articles describing works of art that are in the public domain Other English 1.51M CC-BY-SA 4.0
dannet DanNet is a Danish WordNet Other Danish 1.48M DanNet 1.0
dbc-forfatterweb dbc-forfatterweb consists of more than 1,000 articles created by DBC D1G1TAL (former Dansk Bibliotekscenter) Encyclopedic Danish 1.42M Written agreement (public models, private data)
relig Danish religious texts from 1700-2022 Books Danish 1.24M CC-0
dsk-odense A set of newsletter stories covering events in Odense Municipality, published on the municipality's website News Danish 1.18M DSK-1
dsk-danskerhverv A set of newsletters written by Dansk Erhverv, primarily focusing on financials and companies worldwide News Danish 1.12M DSK-1
ncc_newspaper OCR'd Newspapers derived from NCC News Danish 1.05M CC-0
dsk-plesner A combination of crawled webpages from Plesner's own website and a series of internal documents outlining procedures Other Danish 896.33K DSK-1
botxt The Bornholmsk Ordbog Dictionary Project Dialect Danish 847.97K CC-0
dsk-alexandra A collection of crawled webpages managed by Alexandra Institutet Web Danish 584.35K DSK-1
dsk-vitec A collection of documents ranging from product descriptions to newsletters to internal documentation Other Danish 537.07K DSK-1
dsk-ida A collection of newsletters, articles and other texts produced by IDA News Danish 417.32K DSK-1
naat Danish speeches from 1930-2022 Conversation Danish 286.68K CC-0
depbank The Danish subsection of the Universal Dependencies Treebank Other Danish 185.45K CC-BY-SA 4.0
dsk-hofor A collection of articles, guides and newsletters written by HOFOR for their customers Other Danish 143.49K DSK-1
synne Dataset collected from synnejysk forening's website, covering the Danish dialect sønderjysk Other Danish 52.02K CC-0
Total 430.24B
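
When working with the overview table, it is often useful to select sources by language and license, for example to find the Danish sources released under CC-0. The sketch below illustrates the idea on a handful of rows copied from the table; the `Source` class and the `select` helper are hypothetical and are not provided by this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    domain: str
    language: str
    tokens: float  # number of Llama 3 tokens
    license: str


# A few rows copied from the overview table; this is not the full corpus.
SOURCES = [
    Source("uspto_filtered", "Legal", "English", 142.39e9, "Public Domain"),
    Source("cellar", "Legal", "Danish", 1.15e9, "CC-BY-SA 4.0"),
    Source("hest", "Social Media", "Danish", 389.32e6, "CC-0"),
    Source("wiki", "Encyclopedic", "Danish", 172.43e6, "CC-0"),
    Source("dsk-atp", "Web", "Danish", 2.86e6, "DSK-1"),
]


def select(sources, language=None, licenses=None):
    """Return the sources matching the given language and set of licenses.

    A filter left as None is not applied.
    """
    return [
        s
        for s in sources
        if (language is None or s.language == language)
        and (licenses is None or s.license in licenses)
    ]


danish_cc0 = select(SOURCES, language="Danish", licenses={"CC-0"})
danish_cc0_tokens = sum(s.tokens for s in danish_cc0)
print([s.name for s in danish_cc0])
print(f"{danish_cc0_tokens / 1e6:.2f}M tokens")
```

The same helper can be applied to the full table once its rows are available in structured form.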

Dataset Statistics

The following plot shows per-dataset histograms of document lengths.

Per dataset histograms

Additional Information

Citation Information

Currently no citation information is provided.

Disclaimer

We do not own any of the text from which the data has been extracted. If you believe that we are not allowed to train on any of the datasets noted, please contact us.

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be included in the training of LLMs here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

You can contact us by opening an issue.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.


About

Datasheets and descriptive stats for data used for the DFM project.
