ua_datasets

UA-datasets provides ready-to-use Ukrainian NLP benchmark datasets with a single, lightweight Python API.

It offers fast access to Question Answering, News Classification, and POS Tagging corpora, with automatic download, caching, and consistent iteration.

Why use this library?

Unified API: All datasets expose len(ds), indexing, iteration, and simple frequency helpers.
Robust downloads: Automatic retries, integrity guards, and filename fallbacks for legacy splits.
Zero heavy deps: Pure Python + standard library (core loaders) for quick startup.
Repro friendly: Validation split for UA-SQuAD; classification CSV parsing with resilience to minor format drift.
Tooling ready: Works seamlessly with ruff, mypy, pytest, and uv-based workflows.
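
As a rough illustration of what the unified API makes possible, the sketch below writes one helper against nothing more than len(ds), indexing, and iteration. The `summarize` function and the in-memory `fake_pos` list are hypothetical stand-ins for illustration only; they are not part of ua_datasets, and a plain list is used so the snippet runs without any download.

```python
from collections.abc import Sequence
from typing import Any


def summarize(ds: Sequence[Any], preview: int = 2) -> dict[str, Any]:
    """Summarize any dataset that supports len(), indexing, and iteration."""
    return {
        "size": len(ds),                      # relies on len(ds)
        "first": ds[0] if len(ds) else None,  # relies on indexing
        # relies on iteration; stops after `preview` items
        "preview": [item for _, item in zip(range(preview), ds)],
    }


# Hypothetical stand-in data shaped like (tokens, tags) pairs.
fake_pos = [
    (["Я", "люблю", "Київ"], ["PRON", "VERB", "PROPN"]),
    (["Це", "тест"], ["PRON", "NOUN"]),
    (["Привіт"], ["INTJ"]),
]

summary = summarize(fake_pos)
print(summary["size"], summary["first"][1])
```

Under the stated assumption that every loader exposes these three operations, the same helper should accept a dataset object in place of the list.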

Maintained by the FIdo.ai research group (National University of Kyiv-Mohyla Academy).

Minimal Example

# Assumes `uv` workspace already synced with `uv sync` and project installed.

from pathlib import Path
from ua_datasets.question_answering import UaSquadDataset
from ua_datasets.text_classification import NewsClassificationDataset
from ua_datasets.token_classification import MovaInstitutePOSDataset

# Question Answering (first HF-style example dict)
qa = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print("QA examples:", len(qa))
example = qa[0]
print(example.keys())  # id, title, context, question, answers, is_impossible
print(example["question"], "->", example["answers"]["text"])  # list of accepted answers

# News Classification
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, target, tags = news[0]
print("Label count:", len(news.labels), "First label:", target)

# Part-of-Speech Tagging
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
print(tokens[:8], tags[:8])
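
Once sentences are available as (tokens, tags) pairs, simple corpus statistics need only the standard library. The sketch below counts tag frequencies with `collections.Counter`. The `tag_frequencies` helper and the two hand-written sentences are hypothetical: they stand in for a downloaded `MovaInstitutePOSDataset`, and the tag names are illustrative rather than the corpus's actual tag set.

```python
from collections import Counter
from collections.abc import Iterable


def tag_frequencies(
    sentences: Iterable[tuple[list[str], list[str]]],
) -> Counter:
    """Count how often each tag occurs across (tokens, tags) pairs."""
    counts: Counter = Counter()
    for tokens, tags in sentences:
        # Guard against misaligned sentences before counting.
        if len(tokens) != len(tags):
            raise ValueError("tokens and tags must have the same length")
        counts.update(tags)
    return counts


# Hypothetical sample; replace with a real dataset object if available.
sample = [
    (["Я", "люблю", "Київ"], ["PRON", "VERB", "PROPN"]),
    (["Вона", "читає", "книгу"], ["PRON", "VERB", "NOUN"]),
]

freq = tag_frequencies(sample)
print(freq.most_common(2))
```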

For development commands see the Installation section below.

Installation

Choose one of the following methods.

1. Using uv (recommended)

Add to an existing project:

uv add ua-datasets

2. Using pip (PyPI)

# install
pip install ua_datasets
# upgrade
pip install -U ua_datasets

3. From source (editable install)

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
pip install -e ".[dev]"  # quoted for zsh; the dev extra applies only if the project defines one

Or with uv (editable semantics via local path):

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
uv sync --dev

Latest Updates

Date	Highlights
25-10-2025	Added validation split for UA-SQuAD and updated package code.
05-07-2022	Added HuggingFace API for UA-SQuAD (Q&A) and UA-News (Text Classification).

Available Datasets

Task	Dataset	Import Class	Splits	Notes
Question Answering	UA-SQuAD	`UaSquadDataset`	`train`, `val`	SQuAD v2-style examples (`is_impossible`, multi answers); iteration yields dicts
Text Classification	UA-News	`NewsClassificationDataset`	`train`, `test`	CSV (title, text, target[, tags]); optional tag parsing
Token Classification	Mova Institute POS	`MovaInstitutePOSDataset`	(single corpus)	CoNLL-U like POS tagging; yields (tokens, tags) per sentence
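
Because UA-SQuAD examples follow the SQuAD v2 style, some questions may be marked `is_impossible` and carry no answer. The sketch below shows one way to handle both cases. It assumes only the dict shape shown in the Minimal Example (`question`, `answers["text"]`, `is_impossible`); the `first_answer` helper and the two sample dicts are hypothetical and not part of the library.

```python
from typing import Any, Optional


def first_answer(example: dict[str, Any]) -> Optional[str]:
    """Return the first accepted answer, or None for unanswerable questions."""
    if example.get("is_impossible"):
        return None
    texts = example.get("answers", {}).get("text", [])
    return texts[0] if texts else None


# Hypothetical examples mimicking the documented keys.
answerable = {
    "question": "Яке місто є столицею України?",
    "context": "Київ є столицею України.",
    "answers": {"text": ["Київ"]},
    "is_impossible": False,
}
unanswerable = {
    "question": "Хто написав цей текст?",
    "context": "Київ є столицею України.",
    "answers": {"text": []},
    "is_impossible": True,
}

for ex in (answerable, unanswerable):
    print(ex["question"], "->", first_answer(ex))
```

Returning None for unanswerable questions keeps the caller's check explicit, which avoids treating an empty string as a valid answer.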

Contribution

If you would like to contribute (update any part of the library or add your own dataset), do not hesitate to get in touch by opening a GitHub issue. Thanks in advance for your contribution!

Citation

@software{ua_datasets_2021,
  author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
  month = oct,
  title = {ua_datasets: a collection of Ukrainian language datasets},
  url = {https://github.com/fido-ai/ua-datasets},
  version = {1.0.0},
  year = {2021}
}
