fido-ai/ua-datasets

ua_datasets

UA-datasets provides ready-to-use Ukrainian NLP benchmark datasets with a single, lightweight Python API.

It gives fast access to Question Answering, News Classification, and POS Tagging corpora, with automatic download, caching, and consistent iteration.

Why use this library?

  • Unified API: All datasets expose len(ds), indexing, iteration, and simple frequency helpers.
  • Robust downloads: Automatic retries, integrity guards, and filename fallbacks for legacy splits.
  • Zero heavy deps: Pure Python + standard library (core loaders) for quick startup.
  • Repro friendly: Validation split for UA-SQuAD; classification CSV parsing with resilience to minor format drift.
  • Tooling ready: Works seamlessly with ruff, mypy, pytest, and uv-based workflows.

Maintained by the FIdo.ai research group (National University of Kyiv-Mohyla Academy).

Minimal Example

# Assumes `uv` workspace already synced with `uv sync` and project installed.

from pathlib import Path
from ua_datasets.question_answering import UaSquadDataset
from ua_datasets.text_classification import NewsClassificationDataset
from ua_datasets.token_classification import MovaInstitutePOSDataset

# Question Answering (first HF-style example dict)
qa = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print("QA examples:", len(qa))
example = qa[0]
print(example.keys())  # id, title, context, question, answers, is_impossible
print(example["question"], "->", example["answers"]["text"])  # list of accepted answers

# News Classification
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, target, tags = news[0]
print("Label count:", len(news.labels), "First label:", target)

# Part-of-Speech Tagging
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
print(tokens[:8], tags[:8])
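
Because every dataset exposes len(ds), indexing, and iteration, small helpers can be written once and reused across tasks. The following is a minimal sketch, not part of ua_datasets: the names head and frequencies are hypothetical, and the rows are invented stand-ins shaped like the (title, text, target, tags) tuples above, so the snippet runs without downloading anything.

```python
from collections import Counter
from collections.abc import Hashable, Iterable, Sequence
from typing import Any, Callable


def head(dataset: Sequence[Any], n: int = 3) -> list[Any]:
    """Return the first n items using only len() and integer indexing."""
    return [dataset[i] for i in range(min(n, len(dataset)))]


def frequencies(dataset: Iterable[Any], key: Callable[[Any], Hashable]) -> Counter:
    """Count key(item) over a single pass of plain iteration."""
    return Counter(key(item) for item in dataset)


# Stand-in rows shaped like (title, text, target, tags).
# The values are invented for illustration, not real UA-News records.
rows = [
    ("t1", "text one", "sport", None),
    ("t2", "text two", "politics", None),
    ("t3", "text three", "sport", None),
]

first_two = head(rows, 2)
label_counts = frequencies(rows, key=lambda row: row[2])
print(first_two)
print(label_counts.most_common())
```

Assuming the loaders behave as described, passing news in place of rows should work the same way, since the helpers rely only on len(), indexing, and iteration.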

For development commands see the Installation section below.

Installation

Choose one of the following methods.

1. Using uv (recommended)

Add to an existing project:

uv add ua-datasets

2. Using pip (PyPI)

# install
pip install ua_datasets
# upgrade
pip install -U ua_datasets

3. From source (editable install)

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
# Quote the extras spec so shells such as zsh do not expand the brackets.
pip install -e ".[dev]"  # applies only if a dev extra is defined; otherwise use: pip install -e .

Or with uv (editable semantics via local path):

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
uv sync --dev

Latest Updates

| Date | Highlights |
| --- | --- |
| 25-10-2025 | Added validation split for UA-SQuAD and updated package code. |
| 05-07-2022 | Added HuggingFace API for UA-SQuAD (Q&A) and UA-News (Text Classification). |

Available Datasets

| Task | Dataset | Import Class | Splits | Notes |
| --- | --- | --- | --- | --- |
| Question Answering | UA-SQuAD | UaSquadDataset | train, val | SQuAD v2-style examples (is_impossible, multi answers); iteration yields dicts |
| Text Classification | UA-News | NewsClassificationDataset | train, test | CSV (title, text, target[, tags]); optional tag parsing |
| Token Classification | Mova Institute POS | MovaInstitutePOSDataset | (single corpus) | CoNLL-U like POS tagging; yields (tokens, tags) per sentence |
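
The record shapes listed above lend themselves to simple post-processing. The sketch below is illustrative only: answerable and tag_counts are hypothetical helpers rather than library functions, and the sample records are invented to mirror the documented shapes (a dict with is_impossible and answers["text"] for QA, a (tokens, tags) pair per sentence for POS). Real records may carry additional fields.

```python
from collections import Counter

# Invented QA records in the SQuAD v2-style dict shape.
qa_examples = [
    {
        "id": "q1",
        "title": "Київ",
        "context": "Київ є столицею України.",
        "question": "Яке місто є столицею України?",
        "answers": {"text": ["Київ"]},
        "is_impossible": False,
    },
    {
        "id": "q2",
        "title": "Київ",
        "context": "Київ є столицею України.",
        "question": "Скільки мостів у місті?",
        "answers": {"text": []},
        "is_impossible": True,
    },
]

# Invented POS sentences in the (tokens, tags) shape.
pos_sentences = [
    (["Я", "люблю", "каву"], ["PRON", "VERB", "NOUN"]),
    (["Вона", "читає"], ["PRON", "VERB"]),
]


def answerable(examples):
    """Keep only the questions that have an answer in their context."""
    return [ex for ex in examples if not ex["is_impossible"]]


def tag_counts(sentences):
    """Count POS tags over all sentences, checking token/tag alignment."""
    counts = Counter()
    for tokens, tags in sentences:
        if len(tokens) != len(tags):
            raise ValueError("tokens and tags must have the same length")
        counts.update(tags)
    return counts


print([ex["question"] for ex in answerable(qa_examples)])
print(tag_counts(pos_sentences).most_common())
```

Filtering on is_impossible before computing answer-based metrics avoids indexing into an empty answer list for unanswerable questions.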

Contribution

Contributions are welcome, whether you want to update any part of the library or add your own dataset. To get started, open a GitHub issue. Thanks in advance for your contribution!

Citation

@software{ua_datasets_2021,
  author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
  month = oct,
  title = {ua_datasets: a collection of Ukrainian language datasets},
  url = {https://github.com/fido-ai/ua-datasets},
  version = {1.0.0},
  year = {2021}
}