Ungoliant v1.1.0

@Uinelj Uinelj released this 28 Feb 11:59
aff79ed

This is the second release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl.
Ungoliant also includes ready-made pipelines, in particular for generating OSCAR-like corpora.

Ungoliant also replaces goclassy.

Get the release from the Releases tab or via cargo: cargo install ungoliant.

Features

Ungoliant v1.1.0 features a new pipeline that produces document-oriented corpora instead of the previous line-oriented corpora.

The changes include:

  • New corpus format where content and metadata are merged into a single, JSONLines-formatted file per language,
  • New multilingual corpus, with documents containing lines in different languages,
  • Annotations, enabling filtering of documents based on different criteria (adult content, noisy, short, etc.),
  • (Unstable) Corpus rebuilding, with Avro-based rebuild files containing identifications and annotations (requires a copy of the corresponding Common Crawl data).
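
As an illustration of how the merged JSONLines format and the annotations could be used together, here is a minimal Python sketch that drops documents carrying unwanted annotations. The field names ("content", "metadata", "annotation") and the label values ("adult", "noisy", "short") are assumptions made for this example, not taken from these notes; check them against the actual corpus files before relying on them.

```python
import json
from typing import Iterable, Iterator


def filter_documents(lines: Iterable[str], rejected: set[str]) -> Iterator[dict]:
    """Yield documents that carry none of the `rejected` annotations.

    Each input line is expected to be one JSON document (JSONLines).
    Assumed, hypothetical layout: doc["metadata"]["annotation"] is either
    a list of label strings or null/absent when there is no annotation.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        annotations = (doc.get("metadata") or {}).get("annotation") or []
        if rejected.isdisjoint(annotations):
            yield doc


# Small in-memory sample standing in for one per-language corpus file.
sample = [
    json.dumps({"content": "A clean document.", "metadata": {"annotation": None}}),
    json.dumps({"content": "Buy now!!!", "metadata": {"annotation": ["noisy", "short"]}}),
    json.dumps({"content": "Another clean one.", "metadata": {"annotation": ["short"]}}),
]

kept = list(filter_documents(sample, rejected={"adult", "noisy"}))
for doc in kept:
    print(doc["content"])
```

To process a real corpus file, pass an open file handle instead of the sample list, since iterating over a text file yields one line at a time.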