Ungoliant v1.1.0
This is the second release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl.
Ungoliant also includes already established pipeline(s), in particular to generate [OSCAR][oscar]-like corpora.
Ungoliant also replaces goclassy.
Get the release from the Releases tab or via cargo: `cargo install ungoliant`.
Features
Ungoliant v1.1.0 features a new pipeline that produces document-oriented corpora instead of the previous line-oriented corpora.
The changes include:
- New corpus format, where content and metadata are merged into a single JSONLines-formatted file per language,
- New multilingual corpus, with documents containing lines in different languages,
- Annotations, enabling filtering of documents based on different criteria (adult content, noisy, short, ...),
- (Unstable) Rebuilding of corpora, with AVRO-based rebuild files containing identifications and annotations (needs a copy of the related Common Crawl).
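
To illustrate how annotations can drive filtering on the consumer side, here is a minimal Python sketch that reads a JSONLines stream and drops documents carrying unwanted annotations. The field names (`content`, `metadata`, `annotation`) and the annotation labels (`noisy`, `short`, `adult`) are assumptions made for this example, not a statement of the exact schema; check them against a real corpus file before relying on them.

```python
import io
import json
from typing import Iterable, Iterator


def filter_documents(lines: Iterable[str], excluded: set[str]) -> Iterator[dict]:
    """Yield documents that carry none of the excluded annotations.

    Assumes one JSON object per line, with annotations stored as a list
    (or null) under doc["metadata"]["annotation"]. This layout is an
    assumption for illustration purposes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        # A missing or null annotation field is treated as "no annotations".
        annotations = (doc.get("metadata") or {}).get("annotation") or []
        if excluded.isdisjoint(annotations):
            yield doc


# Hypothetical sample data standing in for a per-language corpus file.
SAMPLE = "\n".join([
    json.dumps({"content": "A clean document.", "metadata": {"annotation": None}}),
    json.dumps({"content": "Hi", "metadata": {"annotation": ["short"]}}),
    json.dumps({"content": "Some noise", "metadata": {"annotation": ["noisy", "short"]}}),
])

kept = list(filter_documents(io.StringIO(SAMPLE), excluded={"noisy", "adult"}))
print([doc["content"] for doc in kept])
```

For a real corpus, the same function can be fed an open file handle instead of `io.StringIO`, since it only needs an iterable of lines.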