Releases: oscar-project/ungoliant

Ungoliant v2.0.0

24 Feb 12:20

What's Changed

  • Dynamic language tag handling by @Uinelj in #57
  • Validate/Fix rebuilding for OSCAR Doc by @Uinelj in #65
  • KenLM based content detection by @Uinelj in #72
  • Locality sensitive hashing annotation by @Uinelj in #69
  • Fix bug in MeanLength filter by @sadra-barikbin in #71
  • feat(blocklists): ability to use multiple blocklists by @Uinelj in #76
  • Removal of custom domain blocklists from the CLI by @Uinelj in #80
  • refactor: remove old pipelines, old io code and old langtags by @Uinelj in #82
  • Move IO out of Ungoliant by @Uinelj in #83
  • Change annotation to quality_warnings by @Uinelj in #85
  • Move TLSH out of annotations by @Uinelj in #86

Full Changelog: v1.2.3...v2.0.0

Ungoliant v1.2.3

09 May 09:07

What's Changed

  • Update download.rs: change BASE_URL to new address by @qhduan in #52

Full Changelog: v1.2.1...v1.2.3

Ungoliant v1.2.1

03 Mar 15:35

What's Changed

  • feat(blocklist): make blocklist optional and improve error messages by @Uinelj in #46

Full Changelog: v1.1.1...v1.2.1

Ungoliant v1.1.1

01 Mar 10:20
6da9c3c

Ungoliant v1.1.0

28 Feb 11:59
aff79ed

This is the second release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl.
Ungoliant also includes already established pipeline(s), in particular to generate OSCAR-like corpora.

Ungoliant also replaces goclassy.

Get the release from the Releases tab or via cargo: cargo install ungoliant.

Features

Ungoliant v1.1.0 features a new pipeline that produces document-oriented corpora instead of the previous line-oriented corpora.

The changes include:

  • New corpus format where content and metadata are merged into a single, JSONLines-formatted file per language,
  • New multilingual corpus, with documents containing lines in different languages,
  • Annotations, enabling filtering of documents based on different criteria (adult content, noisy, short, etc.),
  • (Unstable) Rebuilding of corpus, with AVRO-based rebuild files containing identifications and annotations. (Needs a copy of the related Common Crawl)
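
As an illustration of how the merged JSONLines format and the annotations can be used together, here is a minimal Python sketch that keeps only documents without unwanted annotations. The field names (`content`, `metadata`, `annotation`) and the annotation labels (`adult`, `noisy`, `short`) are assumptions made for this example and are not confirmed by these release notes; check them against the actual corpus files before relying on them.

```python
import io
import json


def filter_documents(lines, rejected=("adult", "noisy", "short")):
    """Yield parsed documents that carry none of the rejected annotations."""
    rejected = set(rejected)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        # Assumed layout: annotations sit under metadata.annotation
        # and may be missing or null for clean documents.
        annotations = (doc.get("metadata") or {}).get("annotation") or []
        if rejected.isdisjoint(annotations):
            yield doc


# Two hypothetical records, one clean and one annotated.
sample = io.StringIO(
    '{"content": "A clean document.", "metadata": {"annotation": null}}\n'
    '{"content": "x", "metadata": {"annotation": ["short", "noisy"]}}\n'
)
kept = list(filter_documents(sample))
print([doc["content"] for doc in kept])
```

Because each line is a standalone JSON object, the file can be streamed record by record without loading a whole language file into memory.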

Ungoliant v1.0.0

07 Sep 17:31
a9a6421

This is the first release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl.
Ungoliant also includes already established pipeline(s), in particular to generate OSCAR-like corpora.

Ungoliant also replaces goclassy.

Get the release from the Releases tab or via cargo: cargo install ungoliant.

Features

  • Feature: Downloading of CommonCrawl. Ungoliant features an asynchronous multithreaded downloader that is faster than the previous solution used for OSCAR.
  • Feature: Generation of both OSCAR v1 and OSCAR v1.1 corpora. The new OSCAR v1.1 is a backward-compatible corpus that includes metadata.
  • Feature: Deduplication using runiq. Ungoliant currently uses a fork that enables library access.
  • Feature: Splitting, compression and packaging. These three operations help prepare generated corpora for later distribution. Note that these operations are not yet performed on the fly and may need a large amount of free disk space.
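
To give an idea of what line-level deduplication involves, here is a minimal Python sketch that drops repeated lines while preserving first-seen order. It is not runiq's implementation, nor the way Ungoliant calls it; the function name `dedup_lines` and the choice of hashing each line to a 16-byte BLAKE2b digest are assumptions made for this example.

```python
import hashlib


def dedup_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Keep a fixed-size digest instead of the full line, so memory
        # per entry does not grow with line length.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


corpus = ["hello world", "bonjour", "hello world", "hola", "bonjour"]
unique = list(dedup_lines(corpus))
print(unique)
```

Storing digests trades a very small collision risk for bounded memory per line, which matters on web-scale corpora.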

Changes

These changes are feature evolutions from goclassy:

  • Pipelines and tools are available through the ungoliant command-line interface.
  • Downloading and compiling fasttext is no longer needed. Be sure to have cmake installed if you plan on compiling Ungoliant yourself.
  • General performance improvements when using the implemented pipelines. This was made possible by finer-grained multithreading, using rayon.rs.