Releases: bitextor/warc2text

v1.5.0

30 Sep 09:27

What's Changed

  • HTTP responses dechunking and decompressing by @nvanva in #76
  • Add max-record-size option by @ZJaume in #75
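
As an illustration of what dechunking and decompressing an HTTP response body involves, here is a minimal Python sketch. It is not the warc2text implementation (which is written in C++); the function names `dechunk` and `decode_body` are hypothetical, header names are assumed to be lower-cased already, and only gzip content encoding is handled.

```python
import gzip


def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-encoded body (hypothetical helper)."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            break  # last chunk reached; any trailers are ignored in this sketch
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def decode_body(body: bytes, headers: dict[str, str]) -> bytes:
    """Undo transfer encoding first, then content encoding.

    Assumes header names in `headers` are already lower-cased.
    """
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = dechunk(body)
    if headers.get("content-encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return body
```

The order matters: chunking is applied on top of the compressed stream, so it has to be removed before decompression.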

Full Changelog: v1.4.0...v1.5.0

v1.4.0

04 Sep 14:25

Full Changelog: v1.3.1...v1.4.0

v1.3.1

07 Feb 14:55

Full Changelog: v1.3.0...v1.3.1

v1.3.0

06 Feb 16:32

What's Changed

  • Fail when WARC file, tagfilters or urlfilters can't be opened by @ZJaume in #55
  • Replace boost::json with nlohmann::json (adds the --encoding-errors handling option and no longer produces invalid UTF-8) by @ZJaume in #57
  • EasyBuild configs and installation instructions in the README by @nvanva in #60
  • Filter by http status code by @ZJaume in #61
  • Recover after a WARC file fails to be opened by @ZJaume in #63
  • Add detected encoding to the metadata by @ZJaume in #64
  • Fix html missing in JSONL stdout when skipping extraction by @ZJaume in #66
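
To show the general idea behind handling encoding errors so that output is always valid UTF-8, here is a small Python sketch. It is an assumption-laden illustration only: the helper name `to_valid_utf8` is hypothetical, and the modes shown are Python's own decoder error handlers, not the values accepted by warc2text's `--encoding-errors` option, which these notes do not list.

```python
def to_valid_utf8(raw: bytes, errors: str = "replace") -> str:
    """Decode bytes as UTF-8 with an explicit policy for invalid sequences.

    `errors` is one of Python's codec error handlers:
      "replace" substitutes U+FFFD for each invalid sequence,
      "ignore"  drops invalid bytes,
      "strict"  raises UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors=errors)


# 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence on its own.
broken = b"caf\xe9 ok"
replaced = to_valid_utf8(broken)            # invalid byte becomes U+FFFD
dropped = to_valid_utf8(broken, "ignore")   # invalid byte is removed
```

Whichever policy is chosen, the resulting string can be re-encoded as UTF-8 without error, which is what makes it safe to embed in JSON.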

Full Changelog: v1.2.0...v1.3.0

v1.2.0

02 Feb 14:41

What's Changed

  • Add --robotspass shunt for records related to robots.txt by @jelmervdl in #43
  • Add --jsonl option by @jelmervdl in #35
  • warc2html changes by @ZJaume in #50
  • ZSTD compression and compression level support by @ZJaume in #51
  • Move JSONL output to --stdout and allow file-based output with JSONL by @ZJaume in #52
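
For readers unfamiliar with JSONL, the format is one JSON object per line, which makes it easy to consume as a stream. The sketch below shows one way to read such output in Python. The helper name `read_jsonl` is hypothetical, and no particular field names are assumed, since these notes do not describe the record schema.

```python
import json
from typing import Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of `stream`."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Typical use, with JSONL piped in on standard input:
#
#     import sys
#     for record in read_jsonl(sys.stdin):
#         print(sorted(record.keys()))
```

Because records are parsed lazily, memory use stays flat even for very large inputs.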

Full Changelog: v1.1.0...v1.2.0

v1.1.0: Merge pull request #36 from jelmervdl/fasttext-option

01 Aug 13:09
eac887e

Changes:

  • Add option to use a FastText model as a language identifier
  • Records identified by CLD2 as Unknown are classified as unk instead of being dropped.

v1.0.0

01 Aug 13:08
673e371

Paragraph indexes now start at 1.