Releases: bitextor/warc2text
v1.5.0
v1.4.0
What's Changed
- EB config for warc2text 1.3.0 by @nvanva in #67
- Add easyconfigs for LUMI by @maciejjan in #69
- Allow piped input by @ZJaume in #71
- New option to get non-zero exit codes if any WARC fails to read by @ZJaume in #73
New Contributors
- @maciejjan made their first contribution in #69
Full Changelog: v1.3.1...v1.4.0
v1.3.1
Full Changelog: v1.3.0...v1.3.1
v1.3.0
What's Changed
- Fail when WARC file, tagfilters or urlfilters can't be opened by @ZJaume in #55
- Replacing boost::json with nlohmann::json (added --encoding-errors handling option and not producing invalid utf8 anymore) by @ZJaume in #57
- EasyBuild configs and installation instructions in the README by @nvanva in #60
- Filter by http status code by @ZJaume in #61
- Recover after a WARC file fails to be opened by @ZJaume in #63
- Add detected encoding to the metadata by @ZJaume in #64
- Fix html missing in JSONL stdout when skipping extraction by @ZJaume in #66
Full Changelog: v1.2.0...v1.3.0
v1.2.0
What's Changed
- Add --robotspass shunt for records related to robots.txt by @jelmervdl in #43
- Add --jsonl option by @jelmervdl in #35
- warc2html changes by @ZJaume in #50
- ZSTD compression and compression level support by @ZJaume in #51
- Move JSONL output to --stdout and allow file-based output with JSONL by @ZJaume in #52
Full Changelog: v1.1.0...v1.2.0
v1.1.0: Merge pull request #36 from jelmervdl/fasttext-option
Changes:
- Add option to use a FastText model as a language identifier
- Records identified by CLD2 as Unknown are classified as unk instead of dropped.