oscar-project/oscar-io

Latest commit

ce19d80 · Nov 9, 2023

oscar-io

Types and IO (Reader/Writer) for OSCAR Corpus processing and generation.

The crate provides basic abstractions around Corpus items, plus generic readers/writers usable with OSCAR Corpus files. Eventually, it should replace the reader implementations in both Ungoliant and oscar-tools.
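
To illustrate what "generic readers/writers" can mean in practice, here is a minimal sketch that uses only the standard library. It assumes a line-delimited format with one item per line. `Record`, `RecordReader` and `RecordWriter` are hypothetical names chosen for this example and are not the oscar-io API.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical corpus item: one line of text (not an oscar-io type).
#[derive(Debug, PartialEq)]
pub struct Record {
    pub content: String,
}

/// Hypothetical reader, generic over any buffered byte source.
pub struct RecordReader<R: BufRead> {
    lines: io::Lines<R>,
}

impl<R: BufRead> RecordReader<R> {
    pub fn new(src: R) -> Self {
        Self { lines: src.lines() }
    }
}

impl<R: BufRead> Iterator for RecordReader<R> {
    type Item = io::Result<Record>;

    /// Yields one record per line, propagating IO errors to the caller.
    fn next(&mut self) -> Option<Self::Item> {
        self.lines
            .next()
            .map(|line| line.map(|content| Record { content }))
    }
}

/// Hypothetical writer, generic over any byte sink.
pub struct RecordWriter<W: Write> {
    dst: W,
}

impl<W: Write> RecordWriter<W> {
    pub fn new(dst: W) -> Self {
        Self { dst }
    }

    /// Writes the record followed by a newline.
    pub fn write(&mut self, record: &Record) -> io::Result<()> {
        writeln!(self.dst, "{}", record.content)
    }

    /// Hands back the underlying sink, e.g. to inspect an in-memory buffer.
    pub fn into_inner(self) -> W {
        self.dst
    }
}
```

Because both types are generic over `BufRead` and `Write`, the same code would work with a file, an in-memory buffer, or a compressing/decompressing wrapper placed around either.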

Features

oscar-io aims to provide readers/writers for numerous types of OSCAR Corpora.

OSCAR v2

  • Reader
    • Uncompressed [oscar_doc::Reader::new]
    • GZipped [oscar_doc::Reader::from_gzip]
    • Parquet
  • Writer
    • Uncompressed [oscar_doc::Writer::new]
    • GZipped [oscar_doc::Writer::new] (using a [GzEncoder] writer; from_gzip not yet implemented)
    • Parquet
  • SplitReader (Should be unified with Reader, using split_size: Option<u64>)
    • Uncompressed
    • GZipped
  • SplitWriter (Same)
    • Uncompressed
    • GZipped
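
The `split_size: Option<u64>` note above suggests that a single type could cover both the split and non-split cases. The following is a rough sketch of that idea for the writer side, using only the standard library. `SplitWriter` here, its `open_part` callback and the `write_line`/`parts` methods are all assumptions made for illustration and do not reflect the crate's actual design.

```rust
use std::io::{self, Write};

/// Hypothetical unified writer: rotates to a new part only when
/// `split_size` is `Some(limit)`; with `None` it writes a single part.
pub struct SplitWriter<W, F> {
    open_part: F,
    split_size: Option<u64>,
    current: W,
    written: u64,
    part: usize,
}

impl<W, F> SplitWriter<W, F>
where
    W: Write,
    F: FnMut(usize) -> io::Result<W>,
{
    /// `open_part` receives the part index and returns the sink for that part.
    pub fn new(mut open_part: F, split_size: Option<u64>) -> io::Result<Self> {
        let current = open_part(0)?;
        Ok(Self { open_part, split_size, current, written: 0, part: 0 })
    }

    /// Writes one line, opening a new part first if the line would push
    /// the current part past the size limit.
    pub fn write_line(&mut self, line: &str) -> io::Result<()> {
        let len = line.len() as u64 + 1; // +1 for the trailing newline
        if let Some(limit) = self.split_size {
            // Never rotate on an empty part, so an oversized line still lands somewhere.
            if self.written > 0 && self.written + len > limit {
                self.current.flush()?;
                self.part += 1;
                self.current = (self.open_part)(self.part)?;
                self.written = 0;
            }
        }
        self.current.write_all(line.as_bytes())?;
        self.current.write_all(b"\n")?;
        self.written += len;
        Ok(())
    }

    /// Number of parts opened so far.
    pub fn parts(&self) -> usize {
        self.part + 1
    }
}
```

With `Some(limit)` the writer opens a new part whenever the next line would exceed the limit; with `None` the same type behaves like a plain writer, which is the kind of unification the list entry hints at.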

OSCAR v1.1

  • Reader
  • Writer
  • SplitReader (Should be unified with Reader, using split_size: Option<u64>)
  • SplitWriter (Same)

OSCAR v1

  • Reader
  • Writer
  • SplitReader
  • SplitWriter