Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Add to llama.cpp the ability to load pre-tokenized datasets from Parquet files for model training. The implementation should:
Support reading tokens from a "tokens" column (type list<int32>)
Handle fragmented (multi-file) datasets
Maintain compatibility with the current text pipeline
Motivation
Storage efficiency: Parquet provides up to 75% compression compared to text-based formats
Processing speed: Reading Parquet is 2-5x faster than CSV/text
Compatibility with ML ecosystem: Integration with Pandas, Spark and PyTorch DataLoader
Avoiding repeated tokenization: saves 30-40% of time when working with large datasets
Possible Implementation
Add support for reading pre-tokenized data from Parquet files via Apache Arrow C++.
Implementation Requirements
// Command-line interface
// Add to common_params:
std::string dataset_format = "text"; // "text" | "parquet"
std::string parquet_path;            // path to a Parquet file with pre-tokenized data
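// A possible command-line surface for these two fields (illustrative only:
// the flag names below are assumptions, and a real patch would register them
// through the common argument parser rather than hard-coding them):
//   --dataset-format text|parquet   select the dataset loader (default: text)
//   --parquet PATH                  path to a pre-tokenized Parquet file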
// New data loading method
std::vector<llama_token> load_parquet_dataset(
    const std::string & path,
    const std::string & column_name = "tokens"
) {
    // read tokens from a Parquet file via Apache Arrow (full implementation below)
}
// Integration with the training pipeline
std::vector<llama_token> tokens;
if (params.dataset_format == "parquet") {
    tokens = load_parquet_dataset(params.parquet_path);
} else {
    tokens = common_tokenize(ctx.get(), params.prompt, true);
}
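Since the token ids would now come from an external file rather than from the model's own tokenizer, a small sanity check before training could be added at this point. A minimal sketch (the helper name is hypothetical; n_vocab would be queried from the loaded model):
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for the typedef from llama.h, to keep the sketch self-contained

// reject datasets containing ids outside the model's vocabulary
static bool tokens_in_vocab(const std::vector<llama_token> & tokens, int32_t n_vocab) {
    for (const llama_token t : tokens) {
        if (t < 0 || t >= n_vocab) {
            return false;
        }
    }
    return true;
}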
Technical Details
1. Parquet file format
Mandatory columns:
- tokens: list<int32> (tokenized sequences)
- seq_length: int32 (sequence length)
Optional columns:
- metadata: string (additional information)
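For illustration, the same schema expressed with the Arrow C++ API (a sketch; the helper name is hypothetical):
#include <arrow/api.h>

static std::shared_ptr<arrow::Schema> expected_dataset_schema() {
    return arrow::schema({
        arrow::field("tokens",     arrow::list(arrow::int32()), /*nullable=*/false), // mandatory
        arrow::field("seq_length", arrow::int32(),              /*nullable=*/false), // mandatory
        arrow::field("metadata",   arrow::utf8()),                                   // optional
    });
}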
2. Dependencies
Add to CMakeLists.txt:
find_package(Arrow REQUIRED)
find_package(Parquet REQUIRED)
target_link_libraries(train PRIVATE Arrow::arrow_shared Parquet::parquet_shared)
3. Data reading implementation
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

std::vector<llama_token> load_parquet_dataset(const std::string & path) {
    arrow::MemoryPool * pool = arrow::default_memory_pool();

    // open the file and create a Parquet reader
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));

    // read the whole file into an Arrow table
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    // the "tokens" column may be split into several chunks - flatten all of them
    std::vector<llama_token> tokens;
    const auto column = table->GetColumnByName("tokens");
    for (int c = 0; c < column->num_chunks(); ++c) {
        const auto chunk = std::static_pointer_cast<arrow::ListArray>(column->chunk(c));
        for (int64_t i = 0; i < chunk->length(); ++i) {
            const auto values = std::static_pointer_cast<arrow::Int32Array>(chunk->value_slice(i));
            for (int64_t j = 0; j < values->length(); ++j) {
                tokens.push_back(static_cast<llama_token>(values->Value(j)));
            }
        }
    }
    return tokens;
}
Advantages
Memory efficiency: columnar storage with built-in compression
Loading speed: 2-5x faster than text formats
Compatibility with the data ecosystem (Pandas, Spark)
Support for fragmented datasets
Error Handling
File existence check
Data schema validation
Type mismatch handling, for example:
const auto tokens_field = table->schema()->GetFieldByName("tokens");
if (tokens_field == nullptr || !tokens_field->type()->Equals(arrow::list(arrow::int32()))) {
    LOG_ERR("missing or invalid 'tokens' column (expected list<int32>)\n");
}
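Putting the three checks together, a validation helper might look like the following sketch (the function name and boolean return convention are assumptions, not existing llama.cpp code):
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

static bool validate_parquet_dataset(const std::string & path) {
    // 1. file existence / readability
    auto maybe_file = arrow::io::ReadableFile::Open(path);
    if (!maybe_file.ok()) {
        LOG_ERR("cannot open '%s': %s\n", path.c_str(), maybe_file.status().ToString().c_str());
        return false;
    }

    // 2. the file must be a valid Parquet file with a readable schema
    std::unique_ptr<parquet::arrow::FileReader> reader;
    auto status = parquet::arrow::OpenFile(maybe_file.ValueOrDie(), arrow::default_memory_pool(), &reader);
    if (!status.ok()) {
        LOG_ERR("'%s' is not a valid Parquet file: %s\n", path.c_str(), status.ToString().c_str());
        return false;
    }
    std::shared_ptr<arrow::Schema> schema;
    if (!reader->GetSchema(&schema).ok()) {
        LOG_ERR("failed to read schema from '%s'\n", path.c_str());
        return false;
    }

    // 3. mandatory column presence and type
    const auto tokens_field = schema->GetFieldByName("tokens");
    if (tokens_field == nullptr || !tokens_field->type()->Equals(arrow::list(arrow::int32()))) {
        LOG_ERR("'%s': missing 'tokens' column or wrong type (expected list<int32>)\n", path.c_str());
        return false;
    }
    return true;
}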
Test Plan
Create a test Parquet dataset:
import pyarrow as pa
import pyarrow.parquet as pq

tokens = pa.array([[1023], [2047]], type=pa.list_(pa.int32()))
schema = pa.schema([("tokens", pa.list_(pa.int32()))])
table = pa.Table.from_arrays([tokens], schema=schema)
pq.write_table(table, "test.parquet")
Integration Tests:
void test_parquet_loader() {
    auto tokens = load_parquet_dataset("test.parquet");
    assert(tokens.size() == 2); // two single-token sequences, flattened
    assert(tokens[0] == 1023);
    assert(tokens[1] == 2047);
}
Alternatives without dependencies
If adding Arrow as a dependency is undesirable, a simple custom binary format could be implemented instead (see the sketch below).
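Purely illustrative sketch of such a format: a small header (magic + token count) followed by raw int32 token ids. The magic value, file layout and function name are assumptions, not an existing llama.cpp format.
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

typedef int32_t llama_token; // stand-in for the typedef from llama.h, to keep the sketch self-contained

static std::vector<llama_token> load_binary_dataset(const std::string & path) {
    FILE * f = std::fopen(path.c_str(), "rb");
    if (f == nullptr) {
        throw std::runtime_error("failed to open " + path);
    }

    uint32_t magic = 0;
    uint64_t count = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1 || magic != 0x544F4B31u || // "TOK1"
        std::fread(&count, sizeof(count), 1, f) != 1) {
        std::fclose(f);
        throw std::runtime_error("invalid dataset header in " + path);
    }

    std::vector<llama_token> tokens(count);
    const size_t n_read = std::fread(tokens.data(), sizeof(llama_token), count, f);
    std::fclose(f);
    tokens.resize(n_read); // tolerate a truncated file

    return tokens;
}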
Recommended Implementation Plan
Implement this as an optional feature behind a LLAMA_PARQUET build flag
Add usage instructions to examples/training/README.md
Support incremental data loading (see the row-group sketch below)
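Incremental loading could be built on Parquet row groups, so that a large dataset never has to be materialized as a single table. A minimal sketch under that assumption (the function name and callback interface are illustrative, not existing llama.cpp code):
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <functional>

static void for_each_parquet_row_group(
    const std::string & path,
    const std::function<void(const std::shared_ptr<arrow::Table> &)> & on_batch
) {
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

    // each Parquet row group can be materialized and consumed independently,
    // which bounds peak memory by the row-group size instead of the file size
    for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
        std::shared_ptr<arrow::Table> batch;
        PARQUET_THROW_NOT_OK(reader->ReadRowGroup(rg, &batch));
        on_batch(batch);
    }
}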
This solution would allow efficient use of pre-tokenized datasets without changing the existing text pipeline.