
Feature Request: Adding Parquet support for tokenized datasets #14442

Open
@lexasub

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add to llama.cpp the ability to load pre-tokenized datasets from Parquet files for model training. The implementation should:

Support reading tokens from a tokens column (type list<int32>)

Handle fragmented (multi-chunk) datasets

Maintain compatibility with the current text pipeline

Motivation

Storage efficiency: Parquet provides up to 75% compression compared to text-based formats

Processing speed: Reading Parquet is 2-5x faster than CSV/text

Compatibility with ML ecosystem: Integration with Pandas, Spark and PyTorch DataLoader

Avoiding repeated tokenization: saves 30-40% of processing time when working with large datasets

Possible Implementation

Add support for reading pre-tokenized data from Parquet files via Apache Arrow C++.

Implementation Requirements

// Command-line interface
// Add to common_params:
std::string dataset_format = "text"; // "text"|"parquet"
std::string parquet_path;
// New data loading method

std::vector<llama_token> load_parquet_dataset(
    const std::string & path,
    const std::string & column_name = "tokens"
) {
    // Implementation of Parquet reading via Apache Arrow (see below)
}
// Integration with the training pipeline

std::vector<llama_token> tokens;
if (params.dataset_format == "parquet") { 
  tokens = load_parquet_dataset(params.parquet_path);
} else { 
  tokens = common_tokenize(ctx.get(), params.prompt, true);
}
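Some validation of the new parameters would likely be needed around this branch. A minimal sketch, assuming the common_params fields introduced above; the helper name and the error-reporting style are illustrative only:

// Sketch: validate the new dataset parameters before loading (names are illustrative)
static bool validate_dataset_params(const common_params & params) {
    if (params.dataset_format != "text" && params.dataset_format != "parquet") {
        fprintf(stderr, "error: unknown dataset format '%s'\n", params.dataset_format.c_str());
        return false;
    }
    if (params.dataset_format == "parquet" && params.parquet_path.empty()) {
        fprintf(stderr, "error: a Parquet dataset path must be provided\n");
        return false;
    }
    return true;
}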

Technical Details

1. Parquet File Format
Mandatory columns:
tokens: list<int32> (tokenized sequences)
seq_length: int32 (sequence length)

Optional columns:

metadata: string (additional information)
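For reference, the expected schema can be written out with Arrow's C++ schema builder. This is only a sketch of the mandatory columns; the optional metadata column is omitted:

#include <arrow/api.h>

// Sketch: expected schema for a pre-tokenized Parquet dataset (mandatory columns only)
std::shared_ptr<arrow::Schema> expected_dataset_schema() {
    return arrow::schema({
        arrow::field("tokens",     arrow::list(arrow::int32())), // tokenized sequences
        arrow::field("seq_length", arrow::int32()),              // sequence length
    });
}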
2. Dependencies
Add to CMakeLists.txt:

find_package(Arrow REQUIRED)
find_package(Parquet REQUIRED)
target_link_libraries(train PRIVATE Arrow::arrow_shared Parquet::parquet_shared)
3. Implementation of reading data
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

std::vector<llama_token> load_parquet_dataset(const std::string & path) {
    arrow::MemoryPool * pool = arrow::default_memory_pool();
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));

    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    // NOTE: only the first chunk of the column is read here; see below for multi-chunk handling
    auto chunks = std::static_pointer_cast<arrow::ListArray>(
        table->GetColumnByName("tokens")->chunk(0)
    );

    std::vector<llama_token> tokens;
    for (int64_t i = 0; i < chunks->length(); ++i) {
        auto values = std::static_pointer_cast<arrow::Int32Array>(
            chunks->value_slice(i)
        );
        for (int64_t j = 0; j < values->length(); ++j) {
            tokens.push_back(static_cast<llama_token>(values->Value(j)));
        }
    }
    return tokens;
}
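Because the feature description also asks for fragmented datasets, the tokens column may arrive as more than one chunk, while the sketch above only reads chunk(0). A minimal extension that walks every chunk of the ChunkedArray could look like this (the helper name is illustrative):

// Sketch: flatten all chunks of the "tokens" column, not just the first one
std::vector<llama_token> flatten_tokens_column(const std::shared_ptr<arrow::ChunkedArray> & column) {
    std::vector<llama_token> tokens;
    for (int c = 0; c < column->num_chunks(); ++c) {
        auto list_array = std::static_pointer_cast<arrow::ListArray>(column->chunk(c));
        for (int64_t i = 0; i < list_array->length(); ++i) {
            auto values = std::static_pointer_cast<arrow::Int32Array>(list_array->value_slice(i));
            for (int64_t j = 0; j < values->length(); ++j) {
                tokens.push_back(static_cast<llama_token>(values->Value(j)));
            }
        }
    }
    return tokens;
}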

Advantages
Memory efficiency: columnar storage with Parquet compression

Loading speed: 2-5x faster than text formats

Compatibility with data ecosystem (Pandas, Spark)

Support for fragmented datasets

Error Handling
File Existence Check

Data Schema Validation

Type mismatch handling:

if (!table->schema()->GetFieldByName("tokens")->type()->Equals(arrow::list(arrow::int32()))) {
    LOG_ERR("invalid 'tokens' column type\n");
}
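A fuller validation pass would also need to guard against the column being missing before checking its type. A minimal sketch, assuming the LOG_ERR macro used above; the helper name is illustrative:

// Sketch: verify the "tokens" column exists and has the expected list<int32> type
static bool validate_tokens_column(const std::shared_ptr<arrow::Table> & table) {
    auto field = table->schema()->GetFieldByName("tokens");
    if (field == nullptr) {
        LOG_ERR("missing 'tokens' column\n");
        return false;
    }
    if (!field->type()->Equals(arrow::list(arrow::int32()))) {
        LOG_ERR("invalid 'tokens' column type: %s\n", field->type()->ToString().c_str());
        return false;
    }
    return true;
}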

Test Plan
Create a test Parquet dataset:

import pyarrow as pa
import pyarrow.parquet as pq

tokens = pa.array([[1023], [2047]], type=pa.list_(pa.int32()))
schema = pa.schema([("tokens", pa.list_(pa.int32()))])
table = pa.Table.from_arrays([tokens], schema=schema)
pq.write_table(table, "test.parquet")

Integration Tests:


#include <cassert>

void test_parquet_loader() {
    // test.parquet is created by the script above and contains [[1023], [2047]]
    auto tokens = load_parquet_dataset("test.parquet");
    assert(tokens.size() == 2);
    assert(tokens[0] == 1023);
    assert(tokens[1] == 2047);
}

Alternatives without dependencies
If adding the Arrow dependency is undesirable, a simple binary format could be implemented instead.
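To illustrate what such a dependency-free format might look like, here is a minimal sketch. The layout (magic number, token count, raw int32 token IDs), the magic value, and the function name are all made up for this example:

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical layout: [uint32 magic][uint64 n_tokens][int32 token_0]...[int32 token_{n-1}]
static const uint32_t TOKEN_BIN_MAGIC = 0x544F4B31; // "TOK1", invented for this sketch

std::vector<int32_t> load_binary_dataset(const std::string & path) {
    FILE * f = std::fopen(path.c_str(), "rb");
    if (f == nullptr) {
        throw std::runtime_error("failed to open " + path);
    }
    uint32_t magic    = 0;
    uint64_t n_tokens = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1 || magic != TOKEN_BIN_MAGIC ||
        std::fread(&n_tokens, sizeof(n_tokens), 1, f) != 1) {
        std::fclose(f);
        throw std::runtime_error("invalid token file: " + path);
    }
    std::vector<int32_t> tokens(n_tokens);
    if (std::fread(tokens.data(), sizeof(int32_t), n_tokens, f) != n_tokens) {
        std::fclose(f);
        throw std::runtime_error("truncated token file: " + path);
    }
    std::fclose(f);
    return tokens;
}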

Recommended Implementation Plan
Implement as an optional feature with the LLAMA_PARQUET flag

Add usage instructions to examples/training/README.md

Incremental data loading support (see the row-group sketch below)
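For the incremental-loading item, Parquet's row groups map naturally onto batched reads. A minimal sketch using parquet::arrow::FileReader::ReadRowGroup, which avoids materializing the whole table at once (assumes the includes shown earlier plus <functional>; the helper name is illustrative):

// Sketch: process a Parquet dataset one row group at a time
void for_each_row_group(parquet::arrow::FileReader & reader,
                        const std::function<void(const std::shared_ptr<arrow::Table> &)> & on_batch) {
    for (int rg = 0; rg < reader.num_row_groups(); ++rg) {
        std::shared_ptr<arrow::Table> table;
        PARQUET_THROW_NOT_OK(reader.ReadRowGroup(rg, &table)); // reads only this row group
        on_batch(table);
    }
}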

This solution will allow efficient use of pre-tokenized datasets without changing the existing textual pipeline.
