Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Add to llama.cpp the ability to load pre-tokenized datasets from Parquet files for model training. The implementation should:
Support reading tokens from a "tokens" column (type list<int32>)
Handle fragmented (multi-file) datasets
Maintain compatibility with the current text pipeline
Motivation
Storage efficiency: Parquet provides up to 75% compression compared to text-based formats
Processing speed: Reading Parquet is 2-5x faster than CSV/text
Compatibility with ML ecosystem: Integration with Pandas, Spark and PyTorch DataLoader
Avoiding repeated tokenization: saves 30-40% of time when working with large datasets
Possible Implementation
Add support for reading pre-tokenized data from Parquet files via Apache Arrow C++.
Implementation Requirements
// Command-line interface
// Add to common_params:
std::string dataset_format = "text"; // "text" | "parquet"
std::string parquet_path;            // path to a Parquet file with pre-tokenized data
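// A possible command-line surface for these two fields (illustrative only:
// the flag names below are assumptions, and a real patch would register them
// through the common argument parser rather than hard-coding them):
//   --dataset-format text|parquet   select the dataset loader (default: text)
//   --parquet PATH                  path to a pre-tokenized Parquet file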
// New data loading method
std::vector<llama_token> load_parquet_dataset(
    const std::string & path,
    const std::string & column_name = "tokens"
) {
    // read tokens from a Parquet file via Apache Arrow (full implementation below)
}
// Integration with the training pipeline
std::vector<llama_token> tokens;
if (params.dataset_format == "parquet") {
    tokens = load_parquet_dataset(params.parquet_path);
} else {
    tokens = common_tokenize(ctx.get(), params.prompt, true);
}
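Since the token ids would now come from an external file rather than from the model's own tokenizer, a small sanity check before training could be added at this point. A minimal sketch (the helper name is hypothetical; n_vocab would be queried from the loaded model):
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for the typedef from llama.h, to keep the sketch self-contained

// reject datasets containing ids outside the model's vocabulary
static bool tokens_in_vocab(const std::vector<llama_token> & tokens, int32_t n_vocab) {
    for (const llama_token t : tokens) {
        if (t < 0 || t >= n_vocab) {
            return false;
        }
    }
    return true;
}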
Technical Details
1. Parquet file format
Mandatory columns:
- tokens: list<int32> (tokenized sequences)
- seq_length: int32 (sequence length)
Optional columns:
- metadata: string (additional information)
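For illustration, the same schema expressed with the Arrow C++ API (a sketch; the helper name is hypothetical):
#include <arrow/api.h>

static std::shared_ptr<arrow::Schema> expected_dataset_schema() {
    return arrow::schema({
        arrow::field("tokens",     arrow::list(arrow::int32()), /*nullable=*/false), // mandatory
        arrow::field("seq_length", arrow::int32(),              /*nullable=*/false), // mandatory
        arrow::field("metadata",   arrow::utf8()),                                   // optional
    });
}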
2. Dependencies
Add to CMakeLists.txt:
find_package(Arrow REQUIRED)
find_package(Parquet REQUIRED)
target_link_libraries(train PRIVATE Arrow::arrow_shared Parquet::parquet_shared)
3. Data reading implementation
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

std::vector<llama_token> load_parquet_dataset(const std::string & path) {
    arrow::MemoryPool * pool = arrow::default_memory_pool();

    // open the file and create a Parquet reader
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));

    // read the whole file into an Arrow table
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    // the "tokens" column may be split into several chunks - flatten all of them
    std::vector<llama_token> tokens;
    const auto column = table->GetColumnByName("tokens");
    for (int c = 0; c < column->num_chunks(); ++c) {
        const auto chunk = std::static_pointer_cast<arrow::ListArray>(column->chunk(c));
        for (int64_t i = 0; i < chunk->length(); ++i) {
            const auto values = std::static_pointer_cast<arrow::Int32Array>(chunk->value_slice(i));
            for (int64_t j = 0; j < values->length(); ++j) {
                tokens.push_back(static_cast<llama_token>(values->Value(j)));
            }
        }
    }
    return tokens;
}
Advantages
Memory efficiency: columnar storage with built-in compression
Loading speed: 2-5x faster than text formats
Compatibility with the data ecosystem (Pandas, Spark)
Support for fragmented datasets
Error Handling
File existence check
Data schema validation
Type mismatch handling, for example:
const auto tokens_field = table->schema()->GetFieldByName("tokens");
if (tokens_field == nullptr || !tokens_field->type()->Equals(arrow::list(arrow::int32()))) {
    LOG_ERR("missing or invalid 'tokens' column (expected list<int32>)\n");
}
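Putting the three checks together, a validation helper might look like the following sketch (the function name and boolean return convention are assumptions, not existing llama.cpp code):
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

static bool validate_parquet_dataset(const std::string & path) {
    // 1. file existence / readability
    auto maybe_file = arrow::io::ReadableFile::Open(path);
    if (!maybe_file.ok()) {
        LOG_ERR("cannot open '%s': %s\n", path.c_str(), maybe_file.status().ToString().c_str());
        return false;
    }

    // 2. the file must be a valid Parquet file with a readable schema
    std::unique_ptr<parquet::arrow::FileReader> reader;
    auto status = parquet::arrow::OpenFile(maybe_file.ValueOrDie(), arrow::default_memory_pool(), &reader);
    if (!status.ok()) {
        LOG_ERR("'%s' is not a valid Parquet file: %s\n", path.c_str(), status.ToString().c_str());
        return false;
    }
    std::shared_ptr<arrow::Schema> schema;
    if (!reader->GetSchema(&schema).ok()) {
        LOG_ERR("failed to read schema from '%s'\n", path.c_str());
        return false;
    }

    // 3. mandatory column presence and type
    const auto tokens_field = schema->GetFieldByName("tokens");
    if (tokens_field == nullptr || !tokens_field->type()->Equals(arrow::list(arrow::int32()))) {
        LOG_ERR("'%s': missing 'tokens' column or wrong type (expected list<int32>)\n", path.c_str());
        return false;
    }
    return true;
}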
Test Plan
Create a test Parquet dataset:
import pyarrow as pa
import pyarrow.parquet as pq

tokens = pa.array([[1023], [2047]], type=pa.list_(pa.int32()))
schema = pa.schema([("tokens", pa.list_(pa.int32()))])
table = pa.Table.from_arrays([tokens], schema=schema)
pq.write_table(table, "test.parquet")
Integration Tests:
void test_parquet_loader() {
    auto tokens = load_parquet_dataset("test.parquet");
    assert(tokens.size() == 2); // two single-token sequences, flattened
    assert(tokens[0] == 1023);
    assert(tokens[1] == 2047);
}
Alternatives without dependencies
If adding Arrow as a dependency is undesirable, a simple custom binary format could be implemented instead (see the sketch below).
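Purely illustrative sketch of such a format: a small header (magic + token count) followed by raw int32 token ids. The magic value, file layout and function name are assumptions, not an existing llama.cpp format.
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

typedef int32_t llama_token; // stand-in for the typedef from llama.h, to keep the sketch self-contained

static std::vector<llama_token> load_binary_dataset(const std::string & path) {
    FILE * f = std::fopen(path.c_str(), "rb");
    if (f == nullptr) {
        throw std::runtime_error("failed to open " + path);
    }

    uint32_t magic = 0;
    uint64_t count = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1 || magic != 0x544F4B31u || // "TOK1"
        std::fread(&count, sizeof(count), 1, f) != 1) {
        std::fclose(f);
        throw std::runtime_error("invalid dataset header in " + path);
    }

    std::vector<llama_token> tokens(count);
    const size_t n_read = std::fread(tokens.data(), sizeof(llama_token), count, f);
    std::fclose(f);
    tokens.resize(n_read); // tolerate a truncated file

    return tokens;
}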
Recommended Implementation Plan
Implement this as an optional feature behind a LLAMA_PARQUET build flag
Add usage instructions to examples/training/README.md
Support incremental data loading (see the row-group sketch below)
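Incremental loading could be built on Parquet row groups, so that a large dataset never has to be materialized as a single table. A minimal sketch under that assumption (the function name and callback interface are illustrative, not existing llama.cpp code):
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <functional>

static void for_each_parquet_row_group(
    const std::string & path,
    const std::function<void(const std::shared_ptr<arrow::Table> &)> & on_batch
) {
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

    // each Parquet row group can be materialized and consumed independently,
    // which bounds peak memory by the row-group size instead of the file size
    for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
        std::shared_ptr<arrow::Table> batch;
        PARQUET_THROW_NOT_OK(reader->ReadRowGroup(rg, &batch));
        on_batch(batch);
    }
}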
This solution would allow efficient use of pre-tokenized datasets without changing the existing text pipeline.