Description
I'm trying to optimize a dataset from HuggingFace using LitData for LLM pretraining. The code attempts to tokenize text data and create optimized chunks, but I'm encountering issues with the process.
Current Code
from pathlib import Path
from litdata import optimize, StreamingDataset, StreamingDataLoader
from litgpt.litgpt.tokenizer import Tokenizer
from functools import partial
import pyarrow.parquet as pq
from datasets import load_dataset
from litdata.streaming.item_loader import ParquetLoader


def tokenize_fn(text, tokenizer=None):
    # Each batch from the loader arrives as a nested list, so unwrap it
    # before encoding and append an EOS token to each document.
    yield tokenizer.encode(text[0][0], bos=False, eos=True)


if __name__ == "__main__":
    # Stream the parquet shards directly from the HuggingFace Hub.
    hf_dataset = StreamingDataset(
        "hf://datasets/skadooah2/cultura_pretrain/data",
        item_loader=ParquetLoader,
    )
    loader = StreamingDataLoader(hf_dataset, batch_size=1, num_workers=1)

    training_seq_len = 8192
    chunk_size = training_seq_len + 1

    outputs = optimize(
        fn=partial(
            tokenize_fn,
            tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B"),
        ),
        inputs=loader,
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=(chunk_size * 2048),
        reorder_files=True,
        num_workers=32,
    )
When I run this, I get the warning: "File datasets is not a valid chunk file. It will be ignored."
Is there a more best-practice way of doing this? For example, would something like the sketch below be closer to the intended usage?
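Here is a rough sketch of what I have in mind, assuming the parquet shards are downloaded locally (the ./cultura_pretrain/data path is a placeholder), the documents live in a "text" column, and Tokenizer is imported from the installed litgpt package:

from pathlib import Path
from functools import partial

import pyarrow.parquet as pq
from litdata import optimize
from litgpt.tokenizer import Tokenizer


def tokenize_file(filepath, tokenizer=None):
    # Read one parquet shard and yield one token sequence per document.
    # Assumes the raw documents are stored in a "text" column.
    table = pq.read_table(filepath, columns=["text"])
    for text in table["text"].to_pylist():
        yield tokenizer.encode(text, bos=False, eos=True)


if __name__ == "__main__":
    training_seq_len = 8192
    chunk_size = training_seq_len + 1

    # Local copies of the parquet shards (placeholder path).
    files = sorted(Path("./cultura_pretrain/data").rglob("*.parquet"))

    optimize(
        fn=partial(
            tokenize_file,
            tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B"),
        ),
        inputs=[str(f) for f in files],  # one parquet file per work item
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=chunk_size * 2048,    # tokens per optimized chunk
        num_workers=32,
    )

The idea would be to hand optimize() one parquet file per work item instead of wrapping a StreamingDataLoader around the Hub dataset, but I'm not sure whether that is the recommended pattern.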
Thanks!