
How to optimize a dataset for pretraining from HuggingFace #482

Open
@TheLukaDragar

Description

I'm trying to optimize a dataset from HuggingFace using LitData for LLM pretraining. The code attempts to tokenize text data and create optimized chunks, but I'm encountering issues with the process.

Current Code

from functools import partial

from litdata import optimize, StreamingDataset, StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader
from litgpt.litgpt.tokenizer import Tokenizer

def tokenize_fn(text, tokenizer=None):
    yield tokenizer.encode(text[0][0], bos=False, eos=True)

if __name__ == "__main__":
    hf_dataset = StreamingDataset("hf://datasets/skadooah2/cultura_pretrain/data", 
                                item_loader=ParquetLoader)
    loader = StreamingDataLoader(hf_dataset, batch_size=1, num_workers=1)
    
    training_seq_len = 8192
    chunk_size = training_seq_len + 1
    
    outputs = optimize(
        fn=partial(tokenize_fn, 
                  tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B")),
        inputs=loader,
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=(chunk_size * 2048),
        reorder_files=True,
        num_workers=32
    )
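
For context, the downstream usage I have in mind is to stream the optimized token chunks back as fixed-length blocks for pretraining, roughly like the sketch below (based on my understanding of TokensLoader; the batch size, shuffle setting, and worker count are just placeholders):

from litdata import StreamingDataset, StreamingDataLoader
from litdata.streaming.item_loader import TokensLoader

training_seq_len = 8192

# Stream the optimized chunks back as (training_seq_len + 1)-token blocks.
train_dataset = StreamingDataset(
    "/home/jakob/llara/pretrain2/",
    item_loader=TokensLoader(block_size=training_seq_len + 1),
    shuffle=True,
)
train_loader = StreamingDataLoader(train_dataset, batch_size=8, num_workers=4)

for batch in train_loader:
    # batch has shape (batch_size, training_seq_len + 1)
    ...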

When I run the optimize step above, I get a warning: "File datasets is not a valid chunk file. It will be ignored."
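
I'm not sure whether optimize is meant to consume a StreamingDataLoader at all, or whether it expects a plain list of work items (e.g. parquet file paths) that it can split across workers. For reference, here is a minimal sketch of the file-based variant I was considering instead, assuming the parquet shards are downloaded locally under ./cultura_pretrain/data and that each row has a "text" column (both assumptions on my part):

from functools import partial
from pathlib import Path

import pyarrow.parquet as pq
from litdata import optimize
from litgpt.litgpt.tokenizer import Tokenizer

def tokenize_file(filepath, tokenizer=None):
    # Read one parquet shard and yield one token sequence per document.
    table = pq.read_table(filepath, columns=["text"])
    for text in table["text"].to_pylist():
        yield tokenizer.encode(text, bos=False, eos=True)

if __name__ == "__main__":
    # Assumes the dataset's parquet shards were downloaded beforehand.
    files = sorted(str(p) for p in Path("./cultura_pretrain/data").rglob("*.parquet"))
    training_seq_len = 8192

    optimize(
        fn=partial(tokenize_file,
                   tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B")),
        inputs=files,                              # one parquet file per work item
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=(training_seq_len + 1) * 2048,  # tokens per optimized chunk
        num_workers=32,
    )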

Is there a best-practice way of doing this?
Thanks!

Metadata

Labels: bug (Something isn't working), question (Further information is requested)
