
Conversation

@AbdulmalikDS

What does this PR do?

Adds support for pre-tokenized datasets in Parquet format and skips the tokenization step if it has already been done.

Motivation

I needed to use Parquet files with pre-tokenized data, but LLaMA-Factory didn't support the format. This adds Parquet support and skips tokenization when data is already tokenized.

Changes

New files:

  • src/llamafactory/data/tokenized_parquet.py - Parquet loader
  • src/llamafactory/data/collator_tokenized.py - Data collator for pre-tokenized samples

Usage Example

dataset_format: tokenized_ids
data_files:
  - /path/to/chunk_000.parquet
  - /path/to/chunk_001.parquet

Parquet files should have input_ids (required) and attention_mask (optional) columns.
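
For reference, here is a minimal sketch (not part of this PR) of writing a file in this layout with pyarrow; the token IDs are illustrative placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One row per pre-tokenized sample; pyarrow infers list<int64> columns.
table = pa.table(
    {
        "input_ids": [[1, 15043, 3186, 2], [1, 894, 2]],  # required
        "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],      # optional
    }
)
pq.write_table(table, "chunk_000.parquet")
```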

Before submitting

  • Did you read the contributor guideline?
  • Did you run make style && make quality?
  • Did you write any new necessary tests?

@gemini-code-assist
Contributor

Summary of Changes

Hello @AbdulmalikDS, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances LLaMA-Factory's data loading capabilities by introducing native support for pre-tokenized datasets stored in Parquet files. This allows users to directly feed already processed token IDs into the training pipeline, streamlining workflows and potentially reducing data preparation overhead by avoiding redundant tokenization.

Highlights

  • Parquet Dataset Support: Introduces the ability to load and process datasets stored in Parquet format, specifically designed for pre-tokenized data.
  • Skip Tokenization: Implements a mechanism to bypass the tokenization step during training when the input data is already pre-tokenized, improving efficiency.
  • Dedicated Data Collator: Adds a new TokenizedIdsCollator to handle batching and padding for pre-tokenized input_ids and attention_mask from Parquet files.
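
As an illustration of what such a collator can look like, here is a minimal sketch based on the behavior described later in this thread (it inherits from DataCollatorForSeq2Seq, pads to the batch maximum, and derives labels from input_ids); names and details are assumptions, not the PR's actual code:

```python
# A minimal sketch of a pre-tokenized collator; assumed details, not the PR's code.
from dataclasses import dataclass
from typing import Any

import torch
from transformers import DataCollatorForSeq2Seq

IGNORE_INDEX = -100  # conventional label value ignored by HF loss functions


@dataclass
class TokenizedIdsCollator(DataCollatorForSeq2Seq):
    def __call__(self, features: list[dict[str, Any]], return_tensors=None) -> dict[str, torch.Tensor]:
        # Resolve the padding token from the tokenizer, falling back to the model config.
        pad_id = self.tokenizer.pad_token_id
        if pad_id is None and self.model is not None:
            pad_id = self.model.config.pad_token_id

        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            mask = list(f.get("attention_mask") or [1] * len(ids))
            pad = max_len - len(ids)
            input_ids.append(ids + [pad_id] * pad)
            attention_mask.append(mask + [0] * pad)
            labels.append(ids + [IGNORE_INDEX] * pad)  # labels mirror input_ids

        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }
```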
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist[bot] (Contributor) left a comment


Code Review

This pull request introduces a valuable feature to support pre-tokenized datasets in Parquet format, which can significantly speed up data loading by skipping the tokenization step. The implementation adds a new data loader for Parquet files and a corresponding data collator. The changes are well-structured.

However, I've identified a few areas for improvement:

  • There is a critical issue with class inheritance in the new TokenizedIdsCollator that will lead to a runtime error.
  • The Parquet file loader is currently limited to local files and should be extended to support cloud storage paths.
  • The current implementation does not support evaluation datasets for the pre-tokenized format, which is a significant limitation.
  • The new functionality lacks unit tests, which are important for ensuring correctness and preventing future regressions.

Detailed comments are provided below.

@AbdulmalikDS
Author

/gemini review

gemini-code-assist[bot] (Contributor) left a comment


Code Review

This pull request introduces a valuable feature by adding support for pre-tokenized datasets in Parquet format, which can significantly speed up data loading for large-scale training. The implementation is well-structured, with new modules for loading parquet files and a dedicated data collator. My review includes a critical fix for a missing import that would prevent the code from running, along with several suggestions to improve performance, code clarity, and maintainability. Overall, this is a great addition, and with these changes, it will be even more robust and efficient.

AbdulmalikDS and others added 4 commits October 26, 2025 16:17
@AbdulmalikDS
Author

/gemini review

gemini-code-assist[bot] (Contributor) left a comment


Code Review

This pull request introduces support for pre-tokenized datasets in Parquet format, which is a valuable addition for users who want to skip the tokenization step. The implementation includes a new Parquet loader, a data collator for pre-tokenized samples, and updates to the data loading logic and arguments. The code is well-structured, but I have identified a few areas for improvement, mainly concerning efficiency and robustness. My suggestions aim to make the code more concise and performant.

AbdulmalikDS and others added 3 commits October 29, 2025 13:45
@AbdulmalikDS
Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly enhances LLaMA-Factory's data loading capabilities by introducing native support for pre-tokenized datasets stored in Parquet files. This allows users to directly feed already processed token IDs and attention masks into the training pipeline, streamlining workflows and potentially reducing data preparation overhead by avoiding redundant tokenization. The changes include new modules for loading Parquet data and a specialized data collator, along with new configuration options.

Highlights

  • Pre-tokenized Parquet Dataset Support: Introduces the ability to load and process datasets stored in Parquet format, specifically designed for data that has already undergone tokenization.
  • Skip Tokenization Step: Implements a mechanism to bypass the tokenization step during training when the input data is already pre-tokenized, improving efficiency and reducing data preparation overhead.
  • Dedicated Data Collator: Adds a new TokenizedIdsCollator to handle batching and padding for pre-tokenized input_ids and attention_mask from Parquet files, inheriting from transformers.DataCollatorForSeq2Seq.
  • New Data Arguments: Adds dataset_format, data_files, and dataset_columns arguments to DataArguments to configure the loading of pre-tokenized Parquet datasets.
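
Taken together, a hypothetical training config using these new arguments might look like this (paths and the remapped column names are illustrative; the dataset_columns keys follow the example in the changelog below):

```yaml
dataset_format: tokenized_ids
data_files:
  - /path/to/chunk_000.parquet
  - /path/to/chunk_001.parquet
dataset_columns:   # optional; only needed if your columns use non-default names
  ids: token_ids
  mask: attn_mask
```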
Changelog
  • src/llamafactory/data/collator_tokenized.py
    • Added TokenizedIdsCollator for handling pre-tokenized data, inheriting from DataCollatorForSeq2Seq.
    • Implements padding for input_ids and attention_mask to the maximum length within a batch.
    • Generates labels by copying input_ids and padding with IGNORE_INDEX.
    • Includes logic to resolve pad_token_id from the tokenizer or model configuration.
  • src/llamafactory/data/loader.py
    • Imported TokenizedIdsCollator and load_tokenized_parquet_dataset.
    • Added a conditional check in _get_preprocessed_dataset to bypass tokenization if dataset_format is tokenized_ids.
    • Modified get_dataset to conditionally load data using load_tokenized_parquet_dataset when dataset_format is tokenized_ids, requiring data_files to be provided.
    • Assigns TokenizedIdsCollator as the data_collator for datasets using the tokenized_ids format.
  • src/llamafactory/data/tokenized_parquet.py (a loader sketch follows this changelog)
    • Added _iter_parquet_rows to stream pre-tokenized samples from multiple Parquet files.
    • Implemented load_tokenized_parquet_dataset to create an IterableDataset from Parquet files.
    • Handles input_ids (required) and attention_mask (optional) columns from Parquet files.
    • Includes error handling for FileNotFoundError when accessing Parquet files and logs warnings.
    • Uses pyarrow.parquet.ParquetFile as a context manager for cleaner file handling.
  • src/llamafactory/hparams/data_args.py
    • Added dataset_format argument, allowing 'default' or 'tokenized_ids' (default 'default').
    • Added data_files argument to specify paths for datasets using the tokenized_ids format.
    • Added dataset_columns argument for custom column name mapping (e.g., {'ids': 'token_ids', 'mask': 'attn_mask'}).
    • Added validation in __post_init__ to ensure data_files is specified when dataset_format is tokenized_ids.
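
The tokenized_parquet.py entry above describes a streaming loader; here is a minimal sketch, assuming pyarrow and Hugging Face datasets, of a generator-based loader in that shape (assumed names and details, not the PR's exact code):

```python
from typing import Any, Iterator

import pyarrow.parquet as pq
from datasets import IterableDataset


def _iter_parquet_rows(paths: list[str], batch_size: int = 1024) -> Iterator[dict[str, Any]]:
    """Stream pre-tokenized samples from multiple Parquet files."""
    for path in paths:
        try:
            # Recent pyarrow versions support ParquetFile as a context manager.
            with pq.ParquetFile(path) as parquet_file:
                for batch in parquet_file.iter_batches(batch_size=batch_size):
                    for row in batch.to_pylist():
                        sample = {"input_ids": row["input_ids"]}  # required column
                        if row.get("attention_mask") is not None:  # optional column
                            sample["attention_mask"] = row["attention_mask"]
                        yield sample
        except FileNotFoundError:
            print(f"warning: skipping missing parquet file: {path}")


def load_tokenized_parquet_dataset(paths: list[str]) -> IterableDataset:
    """Wrap the row generator in a lazily evaluated IterableDataset."""
    return IterableDataset.from_generator(_iter_parquet_rows, gen_kwargs={"paths": paths})
```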
Activity
  • On Oct 25, gemini-code-assist[bot] provided an initial summary of the PR (Comment 1).
  • On Oct 25, gemini-code-assist[bot] posted several review comments, including a critical one about TokenizedIdsCollator inheriting from the wrong base class (resolved in current patch) and suggestions for evaluation dataset support, unit tests, and cloud path handling for Parquet files.
  • On Oct 25, AbdulmalikDS acknowledged the missing unit tests and stated willingness to add them if required, and noted that the Parquet file handling follows existing LLaMA-Factory patterns.
  • On Oct 26, AbdulmalikDS requested a new review from gemini-code-assist[bot] (Comment 2).
  • On Oct 26, gemini-code-assist[bot] provided further review comments, including a critical one about an unused import (resolved in current patch) and suggestions for optimizing tensor creation, using pq.ParquetFile as a context manager (implemented), simplifying IterableDataset.from_generator (implemented), and broadening the data_files type hint.
  • On Oct 29, AbdulmalikDS requested another review from gemini-code-assist[bot] (Comment 3).
  • On Oct 29, gemini-code-assist[bot] provided additional review comments, suggesting simplification of labels creation (implemented), adding try...except FileNotFoundError (implemented), simplifying dictionary construction in _iter_parquet_rows, and using Any for data_files type hint (implemented).
  • On Nov 3, AbdulmalikDS requested a summary of the PR (Comment 4).

@hiyouga added the "pending" label (This problem is yet to be addressed) on Nov 4, 2025