@hiyouga (Owner) commented Oct 9, 2025

What does this PR do?

Fixes # (issue)

Before submitting

Summary of Changes

Hello @hiyouga, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a significant architectural change by implementing a plugin-based system for data management. The core DataEngine has been redesigned to leverage these new plugins, which handle responsibilities such as loading datasets from various sources, adjusting data indices based on specified sizes or weights, and efficiently retrieving data samples. This refactoring aims to enhance the modularity, extensibility, and maintainability of the data processing pipeline, making it easier to support new data formats and processing strategies in the future.
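As a rough illustration of "adjusting data indices based on specified sizes or weights", the sketch below resizes a list of (dataset name, sample index) pairs. The function name `adjust_index`, its parameters, and the cyclic-repeat strategy are all assumptions made for this example; the PR's actual DataIndexPlugin may behave differently.

```python
from typing import Optional

# Hypothetical helper: not taken from the PR, only a sketch of the idea.
IndexEntry = tuple[str, int]  # (dataset_name, sample_index)


def adjust_index(
    data_index: list[IndexEntry],
    size: Optional[int] = None,
    weight: Optional[float] = None,
) -> list[IndexEntry]:
    """Return a copy of data_index resized to `size` entries or scaled by `weight`."""
    if size is not None and weight is not None:
        raise ValueError("Specify at most one of `size` and `weight`.")

    if size is None and weight is None:
        return list(data_index)  # nothing to adjust

    # Derive the target length from whichever option was given.
    target = size if size is not None else round(len(data_index) * weight)
    if target < 0:
        raise ValueError("The target length must be non-negative.")

    if not data_index:
        return []

    # Truncate when shrinking; cycle through the entries again when growing.
    n = len(data_index)
    return [data_index[i % n] for i in range(target)]
```

With three entries, `size=2` keeps the first two, while `size=5` or `weight=2.0` repeats entries from the start until the target length is reached.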

Highlights

  • New Data Plugin Architecture: Introduced a plugin-based system for data handling, including DataLoaderPlugin, DataIndexPlugin, and DataGetItemPlugin, to modularize data loading, indexing, and retrieval logic.
  • Refactored DataEngine: The DataEngine class has been significantly refactored to integrate and utilize the new data plugins, streamlining its internal logic and removing the DatasetPathMixin.
  • Enhanced DataCollator: The DataCollator now accepts dataset_info in its constructor and includes a __call__ method, allowing for more context-aware data batching.
  • Updated DatasetInfo Structure: The DatasetInfo TypedDict has been updated with clearer descriptions for its fields, and num_samples was renamed to size.
  • Improved Type Hinting: Type hints in extras/types.py have been refined to use direct module imports for datasets, torch, and transformers components, and a Tensor type was added.
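
The division of labor described in the highlights above can be sketched roughly as follows. The class names DataLoaderPlugin, DataIndexPlugin, DataGetItemPlugin, DataEngine, and DatasetInfo, the `size` field, and the `get_data` method name come from this conversation; every other method name, field, and signature here (including the `source` key and the in-memory "loader") is an assumption for illustration, not the PR's actual code.

```python
from typing import Any, Optional, TypedDict

Sample = dict[str, Any]
IndexEntry = tuple[str, int]  # (dataset_name, sample_index)


class DatasetInfo(TypedDict, total=False):
    source: str  # hypothetical key naming where the data comes from
    size: int    # desired number of samples (called num_samples before this PR)


class DataLoaderPlugin:
    """Loads raw samples; an in-memory dict stands in for files or hubs."""

    def __init__(self, sources: dict[str, list[Sample]]) -> None:
        self.sources = sources

    def load(self, info: DatasetInfo) -> list[Sample]:
        return list(self.sources[info["source"]])


class DataIndexPlugin:
    """Resizes an index by truncating it or cycling through it again."""

    def adjust(self, index: list[IndexEntry], size: int) -> list[IndexEntry]:
        if not index:
            return []
        return [index[i % len(index)] for i in range(size)]


class DataGetItemPlugin:
    """Resolves one index entry to the sample it points at."""

    def get_data(self, datasets: dict[str, list[Sample]], entry: IndexEntry) -> Sample:
        name, sample_index = entry
        return datasets[name][sample_index]


class DataEngine:
    """Composes the three plugins instead of implementing each step itself."""

    def __init__(
        self,
        dataset_info: dict[str, DatasetInfo],
        loader: DataLoaderPlugin,
        indexer: Optional[DataIndexPlugin] = None,
        getter: Optional[DataGetItemPlugin] = None,
    ) -> None:
        self.indexer = indexer or DataIndexPlugin()
        self.getter = getter or DataGetItemPlugin()
        self.datasets: dict[str, list[Sample]] = {}
        self.data_index: list[IndexEntry] = []
        for name, info in dataset_info.items():
            self.datasets[name] = loader.load(info)
            index = [(name, i) for i in range(len(self.datasets[name]))]
            if "size" in info:
                index = self.indexer.adjust(index, info["size"])
            self.data_index.extend(index)

    def __len__(self) -> int:
        return len(self.data_index)

    def __getitem__(self, i: int) -> Sample:
        return self.getter.get_data(self.datasets, self.data_index[i])
```

In this sketch, supporting a new data format would mean passing a different loader object to DataEngine rather than editing the engine, which is the kind of extensibility the summary attributes to the plugin design.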
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a plugin-based architecture for the data engine, which is a positive step towards modularity and extensibility. It refactors the data loading logic into a DataLoaderPlugin and adds new plugins for data indexing and item retrieval. However, the current implementation has several critical issues that will lead to runtime errors, including missing implementations in DataIndexPlugin, incorrect instantiation of DataGetItemPlugin, and a bug in its get_data method. Additionally, DataCollator is not fully implemented. These issues need to be addressed before this can be merged.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@hiyouga merged commit 9687b71 into main on Oct 9, 2025
17 checks passed
@hiyouga deleted the hiyouga/data_plugin branch on October 9, 2025 14:39
@hiyouga added the "solved" label (This problem has been already solved) on Oct 9, 2025