Support of partitioning/checkpointing/event-logging #748
base: main
Conversation
Summary of Changes
Hello @cyruszhang, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly upgrades Data-Juicer's capabilities for large-scale data processing by implementing a comprehensive system for partitioning, checkpointing, and event logging. These changes are designed to enhance the platform's fault tolerance, improve scalability, provide in-depth observability, and offer greater flexibility in managing data processing workflows, especially within the Ray execution environment.
Highlights
- Enhanced Fault Tolerance and Resumption: This PR introduces a robust partitioning and checkpointing system for Data-Juicer, enabling jobs to automatically resume from failures. It prevents the need to re-run entire pipelines by saving intermediate states, significantly improving reliability for large dataset processing.
- Comprehensive Observability with Event Logging: A new, universal event logging system has been integrated, providing real-time monitoring, detailed progress tracking, and comprehensive audit trails. This system logs various events, including job, partition, operation, and checkpoint activities, enhancing debugging and operational visibility across all executor types.
- Optimized Data Handling and Storage: The system now supports configurable intermediate data storage formats (Parquet, Arrow, JSONL) with compression, allowing users to balance performance, storage efficiency, and human readability. It also includes an intelligent auto-configuration system for optimal partition sizing based on data modality and processing complexity.
- Flexible Job Management and Configuration: New configuration options and command-line arguments are added for fine-grained control over partitioning, checkpointing strategies (e.g., `every_op`, `every_partition`, `every_n_ops`), and file retention policies. It also introduces job-specific directories and automatic generation of resumption commands for easier job management (see the sketch after this list).
- New Partitioned Ray Executor: A `ray_partitioned` executor type is introduced, leveraging Ray's distributed capabilities with the new partitioning, checkpointing, and event logging features to provide a highly scalable and fault-tolerant processing solution for large datasets.
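For a rough picture of how these pieces could be wired together, the following Python snippet sketches a configuration for the new executor. The key names (`executor_type`, `checkpoint_strategy`, `storage_format`, and so on) merely echo the features listed above; they are assumptions for illustration, not the PR's exact configuration schema:

```python
# Hypothetical configuration sketch -- key names are assumptions based on the
# features described in this PR, not the actual schema.
config = {
    "executor_type": "ray_partitioned",    # the new executor introduced here
    "checkpoint_strategy": "every_n_ops",  # or "every_op" / "every_partition"
    "checkpoint_n_ops": 3,                 # hypothetical knob for "every_n_ops"
    "storage_format": "parquet",           # or "arrow" / "jsonl" (human-readable)
    "partition_size": "auto",              # let the optimizer pick by modality
    "event_log_dir": "./outputs/events",   # hypothetical event-log location
}
```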
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces a major new feature: a partitioned Ray executor with checkpointing and event logging capabilities. This is a significant step towards making Data-Juicer more robust, scalable, and observable for large-scale data processing. The changes are extensive, including new core components, configuration options, and comprehensive documentation.
The overall design is solid, with a clear separation of concerns for partitioning, checkpointing, and event logging. The use of a mixin for event logging is a good pattern for reusability across different executors. The automatic partition size optimization is a great usability improvement.
However, the review identified several critical and high-severity issues that need to be addressed:
- There is significant code duplication and conflicting implementations for event logging between `PartitionedRayExecutor` and `EventLoggingMixin`.
- A new file `partitioned_executor_base.py` appears to be dead code and should be removed.
- There are a couple of correctness bugs in the event logging and partition size optimizer that could lead to runtime errors or incorrect behavior.
I've provided detailed comments and suggestions for each issue. Addressing these will greatly improve the maintainability, correctness, and clarity of the new implementation. Once these issues are resolved, this will be a fantastic addition to Data-Juicer.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Code Review
This pull request introduces a complete partitioning, checkpointing, and event logging system that greatly strengthens Data-Juicer's fault tolerance, scalability, and observability. The changes are extensive yet well structured, covering new executors, configuration management, and job utilities. In particular, the new `ray_partitioned` executor, together with configurable checkpoint strategies and detailed event logs, lays a solid foundation for enterprise-grade data processing. The new management tools, such as job snapshot analysis and monitoring, also greatly improve the user experience.
However, I found a few problems, most notably a critical issue in the directory resolution logic that could prevent users from using flexible storage paths as intended. In addition, the command-line argument parser used for job resumption is not robust, and one test case contains an error. Addressing these issues will make this excellent contribution even better.
```python
def _parse_cli_to_config(cli_args: list) -> dict:
    """Parse CLI arguments into config dictionary format."""
    config = {}

    i = 0
    while i < len(cli_args):
        arg = cli_args[i]

        if arg.startswith("--"):
            key = arg[2:]  # Remove '--'

            # Check if next arg is a value (not another flag)
            if i + 1 < len(cli_args) and not cli_args[i + 1].startswith("--"):
                value = cli_args[i + 1]

                # Try to parse as different types
                if value.lower() in ["true", "false"]:
                    config[key] = value.lower() == "true"
                elif value.isdigit():
                    config[key] = int(value)
                elif value.replace(".", "").isdigit():
                    config[key] = float(value)
                else:
                    config[key] = value

                i += 2  # Skip both key and value
            else:
                # Boolean flag (no value)
                config[key] = True
                i += 1
        else:
            i += 1

    return config
```
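As an aside, the `isdigit`-based coercion above rejects negative numbers and mis-classifies strings such as `1.2.3` (the dot-stripped digit check passes, then `float()` raises). A more defensive alternative, shown purely as a sketch, attempts the conversions directly:

```python
def _coerce_value(value: str):
    """Best-effort type coercion for CLI values (illustrative sketch only)."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)  # handles negatives and plain floats
        except ValueError:
            pass
    return value  # fall back to the raw string
```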
/gemini review
Code Review
This PR introduces a complete partitioning, checkpointing, and event logging system that greatly strengthens Data-Juicer's fault tolerance, scalability, and observability when processing large datasets. This is a very important feature, and both the design and the implementation are quite thorough. The main highlights include:
- Partitioned executor: the new `PartitionedRayExecutor` can split a dataset into multiple partitions for parallel processing and can handle operators that need global information (such as deduplication).
- Flexible checkpointing: multiple checkpoint strategies are supported (`every_op`, `every_n_ops`, `manual`), with partition-level checkpoints that allow a job to resume from the most recent checkpoint after a failure, greatly reducing recomputation cost.
- Comprehensive event logging and job management: a Spark-style event logging system records detailed events during job execution. On top of it, utilities for job snapshots, monitoring, and stopping are provided, along with a reliable job resumption mechanism.
- Automated resource optimization: a new resource-aware partition optimizer automatically configures optimal partition sizes and worker counts based on data modality and system resources, simplifying user configuration.
The code is well structured, with mixins (`EventLoggingMixin`, `DAGExecutionMixin`) used to decouple and reuse functionality. Detailed documentation, demos, and unit tests have also been added for the new features, which is essential for the project's maintainability.
During the review I found a few issues that need attention, mainly in configuration handling and the core execution logic; see the comments below for details. Overall, this is a high-quality, powerful PR that will significantly improve Data-Juicer's enterprise-grade capabilities.
```python
if self.pipeline_dag:
    self._pre_execute_operations_with_dag_monitoring(group_ops, partition_id=partition_id)
else:
    # Fallback to manual logging without DAG
    for op_idx, op in enumerate(group_ops):
        self._log_event(
            event_type=EventType.OP_START,
            message=f"Starting operation: {op._name}",
            operation_name=op._name,
            operation_idx=start_idx + op_idx,
            partition_id=partition_id,
        )

    # Execute operations
    current_dataset = current_dataset.process(group_ops)

# Post-execute DAG monitoring (log operation completion events)
if self.pipeline_dag:
    self._post_execute_operations_with_dag_monitoring(group_ops, partition_id=partition_id)
else:
    # Fallback to manual logging without DAG
    for op_idx, op in enumerate(group_ops):
        self._log_event(
            event_type=EventType.OP_COMPLETE,
            message=f"Completed operation: {op._name}",
            operation_name=op._name,
            operation_idx=start_idx + op_idx,
            partition_id=partition_id,
        )
```
In the `_process_with_checkpointing` method, the `current_dataset.process(group_ops)` call is mistakenly placed inside the `else` branch of `if self.pipeline_dag:`. This means that when DAG monitoring is enabled (i.e., `self.pipeline_dag` is truthy), the operations in the partition (`group_ops`) are never executed: the code logs the operation-start events but then skips the actual data processing step. This is a critical bug that breaks the executor's core processing logic. The `dataset.process` call should be moved out of the `if/else` block:
```python
# Pre-execute DAG monitoring (log operation start events)
if self.pipeline_dag:
    self._pre_execute_operations_with_dag_monitoring(group_ops, partition_id=partition_id)
else:
    # Fallback to manual logging without DAG
    for op_idx, op in enumerate(group_ops):
        self._log_event(
            event_type=EventType.OP_START,
            message=f"Starting operation: {op._name}",
            operation_name=op._name,
            operation_idx=start_idx + op_idx,
            partition_id=partition_id,
        )

# Execute operations
current_dataset = current_dataset.process(group_ops)

# Post-execute DAG monitoring (log operation completion events)
if self.pipeline_dag:
    self._post_execute_operations_with_dag_monitoring(group_ops, partition_id=partition_id)
else:
    # Fallback to manual logging without DAG
    for op_idx, op in enumerate(group_ops):
        self._log_event(
            event_type=EventType.OP_COMPLETE,
            message=f"Completed operation: {op._name}",
            operation_name=op._name,
            operation_idx=start_idx + op_idx,
            partition_id=partition_id,
        )
```
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
HYLcool left a comment
Please resolve pre-commit issues as well.
```python
if partition_id != 0:
    # Partitioned executor - pass partition_id
    self._log_operation_with_dag_context(op_name, op_idx, "op_start", partition_id=partition_id)
else:
    # Non-partitioned executor
    self._log_operation_with_dag_context(op_name, op_idx, "op_start")
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value of `partition_id` is 0 in the method `_log_operation_with_dag_context`, so maybe this if-else is not necessary?
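If that default holds, both branches collapse into one unconditional call; a minimal sketch, assuming the signature discussed in this comment:

```python
# Sketch: with partition_id defaulting to 0, passing it through unconditionally
# covers both the partitioned and non-partitioned cases.
self._log_operation_with_dag_context(op_name, op_idx, "op_start", partition_id=partition_id)
```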
```python
if partition_id != 0:
    # Partitioned executor - pass partition_id
    self._log_operation_with_dag_context(
        op_name,
        op_idx,
        "op_complete",
        partition_id=partition_id,
        duration=0.0,
        input_rows=0,
        output_rows=0,
    )
else:
    # Non-partitioned executor
    self._log_operation_with_dag_context(
        op_name, op_idx, "op_complete", duration=0.0, input_rows=0, output_rows=0
    )
```
Same as the previous comment: the default value of `partition_id` is 0 in the method `_log_operation_with_dag_context`.
```python
            operation_name=op._name,
            operation_idx=start_idx + op_idx,
            partition_id=partition_id,
        )
```
This if-else might not be correct. Please double-check.
Maybe lines 619-634 should have their indentation reduced.
design doc (internal): https://aliyuque.antfin.com/ah7ri9/zdesop/qw3tm08a5wcqx446
Overview
The Data-Juicer partitioning, checkpointing, and event logging system provides a comprehensive solution for processing large datasets, with fault tolerance, scalability, and full observability.
Design Motivation
Ray already provides some fault tolerance (the actor persistence mechanism and task-level retry logic), and Ray-DLC adds better failure handling and self-healing, but several systemic problems remain:
- Whole-dataset execution: Ray processes the entire dataset as a single unit; if a small part fails, the whole OP stage, or even the whole pipeline, fails.
- No progress recovery: because the pipeline runs as one unit, a failure in any part forces a complete rerun.
- No user-configurable, fine-grained fault tolerance, leaving little flexibility.
- Data persistence and mapping: currently missing; actors could provide an entry point, but the DJ framework does not yet support it.
- Insufficient observability: Ray only exposes cluster state; better visibility into the state of DJ jobs is still missing.
We therefore aim to solve all of these problems with an integrated set of partitioning, checkpointing, and event logging mechanisms.
Key Features
- Fault tolerance: automatic recovery from failures using checkpoints (see the sketch after this list)
- Scalability: partition-based processing for datasets of any size
- Observability: comprehensive event logging and real-time monitoring
- Performance: optimized storage formats and parallel processing
- Flexibility: configurable partitioning and checkpointing strategies
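To make the checkpointing strategies concrete, here is a minimal, self-contained Python sketch of partition-level checkpointing with an `every_n_ops`-style policy. The store and function names are invented for illustration; they are not the PR's actual API:

```python
class InMemoryCheckpointStore:
    """Toy checkpoint store keyed by partition_id (illustration only)."""

    def __init__(self):
        self._store = {}

    def save(self, partition_id, op_idx, data):
        self._store[partition_id] = (op_idx, data)

    def latest(self, partition_id):
        # Returns (last checkpointed op index, data); -1 means no checkpoint yet.
        return self._store.get(partition_id, (-1, None))


def process_partition(partition, ops, ckpt, partition_id, every_n_ops=2):
    """Apply ops to one partition, checkpointing every N ops (sketch)."""
    resume_idx, saved = ckpt.latest(partition_id)
    if resume_idx >= 0:
        partition = saved  # resume from the checkpoint, not the raw input
    for idx, op in enumerate(ops):
        if idx <= resume_idx:
            continue  # already applied before the failure
        partition = op(partition)
        if (idx + 1) % every_n_ops == 0:
            ckpt.save(partition_id, idx, partition)
    return partition
```

Replacing the modulo test with `True` corresponds to `every_op`, while saving only once at the end of the partition approximates `every_partition`.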
DJ will provide the partitioning and checkpointing logic, giving users an explicit hook into fault tolerance, flexible restart options, lineage support at the level of small shards, and more complete, fine-grained job observability, laying a solid foundation for enterprise-grade DJ services. This does not conflict with the fault tolerance and compute scalability that Ray and Ray-DLC focus on.
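Likewise, to illustrate how an event log could back the monitoring and job-snapshot tooling described above, here is a sketch of a consumer that summarizes a JSONL event log. The file location and field names are assumptions for illustration, not the schema this PR actually emits:

```python
import json
from collections import Counter


def summarize_event_log(path: str) -> Counter:
    """Tally event types from a JSONL event log (hypothetical schema)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get("event_type", "unknown")] += 1
    return counts


# Hypothetical output: Counter({'op_start': 12, 'op_complete': 12, 'checkpoint': 4})
print(summarize_event_log("outputs/events/job_events.jsonl"))
```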