
[Feature][Connectors] Abnormal Data Logging #8005

Open
2 of 3 tasks
Ivan-gfan opened this issue Nov 10, 2024 · 2 comments

Comments

@Ivan-gfan
Contributor

Ivan-gfan commented Nov 10, 2024

Search before asking

  • I had searched the existing feature requests and found no similar requirement.

Description

Currently, there are no metrics for tracking abnormal data records, nor is there an option to ignore exceptions and continue execution. Whether the sink is JDBC or another data source, any error encountered during insertion terminates the whole job, which is not user-friendly.

Suggested Improvements:

1. Abnormal Data Metrics:

The final metrics should include not only the read and write counts but also the count of abnormal records. The sum of the abnormal record count and the successful write count should equal the total read count.
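
As a rough illustration only (the class and method names below are hypothetical, not an existing SeaTunnel API), the proposed counters and their invariant could look like this:

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical metric holder; the real metric API may differ. */
public class SyncMetrics {
    private final LongAdder readCount = new LongAdder();
    private final LongAdder writeCount = new LongAdder();
    private final LongAdder errorCount = new LongAdder(); // proposed: abnormal records

    public void recordRead()  { readCount.increment(); }
    public void recordWrite() { writeCount.increment(); }
    public void recordError() { errorCount.increment(); }

    /** Invariant proposed in this issue: every record read is either written or counted as abnormal. */
    public boolean consistent() {
        return readCount.sum() == writeCount.sum() + errorCount.sum();
    }
}
```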

2. Detailed Abnormal Record Entity:

Introduce a domain entity to record detailed information about abnormal records (a minimal sketch follows this list). This entity should include:

  • The identifier of the erroneous row.
  • The name of the column.
  • The erroneous data content.
  • The reason for the error.
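
A minimal sketch of such an entity, assuming illustrative field names rather than an existing SeaTunnel class:

```java
/** Hypothetical entity describing a single abnormal record; names are illustrative. */
public record AbnormalRecord(
        String rowIdentifier, // identifier of the erroneous row
        String columnName,    // name of the offending column
        String rawValue,      // the erroneous data content
        String errorReason) { // reason the write failed
}
```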

3. Batch Submission Handling:

Some connectors use batch submission to improve performance, relying on the transaction management of the target data source (e.g., the batch_size parameter in the JDBC connector). Users must therefore balance write throughput against record-level error granularity (see the sketch after this list).

  • If a batch contains an erroneous record, the entire batch will typically fail. As a result, the error count will accumulate in multiples of batch_size.

  • For precise error tracking, users would need to set batch_size to 1, but this compromises performance.

  • Conversely, for high-performance batch submissions, error tracking becomes less accurate.
    This trade-off needs to be managed by the user based on their specific use case, but the system should provide the necessary functionality.
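
One possible middle ground is a per-row fallback when a batch fails. The sketch below reuses the hypothetical AbnormalRecord entity above and assumes the JDBC connection runs with autocommit disabled; the class and method names are illustrative and are not part of the current JDBC connector:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FallbackBatchWriter {
    private final Connection connection;
    private final List<AbnormalRecord> abnormalRecords = new ArrayList<>();
    private long writeCount;
    private long errorCount;

    public FallbackBatchWriter(Connection connection) {
        this.connection = connection;
    }

    /** Writes a batch; on failure, retries row by row so errors can be attributed precisely. */
    public void writeBatch(String insertSql, List<Object[]> rows) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(insertSql)) {
            for (Object[] row : rows) {
                bind(ps, row);
                ps.addBatch();
            }
            ps.executeBatch();
            connection.commit();                 // assumes autocommit is disabled
            writeCount += rows.size();
        } catch (SQLException batchFailure) {
            connection.rollback();               // discard the failed batch
            // Fall back to single-row writes so abnormal rows are counted
            // individually instead of in multiples of batch_size.
            for (Object[] row : rows) {
                try (PreparedStatement ps = connection.prepareStatement(insertSql)) {
                    bind(ps, row);
                    ps.executeUpdate();
                    connection.commit();
                    writeCount++;
                } catch (SQLException rowFailure) {
                    connection.rollback();
                    errorCount++;
                    abnormalRecords.add(new AbnormalRecord(
                            String.valueOf(row[0]),    // row identifier (assumed to be the first column)
                            "unknown",                 // the offending column is often not reported by the driver
                            Arrays.toString(row),      // erroneous data content
                            rowFailure.getMessage())); // reason for the error
                }
            }
        }
    }

    private static void bind(PreparedStatement ps, Object[] row) throws SQLException {
        for (int i = 0; i < row.length; i++) {
            ps.setObject(i + 1, row[i]);
        }
    }
}
```

With this kind of fallback, a large batch_size keeps the fast path fast, and the slower per-row pass only runs for batches that actually contain an abnormal record.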

4. Planned Total Record Count in Metrics:

It would be beneficial to include the total planned record count in the metrics (e.g., the result of SELECT COUNT(*) FROM source); a sketch follows this list.

  • This would enable the implementation of a progress bar when using batch processing.

  • Currently, the metrics only show the cumulative read and write counts at the current time but do not include the total planned count for the entire task.
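
A minimal sketch of deriving a progress fraction from a planned total, assuming the source supports SELECT COUNT(*) and the table name is trusted; class and method names are illustrative:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class ProgressEstimator {
    private final long plannedTotal;

    public ProgressEstimator(Connection source, String table) throws SQLException {
        try (Statement st = source.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            this.plannedTotal = rs.getLong(1);
        }
    }

    /** Fraction of planned records already processed (written or counted as abnormal). */
    public double progress(long writeCount, long errorCount) {
        return plannedTotal == 0 ? 1.0
                : (double) (writeCount + errorCount) / plannedTotal;
    }
}
```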

Usage Scenario

1. Precise Error Row Counting and Detailed Error Information

  • The system should be able to accurately count the number of error rows.
  • For each error, detailed information should be recorded, including:
    • The specific row that caused the error.
    • The erroneous column identifier and name.
    • The erroneous data content.
    • The reason for the error.

2. Incremental Synchronization

  • The solution must support incremental data synchronization.
  • Errors encountered during synchronization should not halt the entire process.

3. User Display

  • Users should be able to view a summary of the synchronization process, including error statistics and detailed error records.

4. Key Pain Points:

  • Task Termination on Error:
    During large-scale data synchronization, a single error can cause the task to terminate abruptly.
    • This is especially frustrating when earlier successful transactions have already been committed to the database.
    • Users are left with incomplete data and have to restart or manually reconcile the process.

5. Desired Behavior:

  • The task should not terminate immediately upon encountering an error.
  • Instead, errors should be logged, and the synchronization process should continue.
  • At the end of the task, a comprehensive report should be available for users, showing:
    • Total records processed.
    • Successful writes.
    • Errors, including their details.
    • Metrics for read, write, and error counts.

This approach would improve user experience and ensure data integrity while allowing users to handle errors post-synchronization.
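
For illustration, the end-of-task summary described above could be rendered from the hypothetical AbnormalRecord entries like this (format and names are only a sketch, not a proposed final layout):

```java
import java.util.List;

public final class SyncReport {
    /** Renders a plain-text summary of the run: totals plus one line per abnormal record. */
    public static String render(long readCount, long writeCount, List<AbnormalRecord> errors) {
        StringBuilder sb = new StringBuilder()
                .append("Total records read : ").append(readCount).append('\n')
                .append("Successful writes  : ").append(writeCount).append('\n')
                .append("Abnormal records   : ").append(errors.size()).append('\n');
        for (AbnormalRecord e : errors) {
            sb.append("  row=").append(e.rowIdentifier())
              .append(" column=").append(e.columnName())
              .append(" value=").append(e.rawValue())
              .append(" reason=").append(e.errorReason())
              .append('\n');
        }
        return sb.toString();
    }
}
```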

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@Ivan-gfan
Contributor Author

@liugddx PTAL

@liugddx
Member

liugddx commented Nov 10, 2024

@liugddx PTAL

Thanks for following this issue! @Ivan-gfan LGTM! cc: @Hisoka-X @hailin0
