[Bug] Long-tail tasks in the Write Stage retry phase result in data loss. #2300

@yl09099

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

During the Write Stage retry phase, the MapOutputTrackerMaster clears the MapStatus entries for the corresponding shuffleId on the Driver side. However, when the stage has a large number of partitions, the MapStatus entries may not be cleared completely. When the Stage is then retried, fewer tasks run than the stage needs, which results in data loss. So far I have hit this data loss on a stage with 40000 partitions.
Below is a screenshot of my problem:
image
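
A minimal sketch of why an incomplete clear shrinks the retry, assuming the retried stage only re-runs partitions that have no registered map output. This is a toy model in Python, not Spark or Uniffle code: `clear_statuses`, its `limit` parameter, the `server-N` strings, and the 39000 cut-off are all hypothetical and chosen only for illustration; only the 40000-partition count comes from this report.

```python
from typing import List, Optional


def find_missing_partitions(map_statuses: List[Optional[str]]) -> List[int]:
    """Return the partitions that have no registered map output.

    Assumption: a retried stage only launches tasks for these partitions.
    """
    return [i for i, status in enumerate(map_statuses) if status is None]


def clear_statuses(map_statuses: List[Optional[str]], limit: Optional[int] = None) -> None:
    """Clear registered map outputs in place.

    `limit` is a hypothetical knob that models an incomplete clear:
    only the first `limit` entries are reset, the rest stay stale.
    """
    n = len(map_statuses) if limit is None else min(limit, len(map_statuses))
    for i in range(n):
        map_statuses[i] = None


NUM_PARTITIONS = 40000  # partition count from this report

# Complete clear: every partition is missing, so every task is retried.
full = [f"server-{i % 8}" for i in range(NUM_PARTITIONS)]
clear_statuses(full)
print(len(find_missing_partitions(full)))     # 40000

# Incomplete clear (the 39000 cut-off is arbitrary): stale entries remain,
# so the retry launches fewer tasks than the stage has partitions.
partial = [f"server-{i % 8}" for i in range(NUM_PARTITIONS)]
clear_statuses(partial, limit=39000)
print(len(find_missing_partitions(partial)))  # 39000
```

In this model the 1000 partitions whose entries were not cleared are never re-run, so the reader side would only see whatever the stale entries point to. That matches the symptom described above (fewer tasks on retry, then data loss), but it does not show where the real clear falls short.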

Affects Version(s)

0.10.0

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
