@coolderli

  • distinct() triggers a shuffle that stores its output, and downstream stages read from that shuffle output, so the explicit cache is no longer needed.
  • Use count() instead of collect() to avoid an OOM on the driver.
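
The two points above can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: the helper name `count_unique_rows`, the `run_demo` function, the `text` column, and the sample rows are all hypothetical, and it assumes a PySpark DataFrame as input.

```python
def count_unique_rows(df):
    """Return the number of distinct rows in a Spark DataFrame.

    Hypothetical helper for illustration only.
    distinct() introduces a shuffle, and count() aggregates on the
    executors so that only a single integer reaches the driver.
    collect() would instead pull every row into driver memory.
    """
    # No .cache() call here: the deduplicated data is consumed by one action.
    return df.distinct().count()


def run_demo():
    """Run the helper on a tiny local DataFrame (assumes pyspark is installed)."""
    # Imported lazily so the helper above does not require Spark at import time.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("distinct-count-sketch")
        .getOrCreate()
    )
    try:
        # Sample data is made up for the sketch.
        df = spark.createDataFrame([("a",), ("b",), ("a",)], ["text"])
        print(count_unique_rows(df))
    finally:
        spark.stop()
```

Calling `run_demo()` starts a local Spark session and prints the distinct row count. If the rows themselves are needed afterwards, writing them out from the executors is usually safer than collecting them on the driver.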

@HYLcool added labels on Feb 7, 2025: enhancement (New feature or request), good first issue (Good for newcomers), dj:dist (issues/PRs about distributed data processing), dj:efficiency (regarding efficiency issues and enhancements), dj:tools (issues/PRs about specific tools)