[SPARK-52509][K8S] Cleanup shuffles from fallback storage #90

Open: wants to merge 1 commit into master

EnricoMi
What changes were proposed in this pull request?

Shuffle data of individual shuffles is deleted from the fallback storage during regular shuffle cleanup.

Why are the changes needed?

Currently, shuffle data is removed from the fallback storage only on Spark context shutdown. Long-running Spark jobs therefore accumulate shuffle data in the fallback storage even though Spark no longer uses it. These shuffles should be cleaned up while the Spark context is still running.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests and manual test via reproduction example.

Run the reproduction example without the <<< "$scala". In the Spark shell, execute this code:

import org.apache.spark.sql.SaveMode

val n = 100000000
val j = spark.sparkContext.broadcast(1000)
val x = spark.range(0, n, 1, 100).select($"id".cast("int"))
x.as[Int]
 .mapPartitions { it => if (it.hasNext && it.next < n / 100 * 80) Thread.sleep(2000); it }
 .groupBy($"value" % 1000).as[Int, Int]
 .flatMapSortedGroups($"value"){ case (m, it) => if (it.hasNext && it.next == 0) Thread.sleep(10000); it }
 .write.mode(SaveMode.Overwrite).csv("/tmp/spark.csv")

This writes some data of shuffle 0 to the fallback storage.

Invoking System.gc() removes that shuffle from the fallback storage.
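
The manual check above could be scripted in the same Spark shell session roughly as follows. This is a minimal sketch, not part of the patch: it assumes the fallback storage location is set through `spark.storage.decommission.fallbackStorage.path`, and that files are laid out as `<fallbackPath>/<appId>/<shuffleId>/...`. The helper `listShuffleDirs` and the five-second wait are illustrative choices, since the cleanup triggered by `System.gc()` runs asynchronously.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: the reproduction setup configures the fallback storage via this setting.
val fallbackPath = new Path(
  spark.sparkContext.getConf.get("spark.storage.decommission.fallbackStorage.path"))
val fs = FileSystem.get(fallbackPath.toUri, spark.sparkContext.hadoopConfiguration)

// Assumption: shuffle files live under <fallbackPath>/<appId>/<shuffleId>/.
val appDir = new Path(fallbackPath, spark.sparkContext.applicationId)

// Hypothetical helper: names of the per-shuffle directories currently in the fallback storage.
def listShuffleDirs(): Seq[String] =
  if (fs.exists(appDir)) fs.listStatus(appDir).map(_.getPath.getName).toSeq
  else Seq.empty

println(s"before gc: ${listShuffleDirs()}")
System.gc()          // lets the context cleaner notice unreferenced shuffles
Thread.sleep(5000)   // cleanup is asynchronous; the wait time is a guess
println(s"after gc:  ${listShuffleDirs()}")
```

If the assumptions hold, the directory for shuffle 0 should appear in the first listing and be absent from the second.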

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the CORE label on Jun 12, 2025
EnricoMi force-pushed the fallback-storage-cleanup branch from eef1d80 to d96df77 on June 13, 2025 07:37
EnricoMi changed the title from "Cleanup shuffle from fallback storage" to "[SPARK-52509][K8S] Cleanup shuffles from fallback storage" on Jun 17, 2025