
Performance issues with linking 35 million records on PySpark and AWS EMR #1017

Answered by RobinL
reyvasquez-vh asked this question in Q&A

Could you try the following and let us know if it helps:

After you read the data in:

data_left = data_left.repartition(50)
data_right = data_right.repartition(50)

(the above probably won't make much difference, but won't hurt either)
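For concreteness, a minimal sketch of the read-then-repartition step, assuming Parquet inputs on S3 (the paths are placeholders, not from this discussion):

```python
# Illustrative paths only
data_left = spark.read.parquet("s3://your-bucket/left/")
data_right = spark.read.parquet("s3://your-bucket/right/")

# Spread each input evenly across the cluster before Splink touches it
data_left = data_left.repartition(50)
data_right = data_right.repartition(50)
```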

Set the following in your Spark conf:

conf.set("spark.default.parallelism", "400")
conf.set("spark.sql.shuffle.partitions", "400")

And pass the following when you construct the linker:

break_lineage_method="parquet"
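
A rough sketch of where this argument goes, assuming Splink 3's SparkLinker (the import path has moved between releases, and the settings dictionary is your existing one, not shown here):

```python
from splink.spark.linker import SparkLinker  # older releases: splink.spark.spark_linker

linker = SparkLinker(
    [data_left, data_right],
    settings,  # your existing Splink settings dictionary
    break_lineage_method="parquet",  # checkpoint intermediate results to parquet
    spark=spark,
)
```

Checkpointing to parquet truncates the query plan that Spark otherwise keeps growing across Splink's iterative steps, which is a common cause of jobs slowing to a crawl at this scale.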

You can read about Splink logging here:
https://moj-analytical-services.github.io/splink/dev_guides/debug_modes.html#logging
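
For a quick start, Splink's loggers are ordinary Python loggers, so something along these lines (levels chosen for illustration) will surface progress messages:

```python
import logging

logging.basicConfig(format="%(message)s")
# Use logging.DEBUG for more detail; INFO shows the main steps
logging.getLogger("splink").setLevel(logging.INFO)
```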
