
Performance issues with linking 35 million records on PySpark and AWS EMR #1017

Answered by RobinL
reyvasquez-vh asked this question in Q&A

Could you try the following and let us know if it helps:

After you read the data in:

data_left = data_left.repartition(50)
data_right = data_right.repartition(50)

(the above probably won't make much difference, but won't hurt either)
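For concreteness, a minimal sketch of the read-then-repartition step, assuming Parquet inputs on S3 (the paths are placeholders, not from this discussion):

```python
# Illustrative paths only
data_left = spark.read.parquet("s3://your-bucket/left/")
data_right = spark.read.parquet("s3://your-bucket/right/")

# Spread each input evenly across the cluster before Splink touches it
data_left = data_left.repartition(50)
data_right = data_right.repartition(50)
```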

Set the following in your Spark conf:

conf.set("spark.default.parallelism", "400")
conf.set("spark.sql.shuffle.partitions", "400")

And pass the following when you construct the linker:

break_lineage_method="parquet"
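
A rough sketch of where this argument goes, assuming Splink 3's SparkLinker (the import path has moved between releases, and the settings dictionary is your existing one, not shown here):

```python
from splink.spark.linker import SparkLinker  # older releases: splink.spark.spark_linker

linker = SparkLinker(
    [data_left, data_right],
    settings,  # your existing Splink settings dictionary
    break_lineage_method="parquet",  # checkpoint intermediate results to parquet
    spark=spark,
)
```

Checkpointing to parquet truncates the query plan that Spark otherwise keeps growing across Splink's iterative steps, which is a common cause of jobs slowing to a crawl at this scale.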

You can read about Splink logging here:
https://moj-analytical-services.github.io/splink/dev_guides/debug_modes.html#logging
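
For a quick start, Splink's loggers are ordinary Python loggers, so something along these lines (levels chosen for illustration) will surface progress messages:

```python
import logging

logging.basicConfig(format="%(message)s")
# Use logging.DEBUG for more detail; INFO shows the main steps
logging.getLogger("splink").setLevel(logging.INFO)
```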
