Performance issues with linking 35 million records on PySpark and AWS EMR #1017
-
I'm using Splink to link two 35-million-record datasets on an AWS EMR cluster and am running into performance issues. The data has firstname, lastname, and dob columns. The code below works for 3 million records, but not for 35 million. I'm using r5.2xlarge instances and have seen the cluster scale up to 33 nodes and then drop back to 3 without completing in 3+ hours. It gives the impression that resources are not being utilized, and so the cluster is winding down. The execution then just gets stuck.
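My setup is roughly as follows (simplified for this post — the blocking rules, comparison choices, and S3 paths are illustrative placeholders rather than my exact code):

```python
from pyspark.sql import SparkSession

import splink.spark.spark_comparison_library as cl
from splink.spark.spark_linker import SparkLinker

spark = SparkSession.builder.getOrCreate()

# Both inputs have unique_id, firstname, lastname and dob columns
# (the paths are placeholders)
df_left = spark.read.parquet("s3://my-bucket/left/")
df_right = spark.read.parquet("s3://my-bucket/right/")

settings = {
    "link_type": "link_only",
    # Illustrative blocking rules — mine are similar in spirit
    "blocking_rules_to_generate_predictions": [
        "l.dob = r.dob",
        "l.lastname = r.lastname",
    ],
    "comparisons": [
        cl.levenshtein_at_thresholds("firstname", 2),
        cl.levenshtein_at_thresholds("lastname", 2),
        cl.exact_match("dob"),
    ],
}

linker = SparkLinker([df_left, df_right], settings)
linker.estimate_u_using_random_sampling(target_rows=1e6)
df_predictions = linker.predict()
```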
-
Could you try the following and let us know if it helps:
After you read the data in:
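For example, something along these lines — repartition each input and persist it, so Splink's repeated passes over the data don't recompute the read every time (the partition count is just a starting point to tune for your cluster):

```python
# Illustrative: aim for a small multiple of the total executor
# cores available on the cluster; repeat for each input DataFrame.
df_left = df_left.repartition(640)
df_left.persist()
```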
(the above probably won't make much difference, but won't hurt either)
Conf:
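For instance (the value here is illustrative — Spark's default of 200 shuffle partitions is far too low for a 35m-record linkage and can leave most of the cluster idle):

```python
# Set before running the linkage: Splink executes its work as Spark SQL,
# so this controls the parallelism of the expensive shuffle stages.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
```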
And:
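Turn up the Splink logger so you can see which SQL statement the job is actually stuck on:

```python
import logging

# Splink logs the SQL it executes; INFO (or DEBUG for more detail)
# shows which step is running when the job appears to hang.
logging.getLogger("splink").setLevel(logging.INFO)
```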
You can read about Splink logging here:
https://moj-analytical-services.github.io/splink/dev_guides/debug_modes.html#logging