Optimise train u #2870
base: splink_5_dev
Conversation
This is pretty interesting! I'm wondering what the benefit of the probe step is, versus just stepping through the chunks and stopping early once you reach the minimum count in each level? If I'm understanding it correctly, with 10 chunks the probe uses 1/100 of max_pairs. In that case, is there much disadvantage to just using 100 chunks to begin with?
The issue is that running the computation in n chunks takes a little longer than running it in a single chunk, and this gets worse the more chunks you use. As a result, you can't set the number of chunks arbitrarily high, i.e. there is a trade-off between the speed gain from early exit and the slowdown from running a large number of chunks. I suspect the penalty for a high number of chunks is also significantly higher in Spark than in DuckDB.

You could try to be clever and use some sort of adaptive algorithm, but for the sake of simplicity I decided equal-size chunks were good enough, and that kept the code pretty simple. One area of complexity with chunks of different sizes is ensuring you evaluate all pairs without replacement. The current simple approach is also quite nice because it reuses the chunking mechanism from predict().

The 'probe' is just a way of cheating: it gets most of the benefit of an adaptive algorithm without much of the complexity cost. I run the probe to get the additional speedup from small chunks at small cost, but if the (very quick) probe doesn't succeed, I proceed to the normal algorithm. Note the probe is simple: if we can't exit immediately after the probe, we throw away its results rather than try to reuse them (changing the subsequent code to select only pairs not already used by the probe would be fiddly).
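The probe-then-chunks control flow described above can be sketched as follows. This is a hypothetical illustration, not Splink's actual implementation: `run_chunk`, the count bookkeeping, and the parameter names are all assumptions made for the example.

```python
def counts_reached(counts, min_count):
    """True once every comparison level has at least min_count matches."""
    return bool(counts) and all(c >= min_count for c in counts.values())


def estimate_u_with_probe(run_chunk, max_pairs, num_chunks, probe_fraction, min_count):
    # Cheap probe: evaluate only a small fraction of max_pairs first.
    probe_counts = run_chunk(int(max_pairs * probe_fraction))
    if counts_reached(probe_counts, min_count):
        return probe_counts  # exit immediately; the main loop never runs

    # Probe failed: throw away its results and fall back to the normal
    # equal-size chunks, stopping once every level reaches min_count.
    counts = {}
    for _ in range(num_chunks):
        for level, c in run_chunk(max_pairs // num_chunks).items():
            counts[level] = counts.get(level, 0) + c
        if counts_reached(counts, min_count):
            break
    return counts
```

Because the probe's results are discarded on failure, the fallback path is identical to the plain chunked algorithm, which keeps the code simple at the cost of re-evaluating a small fraction of pairs.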
```python
    min_count_per_level: int | None,
    probe_percent_of_max_pairs: float | None = None,
) -> bool:
    if probe_percent_of_max_pairs is not None:
```
This logic has essentially just moved: it used to be inlined in estimate_u_values, but now that function has more complex chunking/early-stop logic, it makes sense to factor it out.
The main change is just that we now need to pass in chunk information.
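A sketch of what the factored-out check might look like, based on the signature visible in the diff (`min_count_per_level`, `probe_percent_of_max_pairs`, returning `bool`). The function name and body here are assumptions for illustration, not the actual Splink code.

```python
def all_levels_reached_min_count(
    counts_per_level,
    min_count_per_level,
    probe_percent_of_max_pairs=None,
):
    # Hypothetical helper: name and body are illustrative only.
    # No minimum configured: any counts are acceptable.
    if min_count_per_level is None:
        return True
    reached = all(c >= min_count_per_level for c in counts_per_level)
    if probe_percent_of_max_pairs is not None and not reached:
        # Probe run: a shortfall is expected sometimes, and the caller
        # simply falls back to the full chunked computation.
        return False
    return reached
```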
```python
linker._db_api.debug_keep_temp_views = True
```
```diff
- linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs)
+ linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs, num_chunks=1)
```
Setting `num_chunks = 1` here also disables the probe.
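The rationale for this behaviour can be made explicit with a small predicate. This is an illustrative sketch, not the actual Splink code; the function name is an assumption.

```python
def should_run_probe(num_chunks, probe_percent_of_max_pairs):
    # Hypothetical helper for illustration. With a single chunk there is
    # no early exit to gain, so a probe pass would be pure overhead.
    return num_chunks > 1 and probe_percent_of_max_pairs is not None
```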
Updates `estimate_u_using_random_sampling` to:
- process one `Comparison` at a time, reducing memory overhead
- make things about 6-10x faster in testing, see here
Also means the user gets feedback on progress
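The per-`Comparison` loop is what makes progress feedback natural: each comparison is a discrete unit of work to report on. A minimal sketch, assuming a hypothetical `estimate_one` callback and an illustrative log format (not Splink's actual output):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("estimate_u")


def estimate_u_per_comparison(comparisons, estimate_one):
    # Process one comparison at a time: only that comparison's pairwise
    # results need to be held in memory, and each iteration is a natural
    # point at which to report progress to the user.
    results = {}
    total = len(comparisons)
    for i, comparison in enumerate(comparisons, start=1):
        results[comparison] = estimate_one(comparison)
        logger.info("Estimated u for comparison %d/%d: %s", i, total, comparison)
    return results
```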
Issue here
Example of outputs
results in