
Conversation


RobinL (Member) commented Jan 6, 2026

Updates `estimate_u_using_random_sampling` to:

  • Estimate one Comparison at a time, reducing memory overhead
  • Chunk the estimates and exit early once a minimum count has been reached (i.e. once the count for the comparison level with the smallest count exceeds a threshold)

Makes things about 6-10x faster in testing; see here.

It also means the user gets feedback on progress.

Issue here
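To make the chunked, early-exit approach above concrete, here is a minimal, self-contained sketch. It is not the actual Splink implementation: the level function, the chunking of pairs and the data are all toy stand-ins, and names like `toy_level` and `estimate_u_chunked` are purely illustrative.

```python
# Hedged sketch of the chunked, early-exit u estimation described above.
# All names and the toy "levels" are illustrative only.
import random

LEVELS = ("exact", "partial", "other")

def toy_level(a, b):
    # Stand-in for a comparison's levels (e.g. exact / fuzzy / else).
    if a == b:
        return "exact"
    if a[:2] == b[:2]:
        return "partial"
    return "other"

def estimate_u_chunked(pairs, num_chunks=30, min_count_per_level=100):
    """Accumulate level counts chunk by chunk; stop once even the rarest
    level has at least min_count_per_level observations."""
    chunk_size = max(1, len(pairs) // num_chunks)
    counts = {level: 0 for level in LEVELS}
    pairs_seen = 0

    for i in range(num_chunks):
        chunk = pairs[i * chunk_size : (i + 1) * chunk_size]
        for a, b in chunk:
            counts[toy_level(a, b)] += 1
        pairs_seen += len(chunk)

        if min(counts.values()) >= min_count_per_level:
            break  # early exit: min count hit for every level

    # u probability per level = share of random pairs falling in that level
    return {level: c / pairs_seen for level, c in counts.items()}

names = ["".join(random.choices("abcde", k=4)) for _ in range(2_000)]
pairs = [(random.choice(names), random.choice(names)) for _ in range(100_000)]
print(estimate_u_chunked(pairs))
```

In the PR itself each Comparison is run through a loop of this shape separately, which is what keeps the memory overhead low.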

Example of output:

```python
linker.training.estimate_u_using_random_sampling(max_pairs=1e9, num_chunks=30)
```

results in:

```
----- Estimating u probabilities using random sampling -----
Estimating u with: max_pairs = 1,000,000,000, min_count_per_level = 100, num_chunks = 30

Estimating u for: full_name (Comparison 1 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 9 for comparison level Jaro-Winkler distance of full_name >= 0.97 (cvv=3)
  Probe did not converge; restarting with normal chunking

  Running chunk 1/30
  Count of 77 for level Jaro-Winkler distance of full_name >= 0.97 (cvv=3). Chunk took 1.9 seconds.
  Min u_count not hit, continuing.
  Running chunk 2/30
  Count of 153 for level Jaro-Winkler distance of full_name >= 0.97 (cvv=3). Chunk took 2.0 seconds.
  Exiting early since min count of 153 exceeds min_count_per_level = 100

Estimating u for: dod (Comparison 2 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 4,571 for comparison level Exact match on dod (cvv=6)
  Exiting early since min count of 4,571 exceeds min_count_per_level = 100

Estimating u for: occupation (Comparison 3 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 51,255 for comparison level Exact match on occupation (cvv=1)
  Exiting early since min count of 51,255 exceeds min_count_per_level = 100

Estimating u for: dob (Comparison 4 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 5,990 for comparison level Exact match on dob (cvv=6)
  Exiting early since min count of 5,990 exceeds min_count_per_level = 100

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - full_name (no m values are trained).
    - dod (no m values are trained).
    - occupation (no m values are trained).
    - dob (no m values are trained).
```

RobinL changed the title from "(WIP) Optimise train u" to "Optimise train u" on Jan 7, 2026
RobinL added the splink_5 label on Jan 7, 2026
aymonwuolanne (Contributor) commented:

This is pretty interesting! I'm wondering what the benefit of the probe step is, versus just stepping through the chunks and stopping early if you reach the minimum count in each level? If I'm understanding it correctly, if you have 10 chunks then the probe uses 1/100 of the max_pairs. In that case is there much disadvantage to just using 100 chunks to begin with?


RobinL commented Jan 9, 2026

The issue is that running the computation in n chunks takes a little longer than running it in a single chunk, and this gets worse the more chunks you use. As a result, you can't just set the number of chunks arbitrarily high: there is a trade-off between the speed gain from early exit and the slowdown from running a large number of chunks.

I suspect the penalty for setting a high number of chunks is also significantly higher in Spark than in DuckDB.

You could try to be clever and use some sort of adaptive algorithm, but for the sake of simplicity I decided equal-sized chunks were good enough; it keeps the code straightforward. One area of complexity with chunks of different sizes is ensuring you evaluate all pairs without replacement. The current simple approach is also quite nice because it reuses the chunking mechanism from predict().
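As a toy illustration of that point (this is not Splink's chunking code, and `chunk_index` is a made-up helper): with a fixed number of equal chunks, a deterministic mapping from pair to chunk means every pair is evaluated exactly once across the chunks with no bookkeeping, whereas variable-sized adaptive samples would have to track which pairs had already been used.

```python
# Toy illustration (not Splink's implementation) of why equal, deterministic
# chunks make sampling without replacement trivial: each pair maps to
# exactly one chunk, so iterating over chunks never revisits a pair.
import hashlib

def chunk_index(id_l, id_r, num_chunks):
    key = f"{id_l}|{id_r}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_chunks

num_chunks = 10
pairs = [(l, r) for l in range(100) for r in range(100)]
assignments = [chunk_index(l, r, num_chunks) for l, r in pairs]

# Every pair lands in exactly one chunk, and chunk sizes are roughly equal
print({c: assignments.count(c) for c in range(num_chunks)})
```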

The 'probe' is just a way of cheating: it gets most of the benefit of an adaptive algorithm without much of the complexity cost. I run the probe to get the additional speedup from small chunks at small cost, but then if the (very quick) probe doesn't converge, I proceed to the normal algorithm. Note the probe is kept simple: if we can't exit immediately after the probe, we throw away its results rather than trying to reuse them (since changing the subsequent code to only select pairs not already used by the probe is fiddly).
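A rough sketch of that probe-then-fall-back control flow, again with toy data and made-up names (`min_level_count`, `estimate_with_probe`); the real code works on SQL-generated pair samples rather than Python lists:

```python
# Hedged sketch of the probe step described above: run a tiny sample first,
# exit immediately if it already has enough counts per level, otherwise
# discard it and fall back to the normal equal-chunk loop over all pairs.
import random

def min_level_count(pairs):
    # Toy stand-in for "count of the rarest comparison level"
    counts = {"match": 0, "non_match": 0}
    for a, b in pairs:
        counts["match" if a == b else "non_match"] += 1
    return min(counts.values())

def estimate_with_probe(pairs, num_chunks=30, min_count_per_level=100,
                        probe_fraction_of_chunk=0.1):
    chunk_size = len(pairs) // num_chunks
    # 1/10 of a chunk: ~0.33% of max_pairs with 30 chunks, as in the log above
    probe_size = int(chunk_size * probe_fraction_of_chunk)

    # 1. Cheap probe: if even this tiny sample converges, we are done.
    if min_level_count(pairs[:probe_size]) >= min_count_per_level:
        return pairs[:probe_size]

    # 2. Probe did not converge: throw its results away and run the normal
    #    chunk loop over all pairs (simpler than excluding the probe's pairs).
    used = []
    for i in range(num_chunks):
        used.extend(pairs[i * chunk_size : (i + 1) * chunk_size])
        if min_level_count(used) >= min_count_per_level:
            break  # early exit
    return used

values = list(range(500))
pairs = [(random.choice(values), random.choice(values)) for _ in range(60_000)]
print(f"pairs actually evaluated: {len(estimate_with_probe(pairs)):,}")
```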

Base automatically changed from splinkdataframes_everywhere to splink_5_dev January 12, 2026 15:15
RobinL changed the title from "Optimise train u" to "(WIP) Optimise train u" on Jan 12, 2026
```python
    min_count_per_level: int | None,
    probe_percent_of_max_pairs: float | None = None,
) -> bool:
    if probe_percent_of_max_pairs is not None:
```
RobinL (Member, Author) commented:

This logic has essentially just moved: it used to be inlined in estimate_u_values, but now that function has more complex chunking/stop-early logic, so it makes sense to factor it out.

The main change is just that we now need to pass in chunk information.

```python
linker._db_api.debug_keep_temp_views = True

linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs)
linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs, num_chunks=1)
```
RobinL (Member, Author) commented:
Setting num_chunks = 1 here also disables the probe.

RobinL requested a review from ADBond on January 12, 2026 17:36
RobinL changed the title from "(WIP) Optimise train u" to "Optimise train u" on Jan 12, 2026