
Conversation


RobinL (Member) commented Jan 6, 2026

Updates `estimate_u_using_random_sampling` to:

  • Estimate one Comparison at a time, reducing memory overhead
  • Chunk the estimates and exit early once a minimum count has been reached (i.e. once the count for the comparison level with the smallest count exceeds a threshold)

Makes things about 6-10x faster in testing; see here.

It also means the user gets feedback on progress.

Issue here
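To make the chunked, early-exit approach above concrete, here is a minimal, self-contained sketch. It is not the actual Splink implementation: the level function, the chunking of pairs and the data are all toy stand-ins, and names like `toy_level` and `estimate_u_chunked` are purely illustrative.

```python
# Hedged sketch of the chunked, early-exit u estimation described above.
# All names and the toy "levels" are illustrative only.
import random

LEVELS = ("exact", "partial", "other")

def toy_level(a, b):
    # Stand-in for a comparison's levels (e.g. exact / fuzzy / else).
    if a == b:
        return "exact"
    if a[:2] == b[:2]:
        return "partial"
    return "other"

def estimate_u_chunked(pairs, num_chunks=30, min_count_per_level=100):
    """Accumulate level counts chunk by chunk; stop once even the rarest
    level has at least min_count_per_level observations."""
    chunk_size = max(1, len(pairs) // num_chunks)
    counts = {level: 0 for level in LEVELS}
    pairs_seen = 0

    for i in range(num_chunks):
        chunk = pairs[i * chunk_size : (i + 1) * chunk_size]
        for a, b in chunk:
            counts[toy_level(a, b)] += 1
        pairs_seen += len(chunk)

        if min(counts.values()) >= min_count_per_level:
            break  # early exit: min count hit for every level

    # u probability per level = share of random pairs falling in that level
    return {level: c / pairs_seen for level, c in counts.items()}

names = ["".join(random.choices("abcde", k=4)) for _ in range(2_000)]
pairs = [(random.choice(names), random.choice(names)) for _ in range(100_000)]
print(estimate_u_chunked(pairs))
```

In the PR itself each Comparison is run through a loop of this shape separately, which is what keeps the memory overhead low.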

Example of output:

```python
linker.training.estimate_u_using_random_sampling(max_pairs=1e9, num_chunks=30)
```

results in:

```
----- Estimating u probabilities using random sampling -----
Estimating u with: max_pairs = 1,000,000,000, min_count_per_level = 100, num_chunks = 30

Estimating u for: full_name (Comparison 1 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 9 for comparison level Jaro-Winkler distance of full_name >= 0.97 (cvv=3)
  Probe did not converge; restarting with normal chunking

  Running chunk 1/30
  Count of 77 for level Jaro-Winkler distance of full_name >= 0.97 (cvv=3). Chunk took 1.9 seconds.
  Min u_count not hit, continuing.
  Running chunk 2/30
  Count of 153 for level Jaro-Winkler distance of full_name >= 0.97 (cvv=3). Chunk took 2.0 seconds.
  Exiting early since min count of 153 exceeds min_count_per_level = 100

Estimating u for: dod (Comparison 2 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 4,571 for comparison level Exact match on dod (cvv=6)
  Exiting early since min count of 4,571 exceeds min_count_per_level = 100

Estimating u for: occupation (Comparison 3 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 51,255 for comparison level Exact match on occupation (cvv=1)
  Exiting early since min count of 51,255 exceeds min_count_per_level = 100

Estimating u for: dob (Comparison 4 of 4)
  Running probe chunk (~0.33% of max_pairs)
  Min u_count: 5,990 for comparison level Exact match on dob (cvv=6)
  Exiting early since min count of 5,990 exceeds min_count_per_level = 100

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - full_name (no m values are trained).
    - dod (no m values are trained).
    - occupation (no m values are trained).
    - dob (no m values are trained).
```

RobinL changed the title from "(WIP) Optimise train u" to "Optimise train u" on Jan 7, 2026
RobinL added the splink_5 label on Jan 7, 2026
aymonwuolanne (Contributor) commented:

This is pretty interesting! I'm wondering what the benefit of the probe step is, versus just stepping through the chunks and stopping early if you reach the minimum count in each level? If I'm understanding it correctly, if you have 10 chunks then the probe uses 1/100 of the max_pairs. In that case is there much disadvantage to just using 100 chunks to begin with?


RobinL commented Jan 9, 2026

The issue is that running the computation in n chunks takes a little longer than running it in a single chunk, and this gets worse the more chunks you use. As a result, you can't just set the number of chunks arbitrarily high: there is a trade-off between the speed gain from early exit and the slowdown from running a large number of chunks.

I suspect the penalty for setting a high number of chunks is also significantly higher in Spark than in DuckDB.

You could try to be clever and use some sort of adaptive algorithm, but for the sake of simplicity I decided equal-sized chunks were good enough; it keeps the code straightforward. One area of complexity with chunks of different sizes is ensuring you evaluate all pairs without replacement. The current simple approach is also quite nice because it reuses the chunking mechanism from predict().
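As a toy illustration of that point (this is not Splink's chunking code, and `chunk_index` is a made-up helper): with a fixed number of equal chunks, a deterministic mapping from pair to chunk means every pair is evaluated exactly once across the chunks with no bookkeeping, whereas variable-sized adaptive samples would have to track which pairs had already been used.

```python
# Toy illustration (not Splink's implementation) of why equal, deterministic
# chunks make sampling without replacement trivial: each pair maps to
# exactly one chunk, so iterating over chunks never revisits a pair.
import hashlib

def chunk_index(id_l, id_r, num_chunks):
    key = f"{id_l}|{id_r}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_chunks

num_chunks = 10
pairs = [(l, r) for l in range(100) for r in range(100)]
assignments = [chunk_index(l, r, num_chunks) for l, r in pairs]

# Every pair lands in exactly one chunk, and chunk sizes are roughly equal
print({c: assignments.count(c) for c in range(num_chunks)})
```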

The 'probe' is just a way of cheating: it gets most of the benefit of an adaptive algorithm without much of the complexity cost. I run the probe to get the additional speedup from small chunks at small cost, but then if the (very quick) probe doesn't converge, I proceed to the normal algorithm. Note the probe is kept simple: if we can't exit immediately after the probe, we throw away its results rather than trying to reuse them (since changing the subsequent code to only select pairs not already used by the probe is fiddly).
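A rough sketch of that probe-then-fall-back control flow, again with toy data and made-up names (`min_level_count`, `estimate_with_probe`); the real code works on SQL-generated pair samples rather than Python lists:

```python
# Hedged sketch of the probe step described above: run a tiny sample first,
# exit immediately if it already has enough counts per level, otherwise
# discard it and fall back to the normal equal-chunk loop over all pairs.
import random

def min_level_count(pairs):
    # Toy stand-in for "count of the rarest comparison level"
    counts = {"match": 0, "non_match": 0}
    for a, b in pairs:
        counts["match" if a == b else "non_match"] += 1
    return min(counts.values())

def estimate_with_probe(pairs, num_chunks=30, min_count_per_level=100,
                        probe_fraction_of_chunk=0.1):
    chunk_size = len(pairs) // num_chunks
    # 1/10 of a chunk: ~0.33% of max_pairs with 30 chunks, as in the log above
    probe_size = int(chunk_size * probe_fraction_of_chunk)

    # 1. Cheap probe: if even this tiny sample converges, we are done.
    if min_level_count(pairs[:probe_size]) >= min_count_per_level:
        return pairs[:probe_size]

    # 2. Probe did not converge: throw its results away and run the normal
    #    chunk loop over all pairs (simpler than excluding the probe's pairs).
    used = []
    for i in range(num_chunks):
        used.extend(pairs[i * chunk_size : (i + 1) * chunk_size])
        if min_level_count(used) >= min_count_per_level:
            break  # early exit
    return used

values = list(range(500))
pairs = [(random.choice(values), random.choice(values)) for _ in range(60_000)]
print(f"pairs actually evaluated: {len(estimate_with_probe(pairs)):,}")
```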

Base automatically changed from splinkdataframes_everywhere to splink_5_dev January 12, 2026 15:15
RobinL changed the title from "Optimise train u" to "(WIP) Optimise train u" on Jan 12, 2026
```python
    min_count_per_level: int | None,
    probe_percent_of_max_pairs: float | None = None,
) -> bool:
    if probe_percent_of_max_pairs is not None:
```
RobinL (Member, Author) commented:

This logic has essentially just moved: it used to be inlined in estimate_u_values, but now that function has more complex chunking/stop-early logic, so it makes sense to factor it out.

The main change is just that we now need to pass in chunk information.

```python
linker._db_api.debug_keep_temp_views = True

linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs)
linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs, num_chunks=1)
```
RobinL (Member, Author) commented:
Setting num_chunks = 1 here also disables the probe.

RobinL requested a review from ADBond on January 12, 2026 17:36
RobinL changed the title from "(WIP) Optimise train u" to "Optimise train u" on Jan 12, 2026