# (WIP) 2580 improve runtimes by pushing up common case statements into precomputed values #2630
Existing code: With speedup:
I think the structure of this is a bit wrong: passing the experimental_optimisation flag isn't quite right. I think this should 'belong' to the comparison, and we want to perform the optimisation within the comparison so we already have access to the SQL dialect. But for now I'm just going to get it working. Also remember this current method doesn't immediately work with compare_two_records().
Remember to add some tests based on the context.md file.
The following fails on sqlglot 25.x but works on 26.6.0. The error is something to do with XOR at q^0.33, i.e. it's not correctly translating/transpiling the `^` operator.
Quick prompt: Details
Detailed prompt: Details
we want to generate something like:

```sql
WITH __splink__reusable_function_values AS (
    SELECT
        *,
        jaro_winkler_similarity("first_name_l", "first_name_r") AS jws_first_name
    FROM blocked_with_cols
)
SELECT
    CASE
        WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1
        WHEN "first_name_l" = "first_name_r" THEN 3
        WHEN jws_first_name >= 0.9 THEN 2
        WHEN jws_first_name >= 0.7 THEN 1
        ELSE 0
    END AS gamma_first_name
FROM __splink__reusable_function_values
```

To implement, a possible approach is to use SQLGlot to:
### How This Fits into Splink

The CASE statement in which these repeated function calls occur is produced by Splink's comparison SQL generation.

### Main Files Affected
### Call Stack Example

In a typical prediction run the following events occur:
Your optimization code (using SQLGlot) will modify the SQL at step (1) so that any repeated function call is replaced by a reusable variable, and an appropriate CTE is added to precompute these values.

### Implementation Highlights
### Next Steps for Integration
### Questions & Verification

Before fully integrating, consider running these checks:
### SQL Pipeline Details

The optimization needs to be integrated into a specific part of Splink's SQL generation pipeline. Here's the detailed flow:
```sql
-- Stage 1: Join blocked pairs with input data
WITH blocked_with_cols AS (
    SELECT
        l.unique_id AS unique_id_l,
        r.unique_id AS unique_id_r,
        l.first_name AS first_name_l,
        r.first_name AS first_name_r,
        -- ... other columns
    FROM __splink__df_concat_with_tf AS l
    INNER JOIN __splink__blocked_id_pairs AS b
        ON l.unique_id = b.join_key_l
    INNER JOIN __splink__df_concat_with_tf AS r
        ON r.unique_id = b.join_key_r
),
-- Stage 2 (NEW): Compute reusable function values
__splink__reusable_function_values AS (
    SELECT
        *,
        jaro_winkler_similarity(first_name_l, first_name_r) AS jws_first_name,
        jaccard(surname_l, surname_r) AS jd_surname
        -- ... other reusable computations
    FROM blocked_with_cols
),
-- Stage 3: Compute comparison vectors using reusable values
__splink__df_comparison_vectors AS (
    SELECT
        unique_id_l,
        unique_id_r,
        first_name_l,
        first_name_r,
        CASE
            WHEN first_name_l IS NULL OR first_name_r IS NULL THEN -1
            WHEN first_name_l = first_name_r THEN 3
            WHEN jws_first_name >= 0.9 THEN 2
            WHEN jws_first_name >= 0.7 THEN 1
            ELSE 0
        END AS gamma_first_name
        -- ... other comparison vectors
    FROM __splink__reusable_function_values
)
```

### Key Integration Points
### Implementation Strategy

### Testing Strategy
### Detailed SQL Pipeline Structure

The complete SQL pipeline for prediction consists of several CTEs:
### CASE Statement AST Structure

The CASE statements we need to optimize have a specific structure in SQLGlot:
Key points about the AST:
### Implementation Details
### SQLGlot Implementation Patterns

Based on the working solution in o3_mini_high.py, here's how to effectively use SQLGlot for SQL transformation:
### Key SQLGlot Patterns

### Applying to Splink

### Common Pitfalls
This document provides all the details needed to incorporate the optimization into Splink. Please ask any further questions or run additional tests as needed to help iterate on or improve this integration.

### Additional Context and Clarifications

#### Actual SQL Structure

The CASE statements we need to optimize look like:

```sql
CASE
    WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1
    WHEN "first_name_l" = "first_name_r" THEN 3
    WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9 THEN 2
    WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7 THEN 1
    ELSE 0
END AS gamma_first_name
```

#### Integration Points

The CASE statement is generated in Splink's comparison code.
#### SQLGlot AST Structure

The parsed AST for these CASE statements has this structure:

```python
Alias(
    this=Case(
        ifs=[
            If(this=<condition>, true=<value>),
            If(this=<condition>, true=<value>),
            ...
        ],
        default=Literal(this=0, is_string=False)
    ),
    alias=Identifier(this="gamma_*")
)
```

#### CTE Pipeline Structure

The full SQL query has this CTE structure:
Our optimization should be inserted between the `blocked_with_cols` stage and the `__splink__df_comparison_vectors` stage.

#### Key Requirements
#### ComparisonLevel Details

Each CASE statement is built from multiple ComparisonLevel objects, which:
#### Testing Considerations

The sandbox code provides a good test case with:
The sandbox is:
and results in:
Let's make this an argument on predict() for now, i.e. predict(experimental_optimisation=True), and find it a better home eventually.
I got Gemini to analyse the whole sqlglot codebase, and it thinks these are the relevant files for solving this task:
Test script: Details
Is this a better implementation? Details
Oh wait, if hash is usable then we can use eq, meaning we can use a different approach. Details
See #2738
Very much a work in progress for now, but the overall approach here seems to work.
See #2580
Closes #2580