AnalysisException: cannot resolve '`l.unique_id`' #1099

Mehul903 · 2023-03-06T21:21:27Z

I’ve been trying to set up a simple dedup pipeline using Splink (see code below) on Databricks, but when I execute results = linker_test.predict(threshold_match_probability=0.9), I run into this error: “AnalysisException: cannot resolve 'l.unique_id'”.

I went through some discussion threads and relevant issues raised by others, but I am still unable to get past this error message.

Databricks cluster configs: DB runtime: 9.1 LTS; Spark: 3.1.2

Appreciate any help!


from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, types
import splink.spark.spark_comparison_library as cl
from splink.spark.spark_linker import SparkLinker

conf = SparkConf()
conf.set('spark.jars', 'xxx-scala_udf_similarity_0_1_0_spark3_3-478d4.jar')
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
spark.udf.registerJavaFunction('jaro_winkler','uk.gov.moj.dash.linkage.JaroWinklerSimilarity',types.DoubleType())

df_test = spark.read.format('csv').option('header', True).load('/test.csv')

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.jaro_winkler_at_thresholds("firstName", 0.8),
        cl.jaro_winkler_at_thresholds("lastName", 0.8)
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,

    "em_convergence": 0.01
}


linker_test = SparkLinker(df_test, settings, spark = spark)
results = linker_test.predict(threshold_match_probability=0.9)  ## error occurs once this statement is executed

# df_test.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- firstName: string (nullable = true)
#  |-- lastName: string (nullable = true)
#  |-- streetnumber: string (nullable = true)
#  |-- street: string (nullable = true)
#  |-- address1: string (nullable = true)
#  |-- address2: string (nullable = true)
#  |-- areacode: string (nullable = true)
#  |-- stateCode: string (nullable = true)
#  |-- dateOfbirth: string (nullable = true)
#  |-- ssn: string (nullable = true)

Answered by RossKen

RossKen · 2023-03-07T10:31:11Z

Hey @Mehul903

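I think you are missing the unique_id_column_name parameter in your settings object. It defaults to unique_id, but the ID column in your data is called id, which would explain why Spark cannot resolve l.unique_id. I think it should work if you change your settings object to: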

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "id",
    "comparisons": [
        cl.jaro_winkler_at_thresholds("firstName", 0.8),
        cl.jaro_winkler_at_thresholds("lastName", 0.8)
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01,
}
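
To catch this kind of mismatch before calling predict, a small pre-flight check can compare the configured ID column against the columns actually present in the data. The following is a minimal sketch, not part of Splink: resolve_unique_id_column is a hypothetical helper, and the only Splink behaviour it assumes is the default of unique_id mentioned above.

```python
def resolve_unique_id_column(columns, settings):
    """Return the ID column name the settings point at, or raise if the data lacks it.

    Hypothetical helper for illustration; it is not part of Splink.
    """
    # Per the answer above, the setting falls back to "unique_id" when omitted.
    id_col = settings.get("unique_id_column_name", "unique_id")
    if id_col not in columns:
        raise ValueError(
            f"ID column '{id_col}' not found in data; available columns: {list(columns)}"
        )
    return id_col


# Column names taken from the printSchema() output in the question.
columns = ["id", "firstName", "lastName", "streetnumber", "street", "address1",
           "address2", "areacode", "stateCode", "dateOfbirth", "ssn"]

# With the corrected settings, the check passes and returns the column name.
print(resolve_unique_id_column(columns, {"unique_id_column_name": "id"}))

# Without the setting, the default name is absent from the data and the check fails.
try:
    resolve_unique_id_column(columns, {"link_type": "dedupe_only"})
except ValueError as err:
    print(err)
```

In the pipeline from the question, this could be called as resolve_unique_id_column(df_test.columns, settings) just before constructing the SparkLinker. If you would rather keep the default, renaming the column with df_test.withColumnRenamed("id", "unique_id") should have the same effect.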

1 reply

Mehul903 Mar 8, 2023
Author

Thanks @RossKen - it worked for pandas as well as PySpark DataFrames (in both local and Databricks environments)!
