InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL) #1028

checkbook-org · 2023-02-07T04:48:02Z

checkbook-org
Feb 7, 2023

Hi,
I am running Splink on a couple of data frames that I originally generated in R. I read them into my Python code using read_feather.
In several Splink functions, e.g.:
linker.profile_columns(["first_name", "last_name", "gender_code"])
count = linker.count_num_comparisons_from_blocking_rule(deterministic_rules)
I get exceptions like:
InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL)

I can't find anything in the exception messages that indicates which column is generating this error. Based on other responses, I've tried cleaning up my data frames with:
df_l=df_l.replace(r'^\s*$',None,regex=True)
df_l=df_l.fillna(np.nan).replace([np.nan, pd.NA], [None, None])
df_l["state_name"]=df_l["state_name"].str.upper()

This worked for a few runs, but then I needed to add columns to these data frames and the VARCHAR -> NULL exceptions returned.

Is there any way to get more details on which columns are causing this exception?

Thanks!

Eric
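
One way to narrow down which column triggers the exception, assuming the failure can be reproduced on one column at a time, is to run the failing operation on each column separately and record which ones raise. The sketch below is only an illustration: `find_failing_columns` and `probe` are hypothetical names, not part of Splink or DuckDB, and `probe` stands in for whatever call fails in your own code.

```python
import pandas as pd


def find_failing_columns(df, probe):
    """Run `probe` on each single-column frame and collect the failures.

    `probe` is any callable that takes a DataFrame and raises on bad input
    (a hypothetical stand-in for the call that fails in your own code).
    Returns a dict mapping column name -> error message.
    """
    failures = {}
    for col in df.columns:
        try:
            probe(df[[col]])
        except Exception as exc:  # deliberately broad: the goal is only to locate the column
            failures[col] = f"{type(exc).__name__}: {exc}"
    return failures


# Toy demonstration with a stand-in probe that rejects all-null columns.
def reject_all_null(frame):
    if frame.isna().all().all():
        raise ValueError("column contains only nulls")


toy = pd.DataFrame({"first_name": [None, None], "surname": ["smith", "jones"]})
failing = find_failing_columns(toy, reject_all_null)
print(failing)
```

In practice the probe might load the single-column frame into DuckDB and run a simple query on it. Whether that reproduces this particular cast error is not something this sketch verifies.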

Answered by ADBond

ADBond · 2023-02-07T09:20:35Z

ADBond
Feb 7, 2023
Maintainer

I haven't looked at this issue in any detail yet but it has come up here: #1024 due to very NULL-y columns. One thing that might be worth trying is to use pyarrow.feather.read_feather to read in your data (if you aren't already). Are you using DuckDB?

0 replies

RobinL · 2023-02-07T19:44:26Z

RobinL
Feb 7, 2023
Maintainer

Worth noting I experienced the same error for the first time today. I wonder if it's something that has crept into the latest DuckDB release and wasn't present in earlier versions.

0 replies

checkbook-org · 2023-02-07T20:39:52Z

checkbook-org
Feb 7, 2023
Author

Thank you both! I used the pyarrow.Table method and it worked with no problems. It makes sense to me that it's associated with a change in DuckDB version. I do think that getting more detail in the exception messages would be helpful. Happy to report this to DuckDB if that would be useful.

2 replies

RobinL Feb 7, 2023
Maintainer

Thanks. I'll try to take a look soon to make sure it's not due to a change on our side, and to verify it's definitely a DuckDB issue. But yeah, it would definitely be good to pin this down a bit better.

RobinL Feb 12, 2023
Maintainer

Struggling to pin this down. I was definitely getting this error the other day. Here's an example that might reproduce the bug, but I'm not getting any errors running it:

import pandas as pd
from splink.duckdb.duckdb_linker import DuckDBLinker

# 10,000 rows with a null first_name, plus a single non-null row
data = [{'first_name': None, "surname": "smith"} for i in range(10000)]
data.append({'first_name': "john", "surname": "smith"})
# Use the row index as the unique_id column
data = pd.DataFrame(data).reset_index()
data = data.rename(columns={"index": "unique_id"})
# Round-trip through CSV so pandas infers the column types again on read
data.to_csv("nully.csv", index=False)

df = pd.read_csv("nully.csv")

settings = {"link_type": "dedupe_only"}

# Profile the almost entirely null column
linker_exploratory = DuckDBLinker(df, settings)
linker_exploratory.profile_columns(["first_name"])

checkbook-org · 2023-02-12T20:15:19Z

checkbook-org
Feb 12, 2023
Author

I was getting it very frequently until I switched to the Arrow Table format, so I don't think it is just an ephemeral corner case. I haven't had a chance to get back and re-run my code, but I wanted to point something out. In reading about these NULLs, I saw some indications that pandas Series holding strings may actually be represented as objects. I am no pandas expert (really an R guy), so I haven't experimented much with that. Nonetheless, I can easily see this sort of type issue confusing DuckDB, or whatever ORM layer is doing the conversion to DuckDB types. It might be worth forcing your pandas DataFrame to have something other than a simple string type and seeing if that prompts the error.

Eric
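
To make the point about object-typed columns concrete, here is a minimal sketch that lists the `object` columns of a data frame along with their null fraction and the Python types they actually contain. The helper name `summarise_object_columns` and the toy data are hypothetical. The final cast to the pandas `"string"` dtype is shown only as one thing to experiment with, not as a confirmed fix for the DuckDB error.

```python
import pandas as pd


def summarise_object_columns(df):
    """Report the null fraction and Python types found in each object column."""
    rows = []
    for col in df.columns:
        if df[col].dtype != object:
            continue  # skip numeric and other explicitly typed columns
        non_null = df[col].dropna()
        rows.append({
            "column": col,
            "null_fraction": float(df[col].isna().mean()),
            "python_types": sorted({type(v).__name__ for v in non_null}),
        })
    return pd.DataFrame(rows, columns=["column", "null_fraction", "python_types"])


# Toy data: object dtype is forced so the example behaves the same across pandas versions.
df = pd.DataFrame({
    "unique_id": [0, 1, 2, 3],
    "first_name": pd.Series([None, None, None, "john"], dtype=object),
    "all_null": pd.Series([None] * 4, dtype=object),
})

summary = summarise_object_columns(df)
print(summary)

# Give the string column an explicit string dtype instead of leaving it as object.
typed = df.astype({"first_name": "string"})
print(typed.dtypes)
```

A column whose `python_types` list is empty contains only nulls, so there is nothing from which a string type could be inferred.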

0 replies
