InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL) #1028
-
Hi, I can't find anything in the exception message that indicates which column is generating this error. Based on other responses, I've tried cleaning up my data frames with: This worked for a few runs, but then I needed to add columns to these data frames and the VARCHAR -> NULL exceptions returned. Is there any way to get more details on which columns are causing this exception? Thanks! Eric
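One way to narrow down candidate columns from the pandas side, before the data reaches DuckDB, is sketched below. It rests on an assumption that the thread suggests but does not confirm: that the cast failure is tied to columns that are entirely null, or whose leading rows are all null while later rows hold values. The helper name `suspect_columns` and the `sample_rows` parameter are hypothetical and are not part of Splink or DuckDB.

```python
import pandas as pd


def suspect_columns(df: pd.DataFrame, sample_rows: int = 1000) -> dict:
    """Map column name -> reason the column might be typed as NULL.

    Hypothetical diagnostic: it only inspects the DataFrame and makes
    no calls into DuckDB or Splink.
    """
    suspects = {}
    for name in df.columns:
        col = df[name]
        if col.isna().all():
            suspects[name] = "all values are null"
        elif col.head(sample_rows).isna().all():
            # Leading rows are all null, but a later row has a value
            suspects[name] = f"first {sample_rows} rows are null, later rows are not"
    return suspects


df = pd.DataFrame({
    "unique_id": range(5),
    "first_name": [None, None, None, None, "john"],
    "surname": ["smith"] * 5,
    "middle_name": [None] * 5,
})

report = suspect_columns(df, sample_rows=3)
for column, reason in report.items():
    print(f"{column}: {reason}")
```

Running this on each input frame before building the linker gives a short list of columns to inspect or cast explicitly.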
Replies: 4 comments 2 replies
-
-
Worth noting I experienced the same error for the first time today. I wonder if it's something that has crept into the latest DuckDB that perhaps wasn't present in earlier versions.
-
Thank you both!
I used the pyarrow.Table method and it worked with no problems.
It makes sense to me that it's associated with a change in the DuckDB version.
I do think that getting more detail in the exception messages would be helpful.
Happy to report this to DuckDB if that would be helpful.
(Sent in reply to Robin Linacre's comment above, Tuesday, February 7, 2023, 2:44 PM.)
-
I was getting it very frequently until I switched to the Arrow Table format, so I don't think it represents just an ephemeral corner case.
I haven't had a chance to get back and re-run my code, but I wanted to point something out. In reading about these NULLs, I saw some things indicating that a pandas Series of strings may actually be stored with the generic object dtype. I am no pandas expert (really an R guy), so I haven't done a lot of messing with that. Nonetheless, I can easily see this sort of type issue confusing DuckDB, or whatever ORM layer is trying to do the conversion to DuckDB types. It might be worth forcing your pandas DataFrame to have something other than a simple string type and seeing if that prompts the error.
Eric
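To experiment with this suggestion, the dtype that pandas assigned to each column can be inspected and then changed explicitly. The sketch below casts the text columns to the pandas `string` extension dtype. The column list is a hypothetical choice for this toy data, and the thread does not establish whether changing the dtype triggers or avoids the DuckDB error.

```python
import pandas as pd

df = pd.DataFrame({
    "unique_id": [0, 1, 2],
    "first_name": [None, None, "john"],
    "surname": ["smith", "smith", "smith"],
})

# Show how pandas typed each column by default
print(df.dtypes)

# Hypothetical list of columns that are meant to hold text
text_columns = ["first_name", "surname"]

# Cast them to the dedicated string extension dtype; missing values become pd.NA
typed = df.astype({name: "string" for name in text_columns})
print(typed.dtypes)
```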
Quoted message from Robin Linacre, sent Sunday, February 12, 2023 2:50 AM:
Struggling to pin this down. I was definitely getting this error the other day. Here's an example that might reproduce the bug, but I'm not getting any errors when I run it:
import pandas as pd
from splink.duckdb.duckdb_linker import DuckDBLinker
# 10,000 rows with a null first_name, followed by one row that has a value
data = [{"first_name": None, "surname": "smith"} for i in range(10000)]
data.append({"first_name": "john", "surname": "smith"})
# Turn the row index into the unique_id column
data = pd.DataFrame(data).reset_index()
data = data.rename(columns={"index": "unique_id"})
data.to_csv("nully.csv", index=False)
import splink
print(splink.__version__)
# Read the data back from CSV and profile the mostly-null column
df = pd.read_csv("nully.csv")
settings = {"link_type": "dedupe_only"}
linker_exploratory = DuckDBLinker(df, settings)
linker_exploratory.profile_columns(["first_name"])
I haven't looked at this issue in any detail yet, but it has come up here: #1024 due to very NULL-y columns. One thing that might be worth trying is to use pyarrow.feather.read_feather to read in your data (if you aren't already). Are you using DuckDB?