Drop pandas as a required dependency - core working version #2883
base: splink_5_dev
Conversation
Not for this PR, but I wonder if we should raise a warning if the user provides data as … That would prevent issues like this, or at least give the user more info about how to do it properly: #2874
```python
return self.as_pandas_dataframe(limit=limit).to_dict(orient="records")
```

```python
spark_df = self.db_api._execute_sql_against_backend(sql)
cols = spark_df.columns
```
I think Spark provides a way to do this more natively using:

```python
rows = spark_df.toLocalIterator()
[r.asDict(recursive=True) for r in rows]
```

or, if it's small and memory is not an issue:

```python
[r.asDict(recursive=True) for r in spark_df.collect()]
```
```sql
row_count,
sum(row_count) over
(
    order by match_key
```
I think ordering by match_key needs a cast to int. I think the previous code relied on rows being in the correct order due to original insertion order, rather than the sort.
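A minimal illustration (not Splink code) of why the cast matters: if `match_key` is stored as a string, ordering is lexicographic, so `"10"` sorts before `"2"`.

```python
# Match keys as strings, as they might come back from a SQL backend.
match_keys = ["0", "1", "2", "10"]

# Lexicographic sort - "10" lands before "2" because "1" < "2".
lexicographic = sorted(match_keys)

# Numeric sort - equivalent to `order by cast(match_key as int)` in SQL.
numeric = sorted(match_keys, key=int)

print(lexicographic)  # ['0', '1', '10', '2']
print(numeric)        # ['0', '1', '2', '10']
```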
```python
result_df = db_api.sql_pipeline_to_splink_dataframe(pipeline).as_pandas_dataframe()
table_name = "result_df"
# TODO: reuse connection if available - see similar code in EM training
con = duckdb.connect()
```
When using duckdb for computations, we should explicitly register `result_df` rather than rely on a replacement scan. Tom and I have both had problems with replacement scans for reasons largely unknown - it just 'sometimes breaks'.
RobinL left a comment
This is great, thanks - it definitely looks like it's going in the right direction. Some minor comments.
I didn't review the part around Comparisons/date casting.
This is a nice change - I think the goal of minimising dependencies is good. I just tried this out with a fresh environment to see how it goes and had a few issues. This is the result of … I run into this error when I try out a simple linkage script: … I'm not sure if you were still working through some of the TODOs in … Another place I can see a surviving … Test script: …
The primary aim of this PR is to remove `pandas` (and `numpy`) as a required dependency. The work will largely be done by `duckdb`, or `pyarrow` (for some optional functionality).

`pandas` is quite a large dependency, so it would be nice to do without it. But further to that, the main things we are using it for are at least as well served by other things. Specifically:

- `duckdb` (in a separate capacity to its role as a SQL backend), which is already required
- `pyarrow` instead, as that is what it is designed for, and has much clearer typing. `pyarrow` itself is zero-dependency, so is 'lighter' in that sense, and will anyway not be required for all functionality

Some of the places we were using it were needless - we converted to/from a `pandas` frame for no particular reason.

Main things to note in this PR:

- `pandas` and `numpy` are now no longer required
- … the `pandas` calculation is translated. Others in less key parts I have simply moved imports inside functions - this will be remedied in follow-up PRs
- … `pandas` frames, although in the main we should move away from these in examples
- `needs_pandas` - in a future PR when testing works without `pandas` we will need to test these separately, as they are direct tests of `pandas` compatibility

Further context in this discussion.