
Conversation

@ADBond (Contributor) commented Jan 14, 2026

The primary aim of this PR is to remove pandas (and numpy) as required dependencies. Their work will largely be taken over by duckdb, or by pyarrow (for some optional functionality).

pandas is quite a large dependency, so it would be nice to do without it. But further to that, the main things we are using it for are at least as well served by other things. Specifically:

  • where we use it for small ad-hoc bits of calculation, we should either do it in native python, or in duckdb (in a separate capacity to its role as a SQL backend), which is already required
  • where we use it as a data transport layer, we should use pyarrow instead, as that is what it is designed for, and it has much clearer typing. pyarrow itself is zero-dependency, so is 'lighter' in that sense, and will in any case not be required for all functionality
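As a rough sketch of the pyarrow-as-transport idea (illustrative only - the query and column names are made up, not taken from this PR):

import duckdb

con = duckdb.connect()
# fetch results as a pyarrow Table rather than a pandas DataFrame, so pandas
# and numpy are never needed as a transport layer
arrow_table = con.execute("select 1 as unique_id, 'a' as first_name").arrow()
records = arrow_table.to_pylist()  # plain-python dicts, if records are needed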

Some of the places where we were using it were needless - we converted to/from a pandas frame for no particular reason.

Main things to note in this PR:

  • pandas and numpy are no longer required
  • some pandas calculation has been translated; in other, less key parts I have simply moved imports inside functions - this will be remedied in follow-up PRs
  • tests still require pandas for now
  • we will keep support for ingesting pandas frames, although in the main we should move away from these in examples
  • I've had to fiddle with the date-difference levels a little, as stricter typing (date columns) revealed some issues - see this comment
  • there is a new pytest mark, needs_pandas - in a future PR, when testing works without pandas, these tests will need to be run separately, as they are direct tests of pandas compatibility (see the sketch after this list)
  • in the course of this, I've deleted a few bits of code that were not used anywhere
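As a sketch of how the needs_pandas mark might look in a test (illustrative only - the test name, body, and mark registration in the PR may differ):

import pytest

# tests that exercise pandas interop directly get the custom mark, so they
# can be selected or skipped with `-m needs_pandas` / `-m "not needs_pandas"`
@pytest.mark.needs_pandas
def test_ingest_pandas_frame():
    pd = pytest.importorskip("pandas")  # skip cleanly when pandas isn't installed
    df = pd.DataFrame({"unique_id": [1, 2], "first_name": ["a", "b"]})
    assert len(df) == 2  # placeholder; a real test would register df with the backend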

Further context in this discussion.

@ADBond added the dependencies and splink_5 labels Jan 14, 2026
@ADBond requested a review from RobinL January 14, 2026 11:48
@RobinL (Member) commented Jan 15, 2026

Not for this PR, but I wonder if we should raise a warning if the user provides data as a dict or list, with a suggestion that they provide it as an arrow table with an explicit schema.

That would prevent issues like this, or at least give the user more info about how to do it properly: #2874
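For example, something along these lines (a sketch - the column names are just for illustration):

import pyarrow as pa
from datetime import date

# an explicit schema means types (e.g. dates) don't have to be inferred
schema = pa.schema([
    ("unique_id", pa.int64()),
    ("first_name", pa.string()),
    ("dob", pa.date32()),
])
table = pa.table(
    {
        "unique_id": [1, 2],
        "first_name": ["a", "b"],
        "dob": [date(1990, 1, 1), date(1991, 2, 2)],
    },
    schema=schema,
)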


return self.as_pandas_dataframe(limit=limit).to_dict(orient="records")
spark_df = self.db_api._execute_sql_against_backend(sql)
cols = spark_df.columns
Member

I think Spark provides a way to do this more natively using

rows = spark_df.toLocalIterator()
[r.asDict(recursive=True) for r in rows]

or if it's small/memory not an issue

[r.asDict(recursive=True) for r in spark_df.collect()]

row_count,
sum(row_count) over
(
order by match_key
@RobinL (Member) commented Jan 15, 2026

I think the ordering by match_key needs a cast to int. I think the previous code relied on the rows being in the correct order due to original insertion order rather than a sort.
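To illustrate the point (a minimal sketch, not the PR's SQL): lexicographic ordering of match_key strings puts '10' before '2'.

import duckdb

duckdb.sql("select m from (values ('1'), ('2'), ('10')) as t(m) order by m").fetchall()
# [('1',), ('10',), ('2',)]  -- string sort
duckdb.sql("select m from (values ('1'), ('2'), ('10')) as t(m) order by cast(m as int)").fetchall()
# [('1',), ('2',), ('10',)]  -- numeric sort, which is what the running total needs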

result_df = db_api.sql_pipeline_to_splink_dataframe(pipeline).as_pandas_dataframe()
table_name = "result_df"
# TODO: reuse connection if available - see similar code in EM training
con = duckdb.connect()
@RobinL (Member) commented Jan 15, 2026

When using duckdb for computations, we should explicitly register result_df rather than rely on a replacement scan. Tom and I have both had problems with replacement scans for reasons largely unknown; it just 'sometimes breaks'.
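Something like the following, for example (a sketch - the arrow table here just stands in for whatever the pipeline produced):

import duckdb
import pyarrow as pa

result_df = pa.table({"match_key": ["0", "1"], "row_count": [10, 5]})  # stand-in data

con = duckdb.connect()
# make the dependency explicit rather than relying on duckdb finding
# `result_df` in the calling scope via a replacement scan
con.register("result_df", result_df)
rows = con.execute("select * from result_df").fetchall()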

@RobinL (Member) left a comment

This is great, thanks - definitely looks like it's going in the right direction. Some minor comments.

I didn't review the part around Comparisons/date casting.

@aymonwuolanne (Contributor) commented:

This is a nice change - I think the goal of minimising dependencies is good. I just tried it out with a fresh environment to see how it goes and had a few issues.

This is the result of pip list (noting no pyarrow or pandas):

Package                   Version
------------------------- ----------
altair                    6.0.0
attrs                     25.4.0
duckdb                    1.4.3
igraph                    1.0.0
Jinja2                    3.1.6
jsonschema                4.26.0
jsonschema-specifications 2025.9.1
MarkupSafe                3.0.3
narwhals                  2.15.0
packaging                 25.0
pip                       24.0
referencing               0.37.0
rpds-py                   0.30.0
splink                    5.0.0.dev1
sqlglot                   28.6.0
texttable                 1.7.0
typing_extensions         4.15.0

I run into this error when I try out a simple linkage script:

----- Estimating u probabilities using random sampling -----
Traceback (most recent call last):
  File "/home/aymon/splink/mytest.py", line 27, in <module>
    linker.training.estimate_u_using_random_sampling(1e6)
  File "/home/aymon/splink/splink/internals/linker_components/training.py", line 210, in estimate_u_using_random_sampling
    estimate_u_values(self._linker, max_pairs, seed)
  File "/home/aymon/splink/splink/internals/estimate_u.py", line 207, in estimate_u_values
    param_records = compute_proportions_for_new_parameters(df_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aymon/splink/splink/internals/expectation_maximisation.py", line 124, in compute_proportions_for_new_parameters
    m_u_df = df_params.as_pandas_dataframe()  # noqa: F841 (unused variable)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aymon/splink/splink/internals/duckdb/dataframe.py", line 59, in as_pandas_dataframe
    return self.db_api._execute_sql_against_backend(sql).to_df()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'numpy'

I'm not sure if you were still working through some of the TODOs in expectation_maximisation.py, in which case maybe this will be fixed before merging (in that particular spot you already have a function to generate the SQL, so I wonder if you can just execute that against the backend?). Otherwise, I think this change could be a bit unfriendly, and it might be worth keeping these packages as dependencies until more of the key uses are covered without pandas.
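For context on where that trace ends up (a minimal sketch of the duckdb behaviour, not splink code): it is the .to_df() call that builds a pandas DataFrame and therefore needs numpy.

import duckdb

result = duckdb.connect().execute("select 42 as answer")
rows = result.fetchall()   # plain python tuples - no pandas or numpy involved
# result.to_df()           # this is the call that raises ModuleNotFoundError
#                          # in an environment without pandas/numpy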

Another place I can see a surviving as_pandas_dataframe is in the blocking analysis, which I think is also key enough that it should probably work without optional dependencies.

Test script
import duckdb
from splink import Linker, SettingsCreator, block_on, DuckDBAPI
import splink.comparison_library as cl

con = duckdb.connect()

# read from a file rather than use splink_datasets because that needs pandas
df = con.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df.create("df_table")

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
    ]
)

db_api = DuckDBAPI(con)

sdf = db_api.register("df_table", "df")

linker = Linker(sdf, settings)

linker.training.estimate_u_using_random_sampling(1e6)
