
Conversation

@ADBond (Contributor) commented Jan 14, 2026

The primary aim of this PR is to remove pandas (and numpy) as required dependencies. Their work will largely be taken over by duckdb, or by pyarrow (for some optional functionality).

pandas is quite a large dependency, so it would be nice to do without it. But further to that, the main things we are using it for are at least as well served by other things. Specifically:

  • where we use it for small ad-hoc bits of calculation, we should either do it in native python, or in duckdb (in a separate capacity to its role as a SQL backend), which is already required
  • where we use it as a data transport layer, we should use pyarrow instead, as that is what it is designed for, and it has much clearer typing. pyarrow itself is zero-dependency, so is 'lighter' in that sense, and will in any case not be required for all functionality
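As a rough sketch of the pyarrow-as-transport idea (illustrative only - the query and column names are made up, not taken from this PR):

import duckdb

con = duckdb.connect()
# fetch results as a pyarrow Table rather than a pandas DataFrame, so pandas
# and numpy are never needed as a transport layer
arrow_table = con.execute("select 1 as unique_id, 'a' as first_name").arrow()
records = arrow_table.to_pylist()  # plain-python dicts, if records are needed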

Some of the places where we were using it were needless - we converted to/from a pandas frame for no particular reason.

Main things to note in this PR:

  • pandas and numpy are no longer required
  • some pandas calculation has been translated; in other, less key parts I have simply moved imports inside functions - this will be remedied in follow-up PRs
  • tests still require pandas for now
  • we will keep support for ingesting pandas frames, although in the main we should move away from these in examples
  • I've had to fiddle with the date-difference levels a little, as stricter typing (date columns) revealed some issues - see this comment
  • there is a new pytest mark, needs_pandas - in a future PR, when testing works without pandas, these tests will need to be run separately, as they are direct tests of pandas compatibility (see the sketch after this list)
  • in the course of this, I've deleted a few bits of code that were not used anywhere
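As a sketch of how the needs_pandas mark might look in a test (illustrative only - the test name, body, and mark registration in the PR may differ):

import pytest

# tests that exercise pandas interop directly get the custom mark, so they
# can be selected or skipped with `-m needs_pandas` / `-m "not needs_pandas"`
@pytest.mark.needs_pandas
def test_ingest_pandas_frame():
    pd = pytest.importorskip("pandas")  # skip cleanly when pandas isn't installed
    df = pd.DataFrame({"unique_id": [1, 2], "first_name": ["a", "b"]})
    assert len(df) == 2  # placeholder; a real test would register df with the backend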

Further context in this discussion.

@ADBond added the dependencies and splink_5 labels Jan 14, 2026
@ADBond requested a review from RobinL January 14, 2026 11:48
@RobinL (Member) commented Jan 15, 2026

Not for this PR, but I wonder if we should raise a warning if the user provides data as a dict or list, with a suggestion that they provide it as an arrow table with an explicit schema.

That would prevent issues like this, or at least give the user more info about how to do it properly: #2874
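For example, something along these lines (a sketch - the column names are just for illustration):

import pyarrow as pa
from datetime import date

# an explicit schema means types (e.g. dates) don't have to be inferred
schema = pa.schema([
    ("unique_id", pa.int64()),
    ("first_name", pa.string()),
    ("dob", pa.date32()),
])
table = pa.table(
    {
        "unique_id": [1, 2],
        "first_name": ["a", "b"],
        "dob": [date(1990, 1, 1), date(1991, 2, 2)],
    },
    schema=schema,
)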


return self.as_pandas_dataframe(limit=limit).to_dict(orient="records")
spark_df = self.db_api._execute_sql_against_backend(sql)
cols = spark_df.columns
Member

I think Spark provides a way to do this more natively using

rows = spark_df.toLocalIterator()
[r.asDict(recursive=True) for r in rows]

or if it's small/memory not an issue

[r.asDict(recursive=True) for r in spark_df.collect()]

row_count,
sum(row_count) over
(
order by match_key
@RobinL (Member) commented Jan 15, 2026

I think the ordering by match_key needs a cast to int. I think the previous code relied on the rows being in the correct order due to original insertion order rather than a sort.
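To illustrate the point (a minimal sketch, not the PR's SQL): lexicographic ordering of match_key strings puts '10' before '2'.

import duckdb

duckdb.sql("select m from (values ('1'), ('2'), ('10')) as t(m) order by m").fetchall()
# [('1',), ('10',), ('2',)]  -- string sort
duckdb.sql("select m from (values ('1'), ('2'), ('10')) as t(m) order by cast(m as int)").fetchall()
# [('1',), ('2',), ('10',)]  -- numeric sort, which is what the running total needs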

result_df = db_api.sql_pipeline_to_splink_dataframe(pipeline).as_pandas_dataframe()
table_name = "result_df"
# TODO: reuse connection if available - see similar code in EM training
con = duckdb.connect()
@RobinL (Member) commented Jan 15, 2026

When using duckdb for computations, we should explicitly register result_df rather than rely on a replacement scan. Tom and I have both had problems with replacement scans for reasons largely unknown; it just 'sometimes breaks'.
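Something like the following, for example (a sketch - the arrow table here just stands in for whatever the pipeline produced):

import duckdb
import pyarrow as pa

result_df = pa.table({"match_key": ["0", "1"], "row_count": [10, 5]})  # stand-in data

con = duckdb.connect()
# make the dependency explicit rather than relying on duckdb finding
# `result_df` in the calling scope via a replacement scan
con.register("result_df", result_df)
rows = con.execute("select * from result_df").fetchall()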

@RobinL (Member) left a comment

This is great, thanks - definitely looks like it's going in the right direction. Some minor comments.

I didn't review the part around Comparisons/date casting.

@aymonwuolanne (Contributor) commented:

This is a nice change - I think the goal of minimising dependencies is good. I just tried it out with a fresh environment to see how it goes and had a few issues.

This is the result of pip list (noting no pyarrow or pandas):

Package                   Version
------------------------- ----------
altair                    6.0.0
attrs                     25.4.0
duckdb                    1.4.3
igraph                    1.0.0
Jinja2                    3.1.6
jsonschema                4.26.0
jsonschema-specifications 2025.9.1
MarkupSafe                3.0.3
narwhals                  2.15.0
packaging                 25.0
pip                       24.0
referencing               0.37.0
rpds-py                   0.30.0
splink                    5.0.0.dev1
sqlglot                   28.6.0
texttable                 1.7.0
typing_extensions         4.15.0

I run into this error when I try out a simple linkage script:

----- Estimating u probabilities using random sampling -----
Traceback (most recent call last):
  File "/home/aymon/splink/mytest.py", line 27, in <module>
    linker.training.estimate_u_using_random_sampling(1e6)
  File "/home/aymon/splink/splink/internals/linker_components/training.py", line 210, in estimate_u_using_random_sampling
    estimate_u_values(self._linker, max_pairs, seed)
  File "/home/aymon/splink/splink/internals/estimate_u.py", line 207, in estimate_u_values
    param_records = compute_proportions_for_new_parameters(df_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aymon/splink/splink/internals/expectation_maximisation.py", line 124, in compute_proportions_for_new_parameters
    m_u_df = df_params.as_pandas_dataframe()  # noqa: F841 (unused variable)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aymon/splink/splink/internals/duckdb/dataframe.py", line 59, in as_pandas_dataframe
    return self.db_api._execute_sql_against_backend(sql).to_df()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'numpy'

I'm not sure if you were still working through some of the TODOs in expectation_maximisation.py, in which case maybe this will be fixed before merging (in that particular spot you already have a function to generate the SQL, so I wonder if you can just execute that against the backend?). Otherwise, I think this change could be a bit unfriendly, and it might be worth keeping these packages as dependencies until more of the key uses are covered without pandas.
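For context on where that trace ends up (a minimal sketch of the duckdb behaviour, not splink code): it is the .to_df() call that builds a pandas DataFrame and therefore needs numpy.

import duckdb

result = duckdb.connect().execute("select 42 as answer")
rows = result.fetchall()   # plain python tuples - no pandas or numpy involved
# result.to_df()           # this is the call that raises ModuleNotFoundError
#                          # in an environment without pandas/numpy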

Another place I can see a surviving as_pandas_dataframe is in the blocking analysis, which I think is also key enough that it should probably work without optional dependencies.

Test script
import duckdb
from splink import Linker, SettingsCreator, block_on, DuckDBAPI
import splink.comparison_library as cl

con = duckdb.connect()

# read from a file rather than use splink_datasets because that needs pandas
df = con.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
df.create("df_table")

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
    ]
)

db_api = DuckDBAPI(con)

sdf = db_api.register("df_table", "df")

linker = Linker(sdf, settings)

linker.training.estimate_u_using_random_sampling(1e6)
