Polars backend #705

@charles-turner-1

Is your feature request related to a problem? Please describe.
Some of our catalogues can take a long time to load, which hurts the user experience.

Describe the solution you'd like
I've previously had a lot of success speeding up dataframe operations by doing everything in Polars and converting back to a pandas dataframe only where necessary. In a very basic test, with next to no work, this gives a ~2.5x speed increase: see the diff and screencap.

-                df = pd.read_csv(
-                    cat.catalog_file,
-                    storage_options=storage_options,
-                    **read_csv_kwargs,
-                )
-            else:
-                df = pd.DataFrame(cat.catalog_dict)
+                read_csv_kwargs.pop('converters', None)  # Hack: pl.read_csv does not accept pandas' 'converters' argument
+                df = pl.read_csv(
+                    cat.catalog_file,
+                    storage_options=storage_options,
+                    **read_csv_kwargs,
+                ).to_pandas()
+            else:
+                df = pl.DataFrame(cat.catalog_dict).to_pandas()
Image
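The `converters` hack in the diff hints at a more general problem: keyword arguments written for `pd.read_csv` will not all be valid for `pl.read_csv`. One possible way to handle this, sketched below, is to split the kwargs by inspecting the target function's signature. The helper `split_kwargs` and the stand-in `fake_reader` are hypothetical names used for illustration only; in practice one would pass `pl.read_csv` as `func`. Note that this only checks parameter names, not whether an argument means the same thing in both libraries.

```python
import inspect


def split_kwargs(func, kwargs):
    """Split kwargs into (accepted, rejected) based on func's signature.

    Hypothetical helper: it only compares parameter names, so an argument
    that exists in both libraries with different semantics is not caught.
    """
    params = inspect.signature(func).parameters
    # If func takes **kwargs, every keyword is accepted by name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    accepted = {k: v for k, v in kwargs.items() if k in params}
    rejected = {k: v for k, v in kwargs.items() if k not in params}
    return accepted, rejected


def fake_reader(source, *, separator=",", has_header=True):
    """Stand-in for a CSV reader, used only to demonstrate the helper."""
    return (source, separator, has_header)


if __name__ == "__main__":
    pandas_style_kwargs = {"separator": ";", "converters": {"col": str}}
    accepted, rejected = split_kwargs(fake_reader, pandas_style_kwargs)
    print("passed through:", accepted)
    print("needs translating or dropping:", sorted(rejected))
```

Reporting the rejected kwargs, rather than silently dropping them, makes it easier to see which pandas-specific options still need a Polars equivalent.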

Making this work properly would require some code changes (the change above currently breaks a few tests), but I'm fairly confident that I could do this relatively quickly, and that it wouldn't take a great deal of effort to take the performance benefit from ~2.5x to ~10-100x.

I would advocate doing all actual dataframe operations in Polars and converting back to a pandas dataframe only when the user asks for one, to avoid any user-facing changes.

NB: Polars has no required dependencies, so adding it has the additional benefit of not making solving the environment a gigantic pain.

Is this something people would be interested in?