Description
Is your feature request related to a problem? Please describe.
Some of our catalogues take a long time to load, which has a negative impact on user experience.
Describe the solution you'd like
I've previously had a lot of success speeding up dataframe operations by doing everything in Polars and converting back to a pandas dataframe only where necessary. A very basic test, with next to no work, gives a ~2.5x speed increase: see the diff and screencap.
- df = pd.read_csv(
- cat.catalog_file,
- storage_options=storage_options,
- **read_csv_kwargs,
- )
- else:
- df = pd.DataFrame(cat.catalog_dict)
+ read_csv_kwargs.pop('converters', None)  # Hack: pl.read_csv has no `converters` argument
+ df = pl.read_csv(
+ cat.catalog_file,
+ storage_options=storage_options,
+ **read_csv_kwargs,
+ ).to_pandas()
+ else:
+ df = pl.DataFrame(cat.catalog_dict).to_pandas()
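The `converters` workaround in the diff could perhaps be made less of a hack by splitting the keyword arguments up front and applying the converters after loading. This is only a rough sketch: `PANDAS_ONLY_KWARGS`, `split_read_csv_kwargs` and `read_catalog_csv` are hypothetical names, not part of the existing code, and other pandas keywords (for example `sep` or `dtype`) would likely need translating as well.

```python
from __future__ import annotations

from typing import Any

# Assumption: pandas-only read_csv keywords that pl.read_csv does not accept.
# Only "converters" comes from the diff above; extend this set as needed.
PANDAS_ONLY_KWARGS = frozenset({"converters"})


def split_read_csv_kwargs(
    kwargs: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split kwargs into (passed to Polars, held back as pandas-only)."""
    polars_kwargs = {k: v for k, v in kwargs.items() if k not in PANDAS_ONLY_KWARGS}
    pandas_only = {k: v for k, v in kwargs.items() if k in PANDAS_ONLY_KWARGS}
    return polars_kwargs, pandas_only


def read_catalog_csv(path, storage_options=None, **read_csv_kwargs):
    """Read a CSV with Polars and return a pandas DataFrame."""
    # Imported lazily; to_pandas() also needs pandas and pyarrow installed.
    import polars as pl

    polars_kwargs, pandas_only = split_read_csv_kwargs(read_csv_kwargs)
    df = pl.read_csv(
        path, storage_options=storage_options, **polars_kwargs
    ).to_pandas()

    # Apply pandas-style converters column by column after loading.
    # Caveat: pandas hands converters the raw cell string, whereas here they
    # receive values Polars has already parsed, so results can differ for
    # non-string columns and for missing values.
    for column, func in pandas_only.get("converters", {}).items():
        if column in df.columns:
            df[column] = df[column].map(func)
    return df
```

Keeping the split in a separate function means the original `read_csv_kwargs` dict is not mutated, unlike the `pop` call in the diff.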
Making this work properly would require some code changes (the diff currently breaks a few tests), but I'm fairly confident that I could do this relatively quickly, and that it wouldn't take a great deal of effort to take the performance benefit from ~2.5x to ~10-100x.
I would advocate doing all actual dataframe operations in Polars and converting back to a pandas dataframe only when the user asks for one, to avoid any user-facing changes.
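A minimal sketch of that idea, assuming the catalogue keeps a Polars frame internally and only builds the pandas frame on first access. `CatalogFrame` and its `df` property are hypothetical names for illustration, not the project's actual API:

```python
from typing import Any, Optional


class CatalogFrame:
    """Hold a Polars frame internally; convert to pandas only on demand."""

    def __init__(self, polars_df: Any) -> None:
        # Anything with a .to_pandas() method works, e.g. a polars.DataFrame.
        self._pl_df = polars_df
        self._pd_df: Optional[Any] = None

    @property
    def polars(self) -> Any:
        """Internal frame, used for filtering, searching, grouping and so on."""
        return self._pl_df

    @property
    def df(self) -> Any:
        """User-facing pandas frame, converted once and then cached."""
        if self._pd_df is None:
            self._pd_df = self._pl_df.to_pandas()
        return self._pd_df
```

With this shape, internal code would work against `.polars`, while existing user code that reads `.df` would keep receiving a pandas dataframe and pay the conversion cost only once. The cache would need invalidating if the internal frame is ever replaced.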
NB: Polars has no required dependencies, so this has the additional benefit of not making solving the environment a gigantic pain.
Is this something people would be interested in?