Open
Description
Describe the issue: I'm not sure whether this is a fastparquet, pyarrow, or pandas issue, but a column with the pandas categorical dtype is read back as object dtype when the Parquet file is written by the fastparquet engine and then read by the pyarrow engine. The other three write/read engine combinations preserve the dtype.
Minimal Complete Verifiable Example:
import itertools
import pandas as pd
df = pd.Series(["a", "b", "c"]).rename("cat").astype("category").to_frame()
fn = "cat.parquet"
data = []
for write, read in itertools.product(["pyarrow", "fastparquet"], repeat=2):
    df.to_parquet(fn, engine=write)
    df_ = pd.read_parquet(fn, engine=read)
    data.append((write, read, df_["cat"].dtype))
res = pd.DataFrame(data, columns=["write", "read", "dtype"])
print(res)
         write         read     dtype
0      pyarrow      pyarrow  category
1      pyarrow  fastparquet  category
2  fastparquet      pyarrow    object
3  fastparquet  fastparquet  category
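To narrow down which side loses the dtype, it may help to check whether the file written by fastparquet still declares the column as categorical in its pandas metadata. The following is a minimal diagnostic sketch, not a fix: `categorical_columns` is a hypothetical helper, and it assumes the file carries a JSON document under the `pandas` key-value metadata entry in the layout pandas documents for Parquet (a `columns` list whose entries have `name` and `pandas_type`). The sample input is hand-written for illustration and does not come from a real file.

```python
import json


def categorical_columns(pandas_metadata):
    """Return names of columns that the pandas metadata marks as categorical.

    `pandas_metadata` is the JSON document (str or bytes) stored under the
    "pandas" key of a Parquet file's key-value metadata.
    """
    meta = json.loads(pandas_metadata)
    return [
        col["name"]
        for col in meta.get("columns", [])
        if col.get("pandas_type") == "categorical"
    ]


# Hand-written sample in the assumed metadata layout (not taken from a real
# file); a real document carries more keys than shown here.
sample = json.dumps({
    "columns": [
        {"name": "cat", "pandas_type": "categorical", "numpy_type": "int8",
         "metadata": {"num_categories": 3, "ordered": False}},
        {"name": "x", "pandas_type": "int64", "numpy_type": "int64",
         "metadata": None},
    ]
})
print(categorical_columns(sample))
```

With pyarrow installed, the real metadata for the file above should be obtainable via `pyarrow.parquet.read_schema(fn).metadata[b"pandas"]`, assuming the writer stored that key. Comparing the result for a pyarrow-written and a fastparquet-written file would show whether the two engines record the categorical column differently.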
Anything else we need to know?:
Environment:
- Dask version:
- Python version: 3.11.3
- Operating System:
- Install method (conda, pip, source): pip
- fastparquet 2024.2.0, pyarrow 15.0.0, pandas 2.2.0