Categorical dtype not preserved with fastparquet-write, pyarrow-read #920

@zmoon

Describe the issue: Not sure if this is a fastparquet or pyarrow (or pandas) issue, but I noticed that a column with pandas categorical dtype is read as object dtype if the Parquet file is created by the fastparquet engine and then read by the pyarrow engine. The other three cases preserve the dtype.

Minimal Complete Verifiable Example:

import itertools

import pandas as pd

df = pd.Series(["a", "b", "c"]).rename("cat").astype("category").to_frame()

fn = "cat.parquet"
data = []
for write, read in itertools.product(["pyarrow", "fastparquet"], repeat=2):
    df.to_parquet(fn, engine=write)
    df_ = pd.read_parquet(fn, engine=read)
    data.append((write, read, df_["cat"].dtype))

res = pd.DataFrame(data, columns=["write", "read", "dtype"])
print(res)

         write         read     dtype
0      pyarrow      pyarrow  category
1      pyarrow  fastparquet  category
2  fastparquet      pyarrow    object
3  fastparquet  fastparquet  category
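
Until the cause is pinned down, one possible stopgap is to cast the affected columns back to categorical after reading. The sketch below is only an illustration, not a fix in either engine: `restore_categoricals` is a hypothetical helper name, and the small frame is a stand-in for what `pd.read_parquet(fn, engine="pyarrow")` returns in row 2 above, so that the example runs without either Parquet engine installed.

```python
import pandas as pd


def restore_categoricals(df, columns):
    """Return a copy of df with the named columns cast to category dtype.

    Hypothetical helper for illustration; not part of pandas, pyarrow,
    or fastparquet.
    """
    out = df.copy()
    for col in columns:
        # Categories are inferred from the values present in the column.
        out[col] = out[col].astype("category")
    return out


# Stand-in for the frame read back with engine="pyarrow" from a
# fastparquet-written file, where "cat" arrives as object dtype.
df_read = pd.DataFrame({"cat": pd.Series(["a", "b", "c"], dtype=object)})

df_fixed = restore_categoricals(df_read, ["cat"])
print(df_fixed["cat"].dtype)
```

Because the categories are re-inferred from the data, any unused categories or custom category ordering in the original frame would not survive this cast; passing an explicit `pd.CategoricalDtype` to `astype` would be needed to keep those.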

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version: 3.11.3
  • Operating System:
  • Install method (conda, pip, source): pip
  • fastparquet 2024.2.0, pyarrow 15.0.0, pandas 2.2.0
