
ArrowTypeError when reading a parquet dataset that's partitioned on a large_string column #47177

@TomAugspurger

Describe the bug, including details regarding any error messages, version, and platform.

When using pyarrow.dataset.dataset to read a parquet dataset that's partitioned on a large_string field, I see an ArrowTypeError: Unable to merge: Field part has incompatible types: large_string vs string:

import pyarrow as pa
import pyarrow.dataset
import pathlib
import pyarrow.parquet


t = pa.table(
    {"part": pa.array(["a", "a", "b", "b"], type=pa.large_string()), "col": [1, 2, 3, 4]}
)
root = pathlib.Path("string.parquet")
a = root / "a/data.parquet"
b = root / "b/data.parquet"

a.parent.mkdir(parents=True, exist_ok=True)
b.parent.mkdir(parents=True, exist_ok=True)

pyarrow.parquet.write_table(t[:2], a)
pyarrow.parquet.write_table(t[2:], b)

source = list(root.glob("**/*.parquet"))
t = pyarrow.dataset.dataset(root, partitioning=["part"], partition_base_dir=str(root))

errors with

Traceback (most recent call last):
  File "/Users/toaugspurger/gh/dask/dask/debug.py", line 21, in <module>
    t = pyarrow.dataset.dataset(root, partitioning=["part"], partition_base_dir=str(root))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/toaugspurger/gh/dask/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 790, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/toaugspurger/gh/dask/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 482, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3196, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field part has incompatible types: large_string vs string

I don't observe that failure with type=pa.string(), just large_string().

Component(s)

C++, Python
