Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
When using pyarrow.dataset.dataset to read a partitioned parquet dataset that's partitioned on a large_string field, I see an ArrowTypeError: Unable to merge: Field part has incompatible types: large_string vs string:
import pathlib

import pyarrow as pa
import pyarrow.dataset
import pyarrow.parquet

# Table whose partition column "part" is large_string rather than string.
t = pa.table(
    {"part": pa.array(["a", "a", "b", "b"], type=pa.large_string()), "col": [1, 2, 3, 4]}
)

# Write one file per partition value under string.parquet/a and string.parquet/b.
root = pathlib.Path("string.parquet")
a = root / "a/data.parquet"
b = root / "b/data.parquet"
a.parent.mkdir(parents=True, exist_ok=True)
b.parent.mkdir(parents=True, exist_ok=True)
pyarrow.parquet.write_table(t[:2], a)
pyarrow.parquet.write_table(t[2:], b)
source = list(root.glob("**/*.parquet"))

# Reading the directory back as a dataset partitioned on "part" raises.
t = pyarrow.dataset.dataset(root, partitioning=["part"], partition_base_dir=str(root))
errors with
Traceback (most recent call last):
File "/Users/toaugspurger/gh/dask/dask/debug.py", line 21, in <module>
t = pyarrow.dataset.dataset(root, partitioning=["part"], partition_base_dir=str(root))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/toaugspurger/gh/dask/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 790, in dataset
return _filesystem_dataset(source, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/toaugspurger/gh/dask/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 482, in _filesystem_dataset
return factory.finish(schema)
^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 3196, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field part has incompatible types: large_string vs string
I don't observe that failure with type=pa.string(), just large_string().
Component(s)
C++, Python