Parquet: reading exported parquet file #1066

dmpetrov · 2025-04-29T03:39:57Z

Description

>>> import datachain as dc
>>>
>>> ds = dc.read_parquet("example.parquet").limit(1000)
>>> ds.to_parquet("example-1000.parquet")
>>>
>>> ds2 = dc.read_parquet("example-1000.parquet")
>>> ds2.show()
Parsed by pyarrow: 0rows [00:00, ?rError while validating/converting type for column id with value file:///Users/dmitry/src/money-lion, original error Value 'file:///Users/dmitry/src/money-lion' with type <class 'str'> incompatible for column type Int64
NoneType: None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1546, in show
    df = dc.to_pandas(flatten, include_hidden=include_hidden)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1523, in to_pandas
    results = self.results(include_hidden=include_hidden)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1032, in results
    return list(self.collect_flatten(include_hidden=include_hidden))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 983, in collect_flatten
    with self._query.ordered_select(*db_signals).as_iterable() as rows:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.12/3.12.9/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 1304, in as_iterable
    query = self.apply_steps().select()
            ^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 1250, in apply_steps
    result = step.apply(
             ^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 613, in apply
    self.populate_udf_table(udf_table, query)
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 531, in populate_udf_table
    process_udf_outputs(
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 351, in process_udf_outputs
    rows.append(adjust_outputs(warehouse, row, udf_col_types))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 306, in adjust_outputs
    row[col_name] = warehouse.convert_type(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/data_storage/warehouse.py", line 152, in convert_type
    raise ve
ValueError: Value 'file:///Users/dmitry/src/money-lion' with type <class 'str'> incompatible for column type Int64

Version Info

0.14.6.dev5+g30b2d2a0
Python 3.12.9

ilongin · 2025-04-29T21:47:03Z

@dmpetrov can you show me example rows of that example.parquet if possible? I'm currently unable to reproduce and it looks like it might be related to specific data in that file

shcheklein · 2025-04-30T01:02:26Z

I hit this a while ago (most likely).

This is specific to parquet files produced by DataChain after reading another parquet.

The difference is that they now contain an ArrowRow source object plus a custom schema to deserialize it.

When we read it a second time, it's not exactly the same parquet anymore: it has more information.

The problem is that the second read_parquet tries to add the source a second time, as a second ArrowRow column. It probably breaks there.

It should be fixed (probably by not adding the source a second time, or by replacing it?).

A workaround should be to pass source=False in one or both of those calls.
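
For reference, a minimal sketch of that workaround, applied to the steps from the description. It assumes read_parquet accepts a source keyword as suggested above; the helper name reread_exported_parquet is hypothetical, and this has not been verified against the version in the report.

```python
def reread_exported_parquet(src: str, dst: str, limit: int = 1000):
    """Export the first `limit` rows of `src` to `dst`, then read `dst` back.

    Hypothetical helper: it only wraps the calls from the issue description
    and passes source=False to both reads.
    """
    # Imported lazily so the helper can be defined without DataChain installed.
    import datachain as dc

    # Assumption from the comment above: source=False keeps read_parquet
    # from attaching the source (ArrowRow) object, so the exported file
    # should not carry it either.
    ds = dc.read_parquet(src, source=False).limit(limit)
    ds.to_parquet(dst)

    # Skip the source object on the second read as well, so nothing
    # tries to add it a second time.
    return dc.read_parquet(dst, source=False)
```

Usage would then be reread_exported_parquet("example.parquet", "example-1000.parquet").show(). Setting source=False on both reads is the conservative choice; per the comment above, one of the two may be enough.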

shcheklein · 2025-04-30T01:03:12Z

@dmpetrov if that workaround works, I would consider doing this as a P2 / P3 then.

ilongin · 2025-04-30T11:15:51Z

@shcheklein you are right, that's exactly what was happening. I've already created a fix PR.

dmpetrov added the bug label Apr 29, 2025

ilongin self-assigned this Apr 29, 2025

ilongin linked a pull request Apr 30, 2025 that will close this issue

Fix for reading exported parquet #1071

Open