Parquet: reading exported parquet file #1066

Open
dmpetrov opened this issue Apr 29, 2025 · 4 comments · May be fixed by #1071
Assignees
Labels
bug Something isn't working

Comments

@dmpetrov
Member

Description

>>> import datachain as dc
>>>
>>> ds = dc.read_parquet("example.parquet").limit(1000)
>>> ds.to_parquet("example-1000.parquet")
>>>
>>> ds2 = dc.read_parquet("example-1000.parquet")
>>> ds2.show()
Parsed by pyarrow: 0rows [00:00, ?r
Error while validating/converting type for column id with value file:///Users/dmitry/src/money-lion, original error Value 'file:///Users/dmitry/src/money-lion' with type <class 'str'> incompatible for column type Int64
NoneType: None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1546, in show
    df = dc.to_pandas(flatten, include_hidden=include_hidden)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1523, in to_pandas
    results = self.results(include_hidden=include_hidden)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 1032, in results
    return list(self.collect_flatten(include_hidden=include_hidden))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/lib/dc/datachain.py", line 983, in collect_flatten
    with self._query.ordered_select(*db_signals).as_iterable() as rows:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.12/3.12.9/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 1304, in as_iterable
    query = self.apply_steps().select()
            ^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 1250, in apply_steps
    result = step.apply(
             ^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 613, in apply
    self.populate_udf_table(udf_table, query)
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 531, in populate_udf_table
    process_udf_outputs(
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 351, in process_udf_outputs
    rows.append(adjust_outputs(warehouse, row, udf_col_types))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/query/dataset.py", line 306, in adjust_outputs
    row[col_name] = warehouse.convert_type(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dmitry/src/datachain/src/datachain/data_storage/warehouse.py", line 152, in convert_type
    raise ve
ValueError: Value 'file:///Users/dmitry/src/money-lion' with type <class 'str'> incompatible for column type Int64

Version Info

0.14.6.dev5+g30b2d2a0
Python 3.12.9
dmpetrov added the bug (Something isn't working) label Apr 29, 2025
ilongin self-assigned this Apr 29, 2025
@ilongin
Contributor

ilongin commented Apr 29, 2025

@dmpetrov can you show me some example rows from that example.parquet, if possible? I'm currently unable to reproduce this, and it looks like it might be related to the specific data in that file.
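For anyone wanting to share such rows, here is a minimal sketch, assuming pyarrow is installed. The helper name `describe_parquet` and the default of 5 rows are made up for illustration and are not part of DataChain. It reads the whole file into memory, so it is only suitable for small files.

```python
def describe_parquet(path, n_rows=5):
    """Return the schema text and the first `n_rows` rows of a parquet file.

    Hypothetical helper for illustration; not part of DataChain.
    """
    # Imported lazily so the sketch can be loaded without pyarrow present.
    import pyarrow.parquet as pq

    # Read the file as an Arrow table, then take a small slice of rows.
    table = pq.read_table(path)
    return str(table.schema), table.slice(0, n_rows).to_pylist()
```

Running it on both example.parquet and example-1000.parquet and comparing the two schema strings might show whether the exported file carries extra columns.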

@shcheklein
Member

I most likely hit this a while ago.

This is specific to parquet files produced by DataChain after reading another parquet.

The difference is that these files contain an ArrowRow source object plus a custom schema used to deserialize it.

When we read the file a second time, it is no longer exactly the same parquet: it carries more information.

The problem is that the second read_parquet tries to add the source a second time, as a second ArrowRow column. It probably breaks there.

It should be fixed (perhaps by not adding the source a second time, or by replacing it?).

A workaround should be to pass source=False in one or both of those calls.
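A minimal sketch of that workaround, applied to the snippet from the description. The wrapper name `roundtrip_without_source` is made up for illustration. It passes source=False to both read_parquet calls. The thread does not confirm that this avoids the error, so treat it as something to try rather than a verified fix.

```python
def roundtrip_without_source(src_path, out_path, limit=1000):
    """Export the first `limit` rows to parquet and read the export back.

    Hypothetical wrapper for illustration; not part of DataChain.
    """
    # Imported lazily so the sketch can be loaded without DataChain present.
    import datachain as dc

    # source=False asks read_parquet not to attach the ArrowRow source object,
    # so the exported file should not carry the extra source information.
    ds = dc.read_parquet(src_path, source=False).limit(limit)
    ds.to_parquet(out_path)

    # source=False again, so the second read does not try to add a source column.
    return dc.read_parquet(out_path, source=False)
```

Usage would look like `ds2 = roundtrip_without_source("example.parquet", "example-1000.parquet")` followed by `ds2.show()`.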

@shcheklein
Member

@dmpetrov if that workaround works, I would treat this as a P2 / P3.

ilongin linked a pull request Apr 30, 2025 that will close this issue
@ilongin
Contributor

ilongin commented Apr 30, 2025

@shcheklein you are right, that's exactly what was happening. I've already created a fix PR.
