
Pickle error when trying to append to existing deltatable #87

Open

@Thodorissio

First of all, I would like to thank you for your awesome contributions. During development I came across the following issue.

Description

When trying to append to an existing DeltaTable, the following error occurs:

TypeError: ('Could not serialize object of type HighLevelGraph', '<ToPickle: HighLevelGraph with 3 layers.\n<dask.highlevelgraph.HighLevelGraph object at 0x1384dcc0ad0>\n 0. 1341334790528\n 1. finalize-02082eb4-e53c-4b1a-83dc-fb753d3f60dc\n 2. _commit-94f97f6c-675a-47c6-88a7-82b1a5234034\n>')
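
The failure seems to come from the fact that deltalake's Rust-backed RawDeltaTable cannot be pickled, as the full stacktrace below shows. This can be confirmed independently of Dask; a minimal sketch, assuming the ./animals table from the example below already exists:

import pickle

from deltalake import DeltaTable

# DeltaTable wraps deltalake._internal.RawDeltaTable, a Rust object with no
# pickle support, so this raises:
# TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object
pickle.dumps(DeltaTable("./animals"))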

Reproducible Example

import pandas as pd
import dask.dataframe as dd
import dask_deltatable as ddt

from distributed import Client
from deltalake import DeltaTable

output_table = "./animals"


if __name__ == "__main__":
    client = Client()
    print(f"Dask Client: {client}")

    animals_df = pd.DataFrame(
        {
            "name": ["dog", "cat", "whale", "elephant"],
            "life_span": [13, 15, 90, 70],
        },
    )

    animals_ddf = dd.from_pandas(animals_df)
    animals_ddf["high_longevity"] = animals_ddf["life_span"] > 40
    # First write: creates the Delta table (succeeds)
    ddt.to_deltalake(
        table_or_uri=output_table,
        df=animals_ddf,
        compute=True,
        mode="append",
    )

    delta_table = DeltaTable(output_table)
    delta_table_df = delta_table.to_pandas()
    print("Created DeltaTable:")
    print(delta_table_df)

    more_animals_df = pd.DataFrame(
        {
            "name": ["shark", "parrot"],
            "life_span": [20, 50],
        },
    )

    more_animals_ddf = dd.from_pandas(more_animals_df)
    more_animals_ddf["high_longevity"] = more_animals_ddf["life_span"] > 40
    # Second write: appending to the now-existing table raises the pickle error
    ddt.to_deltalake(
        table_or_uri=output_table,
        df=more_animals_ddf,
        compute=True,
        mode="append",
    )
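
One possible workaround (untested, and assuming dask's config-level scheduler setting takes precedence over the active distributed Client) is to run the append on a local scheduler, so the graph is executed in-process and never has to be pickled:

import dask

# Hypothetical workaround: the threaded scheduler runs in-process, so the
# HighLevelGraph (and the DeltaTable it embeds) is never serialized.
with dask.config.set(scheduler="threads"):
    ddt.to_deltalake(
        table_or_uri=output_table,
        df=more_animals_ddf,
        compute=True,
        mode="append",
    )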

Stacktrace

Dask Client: <Client: 'tcp://127.0.0.1:60355' processes=4 threads=12, memory=31.90 GiB>
Created DeltaTable:
       name  life_span  high_longevity
0       dog         13           False
1       cat         15           False
2     whale         90            True
3  elephant         70            True
2024-12-16 13:58:42,783 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 3 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x1e7f9b8b530>
 0. 2095838501424
 1. finalize-54be334c-3207-4a95-8908-1aac80f5edb6
 2. _commit-2c7fae99-a722-4a11-8b99-ca2120ebbb4d
>.
Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 60, in dumps
    result = pickle.dumps(x, **dump_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 65, in dumps
    pickler.dump(x)
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 77, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\cloudpickle\cloudpickle.py", line 1529, in dumps
    cp.dump(obj)
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\cloudpickle\cloudpickle.py", line 1295, in dump
    return super().dump(obj)
           ^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object
Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 60, in dumps
    result = pickle.dumps(x, **dump_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 65, in dumps
    pickler.dump(x)
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\serialize.py", line 366, in serialize
    header, frames = dumps(x, context=context) if wants_context else dumps(x)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\serialize.py", line 78, in pickle_dumps
    frames[0] = pickle.dumps(
                ^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\pickle.py", line 77, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\cloudpickle\cloudpickle.py", line 1529, in dumps
    cp.dump(obj)
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\cloudpickle\cloudpickle.py", line 1295, in dump
    return super().dump(obj)
           ^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'deltalake._internal.RawDeltaTable' object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\thodo\Documents\libra\myenv\append_issue.py", line 45, in <module>
    ddt.to_deltalake(
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\dask_deltatable\write.py", line 239, in to_deltalake
    result = result.compute()
             ^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\dask\base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\dask\base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thodo\miniconda3\envs\myenv\Lib\site-packages\distributed\protocol\serialize.py", line 392, in serialize
    raise TypeError(msg, str_x) from exc
TypeError: ('Could not serialize object of type HighLevelGraph', '<ToPickle: HighLevelGraph with 3 layers.\n<dask.highlevelgraph.HighLevelGraph object at 0x1e7f9b8b530>\n 0. 2095838501424\n 1. finalize-54be334c-3207-4a95-8908-1aac80f5edb6\n 2. _commit-2c7fae99-a722-4a11-8b99-ca2120ebbb4d\n>')
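
For context: the graph's final _commit layer appears to close over the DeltaTable created by the first write, and that object wraps the unpicklable RawDeltaTable, so both pickle and the cloudpickle fallback fail when distributed tries to ship the graph. The usual pattern for avoiding this is to put only the table URI (a plain string) in the graph and reopen the table inside the task; a minimal sketch of that pattern (the row_count helper is hypothetical, not dask_deltatable's actual internals):

from deltalake import DeltaTable
from distributed import Client

def row_count(table_uri: str) -> int:
    # Reopen the table from its URI on the worker; only the string is pickled.
    return len(DeltaTable(table_uri).to_pandas())

if __name__ == "__main__":
    client = Client()
    # Works: a str travels over the wire, the DeltaTable is built worker-side.
    print(client.submit(row_count, "./animals").result())
    # By contrast, submitting the DeltaTable itself would hit the same
    # "cannot pickle 'deltalake._internal.RawDeltaTable' object" error.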

Library Versions

dask==2024.11.2
dask-deltatable==0.3.3
deltalake==0.22.3
distributed==2024.11.2
pandas==2.2.3
