Open
Description
Describe the issue:
Hello,
I have a peculiar field type in my parquet file:
List of Lists of strings.
For example:
0 []
1 [["hello"]]
2 [["hello", "bye"]]
3 [["hello"], ["bye"]]
...
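For reference, here is a minimal sketch of the same rows written as plain Python values. The helper `is_list_of_lists_of_str` is a hypothetical name used only for illustration and is not part of pandas, pyarrow or fastparquet.

```python
# Illustrative only: the example rows above as plain Python values.
rows = [
    [],
    [["hello"]],
    [["hello", "bye"]],
    [["hello"], ["bye"]],
]


def is_list_of_lists_of_str(value):
    """Return True if value is a list whose items are all lists of strings.

    Hypothetical helper for illustration; an empty outer list also counts.
    """
    return isinstance(value, list) and all(
        isinstance(inner, list) and all(isinstance(s, str) for s in inner)
        for inner in value
    )


print([is_list_of_lists_of_str(r) for r in rows])  # [True, True, True, True]
```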
I found that pyarrow (the default pandas engine) loads these fine, while fastparquet silently converts them to missing values (None in the output below).
Minimal Complete Verifiable Example:
import sys
import numpy as np
import pandas as pd
import fastparquet  # needed for fastparquet.__version__ below
from fastparquet import ParquetFile

# Report the versions in use
print(f"Python {sys.version}")
print(f"Numpy {np.__version__}")
print(f"Fastparquet {fastparquet.__version__}")
print(f"Pandas {pd.__version__}")

# One row holding a list of lists of strings
data = {
    "texts": [[["Message1", "Message2"]]],
}
df = pd.DataFrame(data)
df.to_parquet('test.parquet')  # written with the default pandas engine

# Read the file back with fastparquet
pf = ParquetFile('test.parquet')
df_fast = pf.to_pandas()
print("fastparquet:")
print(df_fast)

# Read the same file back with the default pandas engine (pyarrow)
df_pandas = pd.read_parquet('test.parquet')
print("pyarrow:")
print(df_pandas)
prints out:
Python 3.9.18 (main, Oct 3 2023, 01:30:02)
[Clang 17.0.1 ]
Numpy 1.25.2
Fastparquet 2023.10.1
Pandas 1.5.3
fastparquet:
texts
0 None
pyarrow:
texts
0 [[Message1, Message2]]
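Because the conversion is silent, a quick check like the one below can flag affected rows. This is only a sketch: `is_missing` and `silently_nulled_rows` are hypothetical helper names, and the two lists stand in for the column values returned by each engine in the output above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (assumption: no other sentinels)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def silently_nulled_rows(expected, actual):
    """Return row indices where `expected` has a value but `actual` is missing."""
    return [
        i
        for i, (exp, act) in enumerate(zip(expected, actual))
        if not is_missing(exp) and is_missing(act)
    ]


# Stand-ins for the column values shown in the output above.
expected = [[["Message1", "Message2"]]]  # as loaded by pyarrow
actual = [None]                          # as loaded by fastparquet

print(silently_nulled_rows(expected, actual))  # [0]
```

With the data frames from the example, the same check would be `silently_nulled_rows(df_pandas["texts"].tolist(), df_fast["texts"].tolist())`.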
Environment:
Fastparquet version: 2023.10.1