forked from jcrobak/parquet-python
Open
Description
TL;DR
Column names containing dots are mishandled when the column data is not a primitive type, e.g. List[str].
-> If the column is "foo.with.lists": [[], [], []], fastparquet wrongly tries to look up a key named foo.
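To make the suspected failure mode concrete, here is a minimal sketch. It is a hypothetical illustration, not code taken from fastparquet: the names `columns`, `lookup_flat`, and `lookup_nested` are invented, and the assumption is that somewhere the dot is treated as a nesting separator so that only the first path segment is looked up.

```python
# Hypothetical illustration only; not taken from fastparquet's source.
# A flat table whose column names happen to contain dots.
columns = {
    "foo.with.strings": ["hey", "there", None],
    "foo.with.lists": [[], [], []],
}


def lookup_flat(name):
    """Treat the full dotted name as a single flat key."""
    return columns[name]


def lookup_nested(name):
    """Treat dots as a nesting separator and look up the first segment only."""
    root = name.split(".")[0]
    return columns[root]


print(lookup_flat("foo.with.lists"))  # the flat lookup finds the column
try:
    lookup_nested("foo.with.lists")
except KeyError as exc:
    print("KeyError:", exc)  # the nested lookup fails on the key 'foo'
```

If something like the second lookup is applied to list columns but not to primitive ones, that would match the behaviour reported here.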
Describe the issue:
The unit tests below show the difference in behaviour and reproduce the bug:
from pathlib import Path

import fastparquet
import polars
import pytest


class TestReproduceGroupKeyErrorForNamesWithDots:
    def test_valid_data(self, tmp_path: Path) -> None:
        """Test that column names with dots are not causing issues when cell contents are primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")
        file = fastparquet.ParquetFile(tmp_path / "output.parquet")
        d = dict(enumerate(file.iter_row_groups(), start=1))
        assert len(d) == 1

    def test_reproduce_error(self, tmp_path: Path) -> None:
        """Test that column names with dots are causing issues when cell contents are not primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")
        file = fastparquet.ParquetFile(tmp_path / "output.parquet")
        with pytest.raises(KeyError):  # for key 'foo'
            d = dict(enumerate(file.iter_row_groups(), start=1))
            assert len(d) == 1  # not reached while the bug is present
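Until this is fixed, one possible workaround is to rename dotted columns before writing and map them back after reading. The sketch below only builds the two name mappings; `make_safe_names` and the `"__"` replacement are hypothetical choices, and I have not verified that renaming actually avoids the KeyError in fastparquet.

```python
def make_safe_names(names, replacement="__"):
    """Map dotted column names to dot-free names and build the reverse mapping.

    Hypothetical helper for illustration; the "__" replacement is arbitrary.
    """
    forward = {name: name.replace(".", replacement) for name in names}
    # Refuse to continue if two different columns would end up with the same name.
    if len(set(forward.values())) != len(forward):
        raise ValueError("renaming would make two column names collide")
    reverse = {safe: original for original, safe in forward.items()}
    return forward, reverse


forward, reverse = make_safe_names(
    ["foo.with.strings", "foo.with.ints", "foo.with.lists"]
)
print(forward["foo.with.lists"])
print(reverse["foo__with__lists"])
```

The `forward` mapping could then be passed to `df.rename(forward)` on the polars side before `write_parquet`, and `reverse` applied to the column names after reading.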
Anything else we need to know?:
This is heavily used by frictionless-py here
Environment:
- Python version: Python 3.9.18
- Operating System: macOS
- Install method (conda, pip, source): rye (pip)