Maybe fix column selection #1979
|
I'm looking into it. All examples from #1973 work now! Let me try running our CI in different repos vs this branch |
|
Sorry, I think I messed up something. The test from the issue still fails:
uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow

import random
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file
def test(n, path):
    flat = pa.array([random.random() for _ in range(n)])
    nested = pa.array([{"a": random.random(), "b": random.random()} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path)
    with open_parquet_file(path, columns=["nested.a"], engine="pyarrow") as fh:
        _ = pq.read_table(fh)
# works for 10 rows
test(10, "/tmp/ten.parquet")
# fails for 100k rows
test(100_000, "100k.parquet") |
|
I added the test https://github.com/fsspec/filesystem_spec/pull/1979/files#diff-ff7fd767388891014a980915bb4fb6a84233cd96324d59e80ce9d2db18577791R208-R220 that is supposed to be identical. I wonder what the difference is. |
|
(sorry, this link: filesystem_spec/fsspec/tests/test_parquet.py, line 208 in 8ba3aae) |
|
Sorry, it was my mistake with the code: |
|
Could you please try with the files I shared in #1973? It still fails for me with double-nested columns (e.g. "spectrum.flux" is a list-array itself). |
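A rough sketch of why a list-typed field such as "spectrum.flux" can behave differently from a plain struct field such as "nested.a": in Parquet, each leaf column is addressed by a dotted path, and a list adds extra path components below the field name, so an exact-match lookup on the requested name finds nothing. The function name `select_leaf_columns` and the leaf paths below are hypothetical and only illustrate prefix matching; this is not the code in fsspec or in this PR.

```python
def select_leaf_columns(leaf_paths, requested):
    """Return the leaf paths covered by the requested dotted column names.

    A requested name matches a leaf if it equals the leaf path or is a
    dotted prefix of it, so "nested" selects "nested.a" and "nested.b".
    """
    selected = []
    for leaf in leaf_paths:
        for name in requested:
            if leaf == name or leaf.startswith(name + "."):
                selected.append(leaf)
                break
    return selected


# Hypothetical leaf paths; the list element naming is illustrative only.
leaves = [
    "flat",
    "nested.a",
    "nested.b",
    "spectrum.flux.list.element",
    "spectrum.wavelength.list.element",
]

print(select_leaf_columns(leaves, ["nested.a"]))
print(select_leaf_columns(leaves, ["spectrum.flux"]))
```

Matching on `name + "."` rather than a bare string prefix avoids selecting unrelated columns that merely share leading characters.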
|
The test you introduced is a little bit different from the code in my original issue: you reuse the same If I change your test with either 1)
Reproducible code
uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow pandas

import os
import random
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from fsspec.parquet import open_parquet_file
def test_nested(n, tmpdir, engine):
    path = os.path.join(str(tmpdir), "test.parquet")
    import pyarrow as pa
    flat = pa.array([random.random() for _ in range(n)])
    a = random.random()
    b = random.random()
    nested = pa.array([{"a": a, "b": b} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path, use_dictionary=False, compression=None)
    with open_parquet_file(path, columns=["nested.a"], engine=engine) as fh:
        col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
    name = "a" if engine == "pyarrow" else "nested.a"
    assert (col[name] == a).all()
test_nested(1_000_000, '/tmp', 'pyarrow') |
That's interesting... |
|
@hombit, are you using the new pandas 3.0? It has broken all kinds of stuff related to parquet. |
|
I will merge this when it passes to unblock other PRs that are waiting and blocked on pandas 3, and then try to make a new fix later. |
|
@martindurant sure! I did use pandas==3, but it fails the same way with |
@hombit, can you test, please? I fear this may be over-reading bytes, but at least all the tests pass, including your specific case.
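One possible way to put a number on the over-reading concern, sketched under assumptions: given the byte ranges the selected columns actually need and the ranges that were fetched, compare the totals. The helpers `merge_ranges`, `total_bytes`, and `overread_ratio` are hypothetical names for illustration, ranges are half-open `(start, end)` pairs, and the numbers are made up; nothing here is fsspec API.

```python
def merge_ranges(ranges, max_gap=0):
    """Merge (start, end) byte ranges whose gap is at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Overlapping or close enough: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_bytes(ranges):
    """Count bytes covered by the ranges, counting overlaps only once."""
    return sum(end - start for start, end in merge_ranges(ranges))


def overread_ratio(needed, fetched):
    """Ratio of bytes fetched to bytes strictly needed (1.0 means no over-read)."""
    return total_bytes(fetched) / total_bytes(needed)


# Made-up example: two 100-byte column chunks needed, one 1000-byte read issued.
needed = [(100, 200), (500, 600)]
fetched = [(0, 1000)]
print(overread_ratio(needed, fetched))
```

A ratio close to 1.0 would suggest the selection is tight; a much larger value would support the suspicion that extra bytes are being read.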