Maybe fix column selection by martindurant · Pull Request #1979 · fsspec/filesystem_spec

martindurant · 2026-01-23T15:26:49Z

@hombit, can you test, please? I fear this may be over-reading bytes, but at least all the tests pass, including your specific case.

hombit · 2026-01-23T15:47:59Z

I'm looking into it. All examples from #1973 work now! Let me try running our CI in a few other repos against this branch.

hombit · 2026-01-23T16:06:05Z

Sorry, I think I messed up something. The test from the issue still fails:

uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow

import random
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

def test(n, path):
    flat = pa.array([random.random() for _ in range(n)])
    nested = pa.array([{"a": random.random(), "b": random.random()} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path)
    with open_parquet_file(path, columns=["nested.a"], engine="pyarrow") as fh:
        _ = pq.read_table(fh)

# works for 10 rows
test(10, "/tmp/ten.parquet")

# fails for 100k rows
test(100_000, "100k.parquet")

OSError: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

martindurant · 2026-01-23T16:09:52Z

I added the test https://github.com/fsspec/filesystem_spec/pull/1979/files#diff-ff7fd767388891014a980915bb4fb6a84233cd96324d59e80ce9d2db18577791R208-R220 that is supposed to be identical. I wonder what the difference is.

martindurant · 2026-01-23T17:03:20Z

(Sorry, this link: filesystem_spec/fsspec/tests/test_parquet.py, line 208 in 8ba3aae: def test_nested(n, tmpdir, engine):)

hombit · 2026-01-23T18:36:45Z

Sorry, it was my mistake in the code: my pq.read_table(fh) call doesn't pass columns=columns. It passes when I add it. I'm still testing with other repositories, give me some time.

hombit · 2026-01-23T18:53:16Z

Could you please try with the files I shared in #1973? It still fails for me with double-nested columns (e.g. "spectrum.flux" is a list-array itself).

hombit · 2026-01-23T20:50:32Z

The test you introduced is a little bit different from the code in my original issue: you reuse the same a and b values for every row of the nested column, which makes the column very small on disk because of encoding and compression.

If I change your test to use either 1) {"a": random.random(), "b": random.random()}, or 2) pq.write_table(table, path, use_dictionary=False, compression=None), it fails with the same "Couldn't deserialize thrift" error.
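To illustrate the size effect only (this is not the actual Parquet write path): a minimal standard-library sketch, assuming zlib as a stand-in for the writer's compression and a plain 8-byte little-endian layout for the doubles. The helper name packed_doubles and the variable names are made up for this illustration.

```python
import random
import struct
import zlib


def packed_doubles(values):
    """Pack floats as consecutive little-endian 8-byte doubles."""
    return struct.pack(f"<{len(values)}d", *values)


n = 100_000
random.seed(0)

# One value repeated n times, like the test that reuses a and b.
constant = [random.random()] * n
# A fresh random value per row, like the code in the original issue.
varied = [random.random() for _ in range(n)]

raw_size = 8 * n  # uncompressed size of either column, in bytes
constant_size = len(zlib.compress(packed_doubles(constant)))
varied_size = len(zlib.compress(packed_doubles(varied)))

print("raw bytes:           ", raw_size)
print("constant, compressed:", constant_size)
print("varied, compressed:  ", varied_size)
```

The repeated-value column shrinks to a tiny fraction of its raw size, while the random column barely compresses at all. So a test built on constant values produces a much smaller column chunk and may never exercise the byte ranges that fail on realistic data.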

Reproducible code

uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow pandas

import os
import random

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from fsspec.parquet import open_parquet_file

def test_nested(n, tmpdir, engine):
    path = os.path.join(str(tmpdir), "test.parquet")
    import pyarrow as pa
    flat = pa.array([random.random() for _ in range(n)])
    a = random.random()
    b = random.random()
    nested = pa.array([{"a": a, "b": b} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path, use_dictionary=False, compression=None)
    with open_parquet_file(path, columns=["nested.a"], engine=engine) as fh:
        col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
    name = "a" if engine == "pyarrow" else "nested.a"
    assert (col[name] == a).all()
test_nested(1_000_000, '/tmp', 'pyarrow')

Traceback (most recent call last):
  File "<python-input-1>", line 22, in <module>
    test_nested(1_000_000, '/tmp', 'pyarrow')
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python-input-1>", line 19, in test_nested
    col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
  File "/private/tmp/venv/lib/python3.14/site-packages/pandas/io/parquet.py", line 671, in read_parquet
    return impl.read(
           ~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/private/tmp/venv/lib/python3.14/site-packages/pandas/io/parquet.py", line 260, in read
    pa_table = self.api.parquet.read_table(
        path_or_handle,
    ...<3 lines>...
        **kwargs,
    )
  File "/private/tmp/venv/lib/python3.14/site-packages/pyarrow/parquet/core.py", line 1926, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        use_pandas_metadata=use_pandas_metadata)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/venv/lib/python3.14/site-packages/pyarrow/parquet/core.py", line 1552, in read
    table = self._dataset.to_table(
        columns=columns, filter=self._filter_expression,
        use_threads=use_threads
    )
  File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3969, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

martindurant · 2026-01-25T20:36:59Z

> pq.write_table(table, path, use_dictionary=False, compression=None)

That's interesting...

martindurant · 2026-01-25T20:50:51Z

@hombit, are you using the new pandas 3.0? It has broken all kinds of stuff related to parquet.

martindurant · 2026-01-25T20:59:05Z

I will merge this once it passes, to unblock other PRs that are blocked on pandas 3, and then try to make a new fix later.

hombit · 2026-01-25T21:27:52Z

@martindurant sure! I did use pandas==3, but it fails the same way with pandas<3. I also tested it with this PR; there we also have pandas<3.

martindurant added 5 commits January 23, 2026 10:16

Maybe fix column selection (a28d906)
simplify (b9fc3b7)
more test (4a2ad9f)
remove debug (bd34c97)
Avoid pandas 3 (for now) (8ba3aae)

hombit mentioned this pull request Jan 23, 2026

DONT MERGE: test fsspec PR 1982 lincc-frameworks/nested-pandas#443

Closed

fastparquet skip pandas 3 (e249e04)

martindurant merged commit 6e11963 into fsspec:master Jan 25, 2026
10 checks passed

martindurant deleted the parquet-nested branch January 25, 2026 21:05
