arrow_schema is not compatible with list #7577

Closed
@jonathanshen-upwork

Description

Describe the bug

import datasets
f = datasets.Features({'x': list[datasets.Value(dtype='int32')]})
f.arrow_schema

Traceback (most recent call last):
  File "datasets/features/features.py", line 1826, in arrow_schema
    return pa.schema(self.type).with_metadata({"huggingface": json.dumps(hf_metadata)})
                     ^^^^^^^^^
  File "datasets/features/features.py", line 1815, in type
    return get_nested_type(self)
           ^^^^^^^^^^^^^^^^^^^^^
  File "datasets/features/features.py", line 1252, in get_nested_type
    return pa.struct(
           ^^^^^^^^^^
  File "pyarrow/types.pxi", line 5406, in pyarrow.lib.struct
  File "pyarrow/types.pxi", line 3890, in pyarrow.lib.field
  File "pyarrow/types.pxi", line 5918, in pyarrow.lib.ensure_type
TypeError: DataType expected, got <class 'list'>

The following works:

f = datasets.Features({'x': datasets.LargeList(datasets.Value(dtype='int32'))})
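One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: `list[...]` with square brackets on the built-in type produces a generic alias object, not a list instance, and calling that alias returns an empty `list`. If `get_nested_type` ends up calling the object it was given (an assumption about the library internals, not verified here), that would account for `got <class 'list'>`. `FakeValue` below is a hypothetical stand-in for `datasets.Value` so the example runs with the standard library alone.

```python
import types
from dataclasses import dataclass


@dataclass
class FakeValue:
    """Hypothetical stand-in for datasets.Value (not the real class)."""
    dtype: str


# Subscripting the built-in `list` type builds a generic alias object.
alias = list[FakeValue(dtype="int32")]

# A one-element list literal is an actual list instance.
literal = [FakeValue(dtype="int32")]

print(isinstance(alias, types.GenericAlias))  # alias is a GenericAlias
print(isinstance(alias, list))                # alias is not a list instance
print(isinstance(literal, list))              # literal is a list instance

# Calling the alias instantiates its origin type, i.e. an empty list.
called = alias()
print(type(called), called)
```

If the documentation's "Python `list`" refers to the list-literal form, the spelling to try would be `[datasets.Value(dtype='int32')]` rather than `list[datasets.Value(dtype='int32')]`; that interpretation is an assumption here.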

Expected behavior

According to the documentation:

- Python `list`, [`LargeList`] or [`Sequence`] specifies a composite feature containing a sequence of

a Python list should be a valid type specification for features.

Environment info

  • datasets version: 3.5.1
  • Platform: macOS-15.5-arm64-arm-64bit
  • Python version: 3.12.9
  • huggingface_hub version: 0.30.2
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
