Commit 6fa8fb3

Add Pandas, PyArrow and Polars docs (#7382)
* add pandas pyarrow and polars docs
* fix tests
* minor
* again
* again
* polars example
1 parent c2b7303 commit 6fa8fb3

9 files changed, +619 -25 lines changed

docs/source/_toctree.yml

+10-2
Original file line number | Diff line number | Diff line change
@@ -30,12 +30,20 @@
3030
title: Process
3131
- local: stream
3232
title: Stream
33-
- local: use_with_tensorflow
34-
title: Use with TensorFlow
3533
- local: use_with_pytorch
3634
title: Use with PyTorch
35+
- local: use_with_tensorflow
36+
title: Use with TensorFlow
37+
- local: use_with_numpy
38+
title: Use with NumPy
3739
- local: use_with_jax
3840
title: Use with JAX
41+
- local: use_with_pandas
42+
title: Use with Pandas
43+
- local: use_with_polars
44+
title: Use with Polars
45+
- local: use_with_pyarrow
46+
title: Use with PyArrow
3947
- local: use_with_spark
4048
title: Use with Spark
4149
- local: cache

docs/source/process.mdx

+73-17
@@ -630,53 +630,109 @@ Note that if no sampling probabilities are specified, the new dataset will have
630630

631631
## Format
632632

633-
The [`~Dataset.set_format`] function changes the format of a column to be compatible with some common data formats. Specify the output you'd like in the `type` parameter and the columns you want to format. Formatting is applied on-the-fly.
633+
The [`~Dataset.with_format`] function changes the format of a column to be compatible with some common data formats. Specify the output you'd like in the `type` parameter. You can also choose which columns to format using `columns=`. Formatting is applied on-the-fly.
634634

635635
For example, create PyTorch tensors by setting `type="torch"`:
636636

637637
```py
638-
>>> import torch
639-
>>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
638+
>>> dataset = dataset.with_format(type="torch")
640639
```
641640

642-
The [`~Dataset.with_format`] function also changes the format of a column, except it returns a new [`Dataset`] object:
641+
The [`~Dataset.set_format`] function also changes the format of a column, except it runs in-place:
643642

644643
```py
645-
>>> dataset = dataset.with_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
644+
>>> dataset.set_format(type="torch")
645+
```
646+
647+
If you need to reset the dataset to its original format, set the format to `None` (or use [`~Dataset.reset_format`]):
648+
649+
```py
650+
>>> dataset.format
651+
{'type': 'torch', 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False}
652+
>>> dataset = dataset.with_format(None)
653+
>>> dataset.format
654+
{'type': None, 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False}
646655
```
647656

657+
### Tensor formats
658+
659+
Several tensor or array formats are supported. It is generally recommended to use these formats instead of manually converting the outputs of a dataset to tensors or arrays, since this avoids unnecessary data copies and accelerates data loading.
660+
661+
Here is the list of supported tensor or array formats:
662+
663+
- NumPy: format name is "numpy", for more information see [Using Datasets with NumPy](use_with_numpy)
664+
- PyTorch: format name is "torch", for more information see [Using Datasets with PyTorch](use_with_pytorch)
665+
- TensorFlow: format name is "tensorflow", for more information see [Using Datasets with TensorFlow](use_with_tensorflow)
666+
- JAX: format name is "jax", for more information see [Using Datasets with JAX](use_with_jax)
667+
648668
<Tip>
649669

650-
🤗 Datasets also provides support for other common data formats such as NumPy, TensorFlow, JAX, Arrow, Pandas and Polars. Check out the [Using Datasets with TensorFlow](https://huggingface.co/docs/datasets/master/en/use_with_tensorflow#using-totfdataset) guide for more details on how to efficiently create a TensorFlow dataset.
670+
Check out the [Using Datasets with TensorFlow](use_with_tensorflow#using-totfdataset) guide for more details on how to efficiently create a TensorFlow dataset.
651671

652672
</Tip>
653673

654-
If you need to reset the dataset to its original format, use the [`~Dataset.reset_format`] function:
674+
When a dataset is formatted in a tensor or array format, all the data are formatted as tensors or arrays, except for types the target framework does not support (for example, strings in PyTorch):
655675

656-
```py
657-
>>> dataset.format
658-
{'type': 'torch', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False}
659-
>>> dataset.reset_format()
660-
>>> dataset.format
661-
{'type': 'python', 'format_kwargs': {}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
676+
```python
677+
>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "tokens": [[0, 1, 2], [3, 4, 5]]})
678+
>>> ds = ds.with_format("torch")
679+
>>> ds[0]
680+
{'text': 'foo', 'tokens': tensor([0, 1, 2])}
681+
>>> ds[:2]
682+
{'text': ['foo', 'bar'],
683+
'tokens': tensor([[0, 1, 2],
684+
[3, 4, 5]])}
685+
```
686+
687+
### Tabular formats
688+
689+
You can use a dataframe or table format to optimize data loading and data processing, since these formats generally offer zero-copy operations and transforms written in low-level languages.
690+
691+
Here is the list of supported dataframe or table formats:
692+
693+
- Pandas: format name is "pandas", for more information see [Using Datasets with Pandas](use_with_pandas)
694+
- Polars: format name is "polars", for more information see [Using Datasets with Polars](use_with_polars)
695+
- PyArrow: format name is "arrow", for more information see [Using Datasets with PyArrow](use_with_pyarrow)
696+
697+
When a dataset is formatted in a dataframe or table format, every dataset row or batch of rows is formatted as a dataframe or table, and dataset columns are formatted as a series or array:
698+
699+
```python
700+
>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})
701+
>>> ds = ds.with_format("pandas")
702+
>>> ds[:2]
703+
text label
704+
0 foo 0
705+
1 bar 1
662706
```
663707

664-
### Format transform
708+
These formats make it possible to iterate on the data faster by avoiding data copies, and also enable faster data processing in [`~Dataset.map`] or [`~Dataset.filter`]:
665709

666-
The [`~Dataset.set_transform`] function applies a custom formatting transform on-the-fly. This function replaces any previously specified format. For example, you can use this function to tokenize and pad tokens on-the-fly. Tokenization is only applied when examples are accessed:
710+
```python
711+
>>> ds = ds.map(lambda df: df.assign(upper_text=df.text.str.upper()), batched=True)
712+
>>> ds[:2]
713+
text label upper_text
714+
0 foo 0 FOO
715+
1 bar 1 BAR
716+
```
717+
718+
### Custom format transform
719+
720+
The [`~Dataset.with_transform`] function applies a custom formatting transform on-the-fly. This function replaces any previously specified format. For example, you can use this function to tokenize and pad tokens on-the-fly. Tokenization is only applied when examples are accessed:
667721

668722
```py
669723
>>> from transformers import AutoTokenizer
670724

671725
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
672726
>>> def encode(batch):
673727
... return tokenizer(batch["sentence1"], batch["sentence2"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
674-
>>> dataset.set_transform(encode)
728+
>>> dataset = dataset.with_transform(encode)
675729
>>> dataset.format
676730
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
677731
```
678732

679-
You can also use the [`~Dataset.set_transform`] function to decode formats not supported by [`Features`]. For example, the [`Audio`] feature uses [`soundfile`](https://python-soundfile.readthedocs.io/en/0.11.0/) - a fast and simple library to install - but it does not provide support for less common audio formats. Here is where you can use [`~Dataset.set_transform`] to apply a custom decoding transform on the fly. You're free to use any library you like to decode the audio files.
733+
There is also [`~Dataset.set_transform`] which does the same but runs in-place.
734+
735+
You can also use the [`~Dataset.with_transform`] function to decode formats not supported by [`Features`]. For example, the [`Audio`] feature uses [`soundfile`](https://python-soundfile.readthedocs.io/en/0.11.0/), a fast and simple library to install, but it does not provide support for less common audio formats. This is where you can use [`~Dataset.with_transform`] to apply a custom decoding transform on the fly. You're free to use any library you like to decode the audio files.
680736

681737
The example below uses the [`pydub`](http://pydub.com/) package to open an audio format not supported by `soundfile`:
682738

docs/source/use_with_jax.mdx

+1-1
@@ -108,7 +108,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
108108
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
109109
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
110110
>>> ds = Dataset.from_dict({"data": data}, features=features)
111-
>>> ds = ds.with_format("torch")
111+
>>> ds = ds.with_format("jax")
112112
>>> ds[0]
113113
{'data': Array([[1, 2],
114114
[3, 4]], dtype=int32)}

docs/source/use_with_numpy.mdx

+191
@@ -0,0 +1,191 @@
1+
# Use with NumPy
2+
3+
This document is a quick introduction to using `datasets` with NumPy, with a particular focus on how to get
4+
`numpy.ndarray` objects out of our datasets, and how to use them to train models based on NumPy such as `scikit-learn` models.
5+
6+
7+
## Dataset format
8+
9+
By default, datasets return regular Python objects: integers, floats, strings, lists, etc.
10+
11+
To get NumPy arrays instead, you can set the format of the dataset to `numpy`:
12+
13+
```py
14+
>>> from datasets import Dataset
15+
>>> data = [[1, 2], [3, 4]]
16+
>>> ds = Dataset.from_dict({"data": data})
17+
>>> ds = ds.with_format("numpy")
18+
>>> ds[0]
19+
{'data': array([1, 2])}
20+
>>> ds[:2]
21+
{'data': array([
22+
[1, 2],
23+
[3, 4]])}
24+
```
25+
26+
<Tip>
27+
28+
A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to NumPy arrays.
29+
30+
</Tip>
31+
32+
Note that the exact same procedure applies to `DatasetDict` objects, so that
33+
when setting the format of a `DatasetDict` to `numpy`, all the `Dataset`s there
34+
will be formatted as `numpy`:
35+
36+
```py
37+
>>> from datasets import DatasetDict
38+
>>> data = {"train": {"data": [[1, 2], [3, 4]]}, "test": {"data": [[5, 6], [7, 8]]}}
39+
>>> dds = DatasetDict.from_dict(data)
40+
>>> dds = dds.with_format("numpy")
41+
>>> dds["train"][:2]
42+
{'data': array([
43+
[1, 2],
44+
[3, 4]])}
45+
```
46+
47+
48+
### N-dimensional arrays
49+
50+
If your dataset consists of N-dimensional arrays, by default the nested lists are consolidated into a single array if the shape is fixed, and into an object array of arrays if the shape varies:
51+
52+
```py
53+
>>> from datasets import Dataset
54+
>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]] # fixed shape
55+
>>> ds = Dataset.from_dict({"data": data})
56+
>>> ds = ds.with_format("numpy")
57+
>>> ds[0]
58+
{'data': array([[1, 2],
59+
[3, 4]])}
60+
```
61+
62+
```py
63+
>>> from datasets import Dataset
64+
>>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]] # varying shape
65+
>>> ds = Dataset.from_dict({"data": data})
66+
>>> ds = ds.with_format("numpy")
67+
>>> ds[0]
68+
{'data': array([array([1, 2]), array([3])], dtype=object)}
69+
```
70+
71+
However this logic often requires slow shape comparisons and data copies.
72+
To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
73+
74+
```py
75+
>>> from datasets import Dataset, Features, Array2D
76+
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
77+
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
78+
>>> ds = Dataset.from_dict({"data": data}, features=features)
79+
>>> ds = ds.with_format("numpy")
80+
>>> ds[0]
81+
{'data': array([[1, 2],
82+
[3, 4]])}
83+
>>> ds[:2]
84+
{'data': array([[[1, 2],
85+
[3, 4]],
86+
87+
[[5, 6],
88+
[7, 8]]])}
89+
```
90+
91+
### Other feature types
92+
93+
[`ClassLabel`] data is properly converted to arrays:
94+
95+
```py
96+
>>> from datasets import Dataset, Features, ClassLabel
97+
>>> labels = [0, 0, 1]
98+
>>> features = Features({"label": ClassLabel(names=["negative", "positive"])})
99+
>>> ds = Dataset.from_dict({"label": labels}, features=features)
100+
>>> ds = ds.with_format("numpy")
101+
>>> ds[:3]
102+
{'label': array([0, 0, 1])}
103+
```
104+
105+
String and binary objects are unchanged, since NumPy only supports numbers.
106+
107+
The [`Image`] and [`Audio`] feature types are also supported.
108+
109+
<Tip>
110+
111+
To use the [`Image`] feature type, you'll need to install the `vision` extra as
112+
`pip install datasets[vision]`.
113+
114+
</Tip>
115+
116+
```py
117+
>>> from datasets import Dataset, Features, Image
118+
>>> images = ["path/to/image.png"] * 10
119+
>>> features = Features({"image": Image()})
120+
>>> ds = Dataset.from_dict({"image": images}, features=features)
121+
>>> ds = ds.with_format("numpy")
122+
>>> ds[0]["image"].shape
123+
(512, 512, 3)
124+
>>> ds[0]
125+
{'image': array([[[ 255, 255, 255],
126+
[ 255, 255, 255],
127+
...,
128+
[ 255, 255, 255],
129+
[ 255, 255, 255]]], dtype=uint8)}
130+
>>> ds[:2]["image"].shape
131+
(2, 512, 512, 3)
132+
>>> ds[:2]
133+
{'image': array([[[[ 255, 255, 255],
134+
[ 255, 255, 255],
135+
...,
136+
[ 255, 255, 255],
137+
[ 255, 255, 255]]]], dtype=uint8)}
138+
```
139+
140+
<Tip>
141+
142+
To use the [`Audio`] feature type, you'll need to install the `audio` extra as
143+
`pip install datasets[audio]`.
144+
145+
</Tip>
146+
147+
```py
148+
>>> from datasets import Dataset, Features, Audio
149+
>>> audio = ["path/to/audio.wav"] * 10
150+
>>> features = Features({"audio": Audio()})
151+
>>> ds = Dataset.from_dict({"audio": audio}, features=features)
152+
>>> ds = ds.with_format("numpy")
153+
>>> ds[0]["audio"]["array"]
154+
array([-0.059021 , -0.03894043, -0.00735474, ..., 0.0133667 ,
155+
0.01809692, 0.00268555], dtype=float32)
156+
>>> ds[0]["audio"]["sampling_rate"]
157+
array(44100)
158+
```
159+
160+
## Data loading
161+
162+
NumPy doesn't have any built-in data loading capabilities, so you'll either need to materialize the NumPy arrays like `X, y` to use in `scikit-learn` or use a library such as [PyTorch](https://pytorch.org/) to load your data using a `DataLoader`.
163+
164+
### Using `with_format('numpy')`
165+
166+
The easiest way to get NumPy arrays out of a dataset is to use the `with_format('numpy')` method. Let's assume
167+
that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available
168+
on the Hugging Face Hub at https://huggingface.co/datasets/mnist.
169+
170+
```py
171+
>>> from datasets import load_dataset
172+
>>> ds = load_dataset("mnist")
173+
>>> ds = ds.with_format("numpy")
174+
>>> ds["train"][0]
175+
{'image': array([[ 0, 0, 0, ...],
176+
[ 0, 0, 0, ...],
177+
...,
178+
[ 0, 0, 0, ...],
179+
[ 0, 0, 0, ...]], dtype=uint8),
180+
'label': array(5)}
181+
```
182+
183+
Once the format is set, we can feed the dataset to a NumPy-based model in batches using the `Dataset.iter()`
184+
method:
185+
186+
```py
187+
>>> for epoch in range(epochs):
188+
... for batch in ds["train"].iter(batch_size=32):
189+
... x, y = batch["image"], batch["label"]
190+
... ...
191+
```
