# Use with NumPy

This document is a quick introduction to using `datasets` with NumPy, with a particular focus on how to get
`numpy.ndarray` objects out of our datasets, and how to use them to train NumPy-based models such as `scikit-learn` estimators.

## Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get NumPy arrays instead, you can set the format of the dataset to `numpy`:

```py
>>> from datasets import Dataset
>>> data = [[1, 2], [3, 4]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("numpy")
>>> ds[0]
{'data': array([1, 2])}
>>> ds[:2]
{'data': array([[1, 2],
       [3, 4]])}
```

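If you want to check which format is currently applied, or go back to plain Python objects, you can use `Dataset.format` and `Dataset.reset_format`. A minimal sketch (note that `reset_format` works in place):

```py
>>> ds.format["type"]  # inspect the currently applied format
'numpy'
>>> ds.reset_format()  # in-place: back to plain Python objects
>>> ds[0]
{'data': [1, 2]}
>>> ds = ds.with_format("numpy")  # re-apply the NumPy format for the rest of this guide
```
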
<Tip>

A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to NumPy arrays.

</Tip>

The same procedure applies to `DatasetDict` objects: once you set the format of a `DatasetDict` to `numpy`, every `Dataset` it contains is formatted as `numpy`:

```py
>>> from datasets import Dataset, DatasetDict
>>> data = {"train": {"data": [[1, 2], [3, 4]]}, "test": {"data": [[5, 6], [7, 8]]}}
>>> dds = DatasetDict({split: Dataset.from_dict(d) for split, d in data.items()})
>>> dds = dds.with_format("numpy")
>>> dds["train"][:2]
{'data': array([[1, 2],
       [3, 4]])}
```

### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are returned as a single multidimensional array if their shape is fixed:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # fixed shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("numpy")
>>> ds[0]
{'data': array([[1, 2],
       [3, 4]])}
```

```py
>>> from datasets import Dataset
>>> data = [[[1, 2], [3]], [[4, 5, 6], [7, 8]]]  # varying shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("numpy")
>>> ds[0]
{'data': array([array([1, 2]), array([3])], dtype=object)}
```

However, this logic often requires slow shape comparisons and data copies.
To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your arrays:

```py
>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("numpy")
>>> ds[0]
{'data': array([[1, 2],
       [3, 4]])}
>>> ds[:2]
{'data': array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])}
```

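The same idea extends to higher dimensions with the [`Array3D`], [`Array4D`] and [`Array5D`] feature types. A minimal sketch (the toy data below is our own example, not from the library):

```py
>>> from datasets import Dataset, Features, Array3D
>>> data = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]  # one example of shape (2, 2, 2)
>>> features = Features({"data": Array3D(shape=(2, 2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("numpy")
>>> ds[0]["data"].shape
(2, 2, 2)
```
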
### Other feature types

[`ClassLabel`] data is properly converted to arrays:

```py
>>> from datasets import Dataset, Features, ClassLabel
>>> labels = [0, 0, 1]
>>> features = Features({"label": ClassLabel(names=["negative", "positive"])})
>>> ds = Dataset.from_dict({"label": labels}, features=features)
>>> ds = ds.with_format("numpy")
>>> ds[:3]
{'label': array([0, 0, 1])}
```

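The arrays only contain the integer label ids; if you need the human-readable names back, the [`ClassLabel`] feature keeps the mapping, for example via `ClassLabel.int2str` (a short sketch):

```py
>>> ds.features["label"].int2str([0, 0, 1])
['negative', 'negative', 'positive']
```
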
String and binary objects are left unchanged, since NumPy arrays are primarily meant for numerical data.

The [`Image`] and [`Audio`] feature types are also supported.

<Tip>

To use the [`Image`] feature type, you'll need to install the `vision` extra as
`pip install datasets[vision]`.

</Tip>

```py
>>> from datasets import Dataset, Features, Image
>>> images = ["path/to/image.png"] * 10
>>> features = Features({"image": Image()})
>>> ds = Dataset.from_dict({"image": images}, features=features)
>>> ds = ds.with_format("numpy")
>>> ds[0]["image"].shape
(512, 512, 3)
>>> ds[0]
{'image': array([[[255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)}
>>> ds[:2]["image"].shape
(2, 512, 512, 3)
>>> ds[:2]
{'image': array([[[[255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255]]]], dtype=uint8)}
```

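Since the decoded images come back as plain `uint8` arrays, any NumPy preprocessing applies directly. For example, a common preprocessing step (our own choice here, not something `datasets` does for you) is scaling the pixel values to `[0, 1]`:

```py
>>> batch = ds[:2]
>>> images = batch["image"].astype("float32") / 255.0  # scale pixel values from [0, 255] to [0, 1]
>>> images.shape
(2, 512, 512, 3)
>>> images.dtype
dtype('float32')
```
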
<Tip>

To use the [`Audio`] feature type, you'll need to install the `audio` extra as
`pip install datasets[audio]`.

</Tip>

```py
>>> from datasets import Dataset, Features, Audio
>>> audio = ["path/to/audio.wav"] * 10
>>> features = Features({"audio": Audio()})
>>> ds = Dataset.from_dict({"audio": audio}, features=features)
>>> ds = ds.with_format("numpy")
>>> ds[0]["audio"]["array"]
array([-0.059021  , -0.03894043, -0.00735474, ...,  0.0133667 ,
        0.01809692,  0.00268555], dtype=float32)
>>> ds[0]["audio"]["sampling_rate"]
array(44100)
```
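
If your model expects a specific sampling rate, you can have the audio decoded at that rate by casting the column with [`Dataset.cast_column`]. A short sketch (16 kHz is just an example rate):

```py
>>> from datasets import Audio
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000)).with_format("numpy")  # resample when decoding
>>> ds[0]["audio"]["sampling_rate"]
array(16000)
```
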

## Data loading

NumPy doesn't have any built-in data loading capabilities, so to train a model you'll either need to materialize the NumPy arrays (e.g. the `X`, `y` pair expected by `scikit-learn`, as sketched at the end of this section) or use a library such as [PyTorch](https://pytorch.org/) to load your data with a `DataLoader`.

### Using `with_format('numpy')`

The easiest way to get NumPy arrays out of a dataset is to use the `with_format('numpy')` method. Let's assume
that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available
on the Hugging Face Hub at https://huggingface.co/datasets/mnist.

```py
>>> from datasets import load_dataset
>>> ds = load_dataset("mnist")
>>> ds = ds.with_format("numpy")
>>> ds["train"][0]
{'image': array([[0, 0, 0, ...],
        [0, 0, 0, ...],
        ...,
        [0, 0, 0, ...],
        [0, 0, 0, ...]], dtype=uint8),
 'label': array(5)}
```

Once the format is set, we can feed the dataset to the NumPy-based model in batches using the `Dataset.iter()`
method:

```py
>>> for epoch in range(epochs):
...     for batch in ds["train"].iter(batch_size=32):
...         x, y = batch["image"], batch["label"]
...         ...
```
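
If you are training a `scikit-learn` estimator instead of a neural network, you can simply materialize entire splits as NumPy arrays and pass them to the estimator. The following is a minimal sketch: it assumes `scikit-learn` is installed, and flattening the 28x28 images into 784-dimensional vectors is our own preprocessing choice, not something `datasets` does for you.

```py
>>> from sklearn.linear_model import LogisticRegression
>>> train, test = ds["train"], ds["test"]
>>> X_train = train["image"].reshape(len(train), -1)  # flatten each 28x28 image into a 784-pixel vector
>>> y_train = train["label"]
>>> X_test = test["image"].reshape(len(test), -1)
>>> y_test = test["label"]
>>> clf = LogisticRegression(max_iter=1000)  # may still warn about convergence; this is only a sketch
>>> clf.fit(X_train, y_train)
>>> clf.score(X_test, y_test)
```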