Commit 1990150

Update docs and docstrings
1 parent f0c1997 commit 1990150

2 files changed: +65 -136 lines changed

docs/source/python/extending_types.rst

+58 -91
@@ -68,34 +68,43 @@ message).
 See the :ref:`format_metadata_extension_types` section of the metadata
 specification for more details.
 
-Pyarrow allows you to define such extension types from Python.
-
-There are currently two ways:
-
-* Subclassing :class:`PyExtensionType`: the (de)serialization is based on pickle.
-  This is a good option for an extension type that is only used from Python.
-* Subclassing :class:`ExtensionType`: this allows to give a custom
-  Python-independent name and serialized metadata, that can potentially be
-  recognized by other (non-Python) Arrow implementations such as PySpark.
+Pyarrow allows you to define such extension types from Python by subclassing
+:class:`ExtensionType` and giving the derived class its own extension name
+and serialization mechanism. The extension name and serialized metadata
+can potentially be recognized by other (non-Python) Arrow implementations
+such as PySpark.
 
 For example, we could define a custom UUID type for 128-bit numbers which can
-be represented as ``FixedSizeBinary`` type with 16 bytes.
-Using the first approach, we create a ``UuidType`` subclass, and implement the
-``__reduce__`` method to ensure the class can be properly pickled::
+be represented as ``FixedSizeBinary`` type with 16 bytes::
 
-    class UuidType(pa.PyExtensionType):
+    class UuidType(pa.ExtensionType):
 
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.binary(16))
+            super().__init__(pa.binary(16), "my_package.uuid")
+
+        def __arrow_ext_serialize__(self):
+            # Since we don't have a parameterized type, we don't need extra
+            # metadata to be deserialized
+            return b''
 
-        def __reduce__(self):
-            return UuidType, ()
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            # Sanity checks, not required but illustrate the method signature.
+            assert storage_type == pa.binary(16)
+            assert serialized == b''
+            # Return an instance of this subclass given the serialized
+            # metadata.
+            return UuidType()
+
+The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
+define the serialization of an extension type instance. For non-parametric
+types such as the above, the serialization payload can be left empty.
 
 This can now be used to create arrays and tables holding the extension type::
 
     >>> uuid_type = UuidType()
     >>> uuid_type.extension_name
-    'arrow.py_extension_type'
+    'my_package.uuid'
     >>> uuid_type.storage_type
     FixedSizeBinaryType(fixed_size_binary[16])
 
@@ -112,8 +121,11 @@ This can now be used to create arrays and tables holding the extension type::
     ]
 
 This array can be included in RecordBatches, sent over IPC and received in
-another Python process. The custom UUID type will be preserved there, as long
-as the definition of the class is available (the type can be unpickled).
+another Python process. The receiving process must explicitly register the
+extension type for deserialization, otherwise it will fall back to the
+storage type::
+
+    >>> pa.register_extension_type(UuidType())
 
 For example, creating a RecordBatch and writing it to a stream using the
 IPC protocol::
@@ -129,43 +141,12 @@ and then reading it back yields the proper type::
     >>> with pa.ipc.open_stream(buf) as reader:
     ...     result = reader.read_all()
     >>> result.column('ext').type
-    UuidType(extension<arrow.py_extension_type>)
-
-We can define the same type using the other option::
-
-    class UuidType(pa.ExtensionType):
-
-        def __init__(self):
-            pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
-
-        def __arrow_ext_serialize__(self):
-            # since we don't have a parameterized type, we don't need extra
-            # metadata to be deserialized
-            return b''
-
-        @classmethod
-        def __arrow_ext_deserialize__(self, storage_type, serialized):
-            # return an instance of this subclass given the serialized
-            # metadata.
-            return UuidType()
-
-This is a slightly longer implementation (you need to implement the special
-methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``), and the
-extension type needs to be registered to be received through IPC (using
-:func:`register_extension_type`), but it has
-now a unique name::
-
-    >>> uuid_type = UuidType()
-    >>> uuid_type.extension_name
-    'my_package.uuid'
-
-    >>> pa.register_extension_type(uuid_type)
+    UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
 
 The receiving application doesn't need to be Python but can still recognize
-the extension type as a "uuid" type, if it has implemented its own extension
-type to receive it.
-If the type is not registered in the receiving application, it will fall back
-to the storage type.
+the extension type as a "my_package.uuid" type, if it has implemented its own
+extension type to receive it. If the type is not registered in the receiving
+application, it will fall back to the storage type.
 
 Parameterized extension type
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -187,7 +168,7 @@ of the given frequency since 1970.
             # attributes need to be set first before calling
             # super init (as that calls serialize)
             self._freq = freq
-            pa.ExtensionType.__init__(self, pa.int64(), 'my_package.period')
+            super().__init__(pa.int64(), 'my_package.period')
 
         @property
         def freq(self):
@@ -198,7 +179,7 @@ of the given frequency since 1970.
 
         @classmethod
         def __arrow_ext_deserialize__(cls, storage_type, serialized):
-            # return an instance of this subclass given the serialized
+            # Return an instance of this subclass given the serialized
             # metadata.
             serialized = serialized.decode()
             assert serialized.startswith("freq=")
@@ -209,31 +190,10 @@ Here, we ensure to store all information in the serialized metadata that is
 needed to reconstruct the instance (in the ``__arrow_ext_deserialize__`` class
 method), in this case the frequency string.
 
-Note that, once created, the data type instance is considered immutable. If,
-in the example above, the ``freq`` parameter would change after instantiation,
-the reconstruction of the type instance after IPC will be incorrect.
+Note that, once created, the data type instance is considered immutable.
 In the example above, the ``freq`` parameter is therefore stored in a private
 attribute with a public read-only property to access it.
 
-Parameterized extension types are also possible using the pickle-based type
-subclassing :class:`PyExtensionType`. The equivalent example for the period
-data type from above would look like::
-
-    class PeriodType(pa.PyExtensionType):
-
-        def __init__(self, freq):
-            self._freq = freq
-            pa.PyExtensionType.__init__(self, pa.int64())
-
-        @property
-        def freq(self):
-            return self._freq
-
-        def __reduce__(self):
-            return PeriodType, (self.freq,)
-
-Also the storage type does not need to be fixed but can be parameterized.
-
 Custom extension array class
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -252,12 +212,16 @@ the data as a 2-D Numpy array ``(N, 3)`` without any copy::
             return self.storage.flatten().to_numpy().reshape((-1, 3))
 
 
-    class Point3DType(pa.PyExtensionType):
+    class Point3DType(pa.ExtensionType):
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
+            super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
 
-        def __reduce__(self):
-            return Point3DType, ()
+        def __arrow_ext_serialize__(self):
+            return b''
+
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            return Point3DType()
 
         def __arrow_ext_class__(self):
             return Point3DArray
@@ -289,11 +253,8 @@ The additional methods in the extension class are then available to the user::
 
 
 This array can be sent over IPC, received in another Python process, and the custom
-extension array class will be preserved (as long as the definitions of the classes above
-are available).
-
-The same ``__arrow_ext_class__`` specialization can be used with custom types defined
-by subclassing :class:`ExtensionType`.
+extension array class will be preserved (as long as the receiving process registers
+the extension type using :func:`register_extension_type` before reading the IPC data).
 
 Custom scalar conversion
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,18 +265,24 @@ If you want scalars of your custom extension type to convert to a custom type wh
 For example, if we wanted the above example 3D point type to return a custom
 3D point class instead of a list, we would implement::
 
+    from collections import namedtuple
+
     Point3D = namedtuple("Point3D", ["x", "y", "z"])
 
     class Point3DScalar(pa.ExtensionScalar):
         def as_py(self) -> Point3D:
             return Point3D(*self.value.as_py())
 
-    class Point3DType(pa.PyExtensionType):
+    class Point3DType(pa.ExtensionType):
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
+            super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
 
-        def __reduce__(self):
-            return Point3DType, ()
+        def __arrow_ext_serialize__(self):
+            return b''
+
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            return Point3DType()
 
         def __arrow_ext_scalar_class__(self):
             return Point3DScalar
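
Reader's note on the documentation change above: the updated text relies on two ideas, namely that the serialized metadata must carry everything needed to rebuild a parameterized type, and that a receiver which has not registered the extension name falls back to the storage type. The sketch below is a toy model in plain Python that illustrates both. It does not use pyarrow; `PeriodTypeSketch`, `registry`, `register`, and `resolve` are hypothetical names invented for illustration, and the `freq=` payload format is inferred from the `startswith("freq=")` check visible in the diff.

```python
# Toy model only: plain Python, no pyarrow. All names are illustrative
# assumptions and are not part of the pyarrow API.


class PeriodTypeSketch:
    """Stand-in for the parameterized period type discussed in the docs."""

    extension_name = "my_package.period"

    def __init__(self, freq):
        # A private attribute plus a read-only property keeps the
        # parameter from being reassigned after construction.
        self._freq = freq

    @property
    def freq(self):
        return self._freq

    def serialize(self):
        # Everything needed to rebuild the instance goes into the payload.
        return "freq={}".format(self.freq).encode()

    @classmethod
    def deserialize(cls, storage_type, serialized):
        text = serialized.decode()
        assert text.startswith("freq=")
        return cls(text.split("=", 1)[1])


# Hypothetical receiver-side registry keyed by extension name.
registry = {}


def register(ext_type):
    registry[ext_type.extension_name] = type(ext_type)


def resolve(name, storage_type, payload):
    """Rebuild the extension type if its name is known, else fall back."""
    cls = registry.get(name)
    if cls is None:
        return storage_type
    return cls.deserialize(storage_type, payload)


payload = PeriodTypeSketch("D").serialize()

# Before registration the receiver only sees the storage type.
before = resolve("my_package.period", "int64", payload)

# After registration the parameterized type is reconstructed.
register(PeriodTypeSketch("D"))
after = resolve("my_package.period", "int64", payload)
```

In this model `before` is the plain storage type and `after` is a rebuilt `PeriodTypeSketch` whose `freq` matches the sender's. In pyarrow itself the corresponding hooks are `__arrow_ext_serialize__`, `__arrow_ext_deserialize__`, and `register_extension_type`, as shown in the diff.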

python/pyarrow/types.pxi

+7 -45
@@ -1437,7 +1437,10 @@ cdef class ExtensionType(BaseExtensionType):
     Parameters
     ----------
     storage_type : DataType
+        The underlying storage type for the extension type.
     extension_name : str
+        A unique name distinguishing this extension type. The name will be
+        used when deserializing IPC data.
 
     Examples
     --------
@@ -1679,55 +1682,14 @@ cdef class PyExtensionType(ExtensionType):
     Concrete base class for Python-defined extension types based on pickle
     for (de)serialization.
 
+    .. warning::
+       This class is deprecated and its deserialization is disabled by default.
+       :class:`ExtensionType` is recommended instead.
+
     Parameters
     ----------
     storage_type : DataType
         The storage type for which the extension is built.
-
-    Examples
-    --------
-    Define a UuidType extension type subclassing PyExtensionType:
-
-    >>> import pyarrow as pa
-    >>> class UuidType(pa.PyExtensionType):
-    ...     def __init__(self):
-    ...         pa.PyExtensionType.__init__(self, pa.binary(16))
-    ...     def __reduce__(self):
-    ...         return UuidType, ()
-    ...
-
-    Create an instance of UuidType extension type:
-
-    >>> uuid_type = UuidType() # doctest: +SKIP
-    >>> uuid_type # doctest: +SKIP
-    UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
-
-    Inspect the extension type:
-
-    >>> uuid_type.extension_name # doctest: +SKIP
-    'arrow.py_extension_type'
-    >>> uuid_type.storage_type # doctest: +SKIP
-    FixedSizeBinaryType(fixed_size_binary[16])
-
-    Wrap an array as an extension array:
-
-    >>> import uuid
-    >>> storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)],
-    ...                          pa.binary(16)) # doctest: +SKIP
-    >>> uuid_type.wrap_array(storage_array) # doctest: +SKIP
-    <pyarrow.lib.ExtensionArray object at ...>
-    [
-      ...
-    ]
-
-    Or do the same with creating an ExtensionArray:
-
-    >>> pa.ExtensionArray.from_storage(uuid_type,
-    ...                                storage_array) # doctest: +SKIP
-    <pyarrow.lib.ExtensionArray object at ...>
-    [
-      ...
-    ]
     """
 
     def __cinit__(self):
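
Reader's note on the new warning: the diff does not say why pickle-based deserialization is disabled by default. One well-known property of pickle that plausibly motivates it is that unpickling can call any importable callable named in the byte stream, so metadata received from an untrusted sender can run code in the receiving process. The sketch below uses only the standard library to show that property; the `Payload` class is a hypothetical stand-in and the expression is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle recipe is 'call eval on a string'."""

    def __reduce__(self):
        # pickle records the callable and its arguments; the callable is
        # invoked by whoever loads the bytes, not by whoever created them.
        return eval, ("6 * 7",)


data = pickle.dumps(Payload())

# Loading the bytes runs eval("6 * 7") in the loading process and returns
# its result instead of a Payload instance.
result = pickle.loads(data)
print(result)  # prints 42
```

An explicit `__arrow_ext_deserialize__` avoids this class of problem because the receiver only parses bytes with code it wrote and registered itself.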
