Handling numpy ndarray or tensor objects with at least 1 dimension having variable size #48099

RayZ0rr · 2025-11-11T17:22:59Z

I want to store objects such as numpy ndarrays or pytorch tensors in which at least 1 dimension varies in size. For example, consider a list of 2D point clouds. Each point cloud has shape (N, 2), where N can differ from one point cloud to the next.

pyarrow.FixedShapeTensorType doesn't work for this use case. The VariableShapeTensor implementations 1 and 2 have not been merged yet. While waiting for those merges, I have implemented the following approach, which aims for zero-copy retrieval of the original list of variable-shape tensors from the pyarrow table.

For each variable-shape tensor, keep two columns: one of type pyarrow.ListType whose child type matches the dtype of the tensor, and another of type pyarrow.ListType with int32 as the child type.
Take, for example, the first column as "points_val" and the other as "points_shape". Each element of "points_val" is the flattened list of values of a single tensor (view(-1) or reshape(-1)). Each element of "points_shape" holds the shape of that tensor.
Using the following function, we can get the list of original variable-shape tensors back. There is a more efficient way to do this if the full tensor fits in memory.

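import pyarrow as pa

def getTensors(table: pa.Table):
    # Each row holds one flattened tensor and, in the other column, its shape.
    vals = table["points_val"]
    shapes = table["points_shape"]
    out = []
    for i in range(len(vals)):
        # .values on a list scalar is the child array; to_numpy() defaults to
        # zero_copy_only=True, so it raises instead of silently copying.
        data_np = vals[i].values.to_numpy()
        dims_np = shapes[i].values.to_numpy()
        # Reshaping a contiguous 1D array returns a view on the same memory.
        out.append(data_np.reshape(tuple(dims_np.tolist())))
    return out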

Does anyone know of a better way or think this is not zero-copy?

rok · 2025-11-12T14:51:38Z

Collaborator

Hey @RayZ0rr ! Nice to see there's interest in using arrow for this.

For each variable-shape tensor, keep two columns: one of type pyarrow.ListType whose child type matches the dtype of the tensor, and another of type pyarrow.ListType with int32 as the child type.

In VariableShapeTensor we specify that shapes are stored in pyarrow.ListType[n] (a fixed-size list), where n=2 in your case. From your snippet I can't tell whether you use pyarrow.ListType[n] or pyarrow.ListType.

Using the following function, we can get the list of original variable-shape tensors back. There is a more efficient way to do this if the full tensor fits in memory.

In the VariableShapeTensor Python PR we propose from_numpy_ndarray, which does the reverse of what you want to do. For your case I would check to make sure no copies occur in the reshape. I would also create and use dims_np like so, to avoid copying and pure-Python iteration:

dims_np = pa.array(shapes[i], pa.list_(pa.int32(), 2))[0].values.to_numpy()
o = data_np.reshape(dims_np)

1 reply

RayZ0rr Nov 13, 2025
Author

Hey @rok ,
I use pyarrow.ListType instead of pyarrow.ListType[n] because I don't have to carry around the information of n when I'm saving or loading the data. It's not anything complicated at all but maybe one or two lines of code less. Does this have any other bad side effects?

dims_np = pa.array(shapes[i], pa.list_(pa.int32(), 2))[0].values.to_numpy()
o = data_np.reshape(dims_np)

In creating dims_np, can I not use dims_np = shapes[i].values.to_numpy() instead of creating an Array and doing the same for the first element?

rok · 2025-11-13T13:10:34Z

Collaborator

I use pyarrow.ListType instead of pyarrow.ListType[n] because I don't have to carry around the information of n when I'm saving or loading the data. It's not anything complicated at all but maybe one or two lines of code less. Does this have any other bad side effects?

FixedSizeListArray is more memory efficient (it doesn't require an offsets buffer like the ListArray does), and we use FixedSizeListArray in the VariableShapeTensorArray specification for storing shapes. So if you switch to VariableShapeTensorArray at some point, you might want to use the same memory layout.
Since your shapes will probably be small relative to your values array, optimizing this probably won't matter much, though.

In creating dims_np, can I not use dims_np = shapes[i].values.to_numpy() instead of creating an Array and doing the same for the first element?

Sorry, my example was not great. dims_np = shapes[i].values.to_numpy() is definitely better and should be zero-copy.
