Improvement of CLI array interface for ND cases #1887

Open
BCSharp opened this issue Jan 28, 2025 · 0 comments

BCSharp commented Jan 28, 2025

The Problem

The current interface of CLI arrays (beyond what .NET provides directly) is, in my opinion, somewhat unfortunate. It is not wrong: it simply follows the pattern of other builtin collections, like list, tuple, etc. However, all those other types are 1-dimensional, so for them the API makes a lot of sense. It does not scale well to more dimensions. This can be seen with memoryview, the only builtin Python type that has any notion of multidimensionality. It too follows the list pattern, and when operations are applied to an N-dimensional (ND) memoryview, it is either implicitly flattened or raises NotImplementedError.

To be more specific, I will use the two cases from the list pattern that the CLI array currently implements: the addition and multiplication operators.

With list etc. addition is concatenation:

>>> [1, 2] + [3, 4]
[1, 2, 3, 4]

This is very useful for 1D structures, but does not scale up to higher dimensions: even if operator + were to mean concatenation, along which dimension should it operate? There is no way to pass an extra parameter indicating the dimension to use. It could be assumed to be the last dimension, but that leaves a big gap in the API for concatenating along an arbitrary dimension.
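
To make the ambiguity concrete, here is a minimal sketch that uses plain nested lists as a stand-in for a 2D array (this is only an illustration, not CLI array code). The same two operands can be joined in two equally plausible ways, and a binary + has no place to say which one is meant:

```python
# Two 2x2 "arrays" modelled as nested lists (illustration only).
a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]

# Concatenation along the first dimension: the rows of b follow the rows of a.
along_first = a + b

# Concatenation along the last dimension: each row of a is extended by the
# matching row of b.
along_last = [row_a + row_b for row_a, row_b in zip(a, b)]

print(along_first)  # [[1, 2], [3, 4], [10, 20], [30, 40]]
print(along_last)   # [[1, 2, 10, 20], [3, 4, 30, 40]]
```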

With list etc. multiplication is repeated concatenation with itself:

>>> (1, 2) * 3
(1, 2, 1, 2, 1, 2)
>>> "ab" * 3
'ababab'

This is occasionally useful, but again it does not scale up to higher dimensions.

The next one to consider is slicing. Slicing can be seen as a specific way of indexing, except that it retrieves a substructure rather than an individual element. The question is how slicing would work on an ND array (currently it is not supported). The only guidance from Python itself would be the behaviour of memoryview as the only builtin ND structure. However, beyond simple element retrieval, memoryview does not support much of anything.

>>> m = memoryview(b"abcd")
>>> list(m)
[97, 98, 99, 100]
>>> m2 = m.cast('b', (2, 2))
>>> list(m2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: multi-dimensional sub-views are not implemented
>>> m[0]
97
>>> m2[0, 0]
97
>>> m2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: multi-dimensional sub-views are not implemented
>>> list(m[:1])
[97]
>>> list(m2[:1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: multi-dimensional sub-views are not implemented
>>> list(m2[:1,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: invalid slice key

So memoryview does not offer much guidance on what would be a Pythonic API for ND arrays.

The Proposal

But there is another reference point for ND arrays that is by now practically accepted in the Python community as the gold standard: NumPy. Python lacks the batteries to support multidimensional arrays, but in practice NumPy can now be considered an extension of CPython, which is one of the reasons it is so hard to support NumPy in IronPython. There is even a special language element in Python introduced specifically to support NumPy's ndarray: ... (or Ellipsis). IronPython may not support NumPy directly (see Ironclad), but it does come with a fundamental data structure that looks a lot like NumPy's ndarray, and that is the CLI array: it is contiguous in memory, its elements are typed, and it supports multiple dimensions.

The proposal is to adopt NumPy's API pattern for IronPython's extended (Pythonized) support for CLI arrays. CLI arrays may not replace a genuine numpy.ndarray (for one thing, they are limited to slightly less than 2 GiB in size, which is a limitation of .NET), but surely they can do a lot, and a lot more than they do now. Perhaps even enough to let other packages that use NumPy, e.g. pandas, run on IronPython. Also, 2 GiB is still a lot of data; it should be enough to fit an Excel spreadsheet.

There is one aspect that CLI arrays have that NumPy doesn't have: non-zero based arrays. Since this is such a niche feature, I think it is sufficient to limit operations to arrays of compatible (i.e. the same) base.

Examples

Here are some examples of how the operations are defined on ndarray and would henceforth be applicable to System.Array as well:

Addition is element-wise. When adding arrays with different dimensions, the missing dimensions are completed with a broadcast from lower dimensions.

>>> import numpy as np
>>> np.array([1, 2], dtype=int) + 2
array([3, 4])
>>> np.array([1, 2], dtype=int) + np.array([3, 4], dtype=int)
array([4, 6])
>>> np.array([[1, 2], [3, 4]], dtype=int) + np.array([10, 20], dtype=int)
array([[11, 22],
       [13, 24]])

Multiplication is, like addition, element-wise and follows the same rules.
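
As a rough model of what "element-wise with broadcast" means, here is a pure-Python sketch over nested lists. The helper names depth and elementwise are hypothetical, and the sketch assumes non-empty rectangular lists. It covers only the cases shown above (a scalar or a lower-dimensional operand repeated across the leading dimensions), not NumPy's full broadcasting rules such as stretching dimensions of size 1:

```python
import operator


def depth(x):
    """Nesting depth of a non-empty rectangular nested list (0 for a scalar)."""
    return 1 + depth(x[0]) if isinstance(x, list) else 0


def elementwise(op, a, b):
    """Apply op element-wise; the operand with fewer dimensions is reused
    for every entry along the other operand's leading dimension."""
    da, db = depth(a), depth(b)
    if da == 0 and db == 0:
        return op(a, b)
    if da > db:
        return [elementwise(op, x, b) for x in a]
    if da < db:
        return [elementwise(op, a, y) for y in b]
    if len(a) != len(b):
        raise ValueError("shapes do not match")
    return [elementwise(op, x, y) for x, y in zip(a, b)]


scaled = elementwise(operator.mul, [1, 2], 3)
rows_times_vector = elementwise(operator.mul, [[1, 2], [3, 4]], [10, 20])
print(scaled)             # [3, 6]
print(rows_times_vector)  # [[10, 40], [30, 80]]
```

Note the contrast with the list pattern, where [1, 2] * 3 gives [1, 2, 1, 2, 1, 2].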

Array concatenation is done by a function call. The most versatile is concatenate, but there are a few more that make assumptions about the dimension along which arrays should be concatenated (e.g. vstack, hstack).

>>> np.concatenate((np.array([1, 2], dtype=int), np.array([3, 4], dtype=int)))
array([1, 2, 3, 4])
>>> np.concatenate((np.array([[1, 2], [3, 4]], dtype=int), np.array([[10, 20], [30, 40]], dtype=int)))
array([[ 1,  2],
       [ 3,  4],
       [10, 20],
       [30, 40]])
>>> np.concatenate((np.array([[1, 2], [3, 4]], dtype=int), np.array([[10, 20], [30, 40]], dtype=int)), axis=1)
array([[ 1,  2, 10, 20],
       [ 3,  4, 30, 40]])

Repeated concatenation of an array with itself is done with tile:

>>> np.tile(np.array([1, 2], dtype=int), 3)
array([1, 2, 1, 2, 1, 2])
>>> np.tile(np.array([[1, 2], [3, 4]], dtype=int), 3)
array([[1, 2, 1, 2, 1, 2],
       [3, 4, 3, 4, 3, 4]])
>>> np.tile(np.array([[1, 2], [3, 4]], dtype=int), (3, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
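
For the 2D case, the relationship between tile and the legacy * operator can be sketched in pure Python. The function tile_2d is a hypothetical helper over nested lists, written only to show that the list-style repetition becomes one special case (repetition along the last dimension) of a more general operation:

```python
def tile_2d(rows, reps_rows, reps_cols):
    """Repeat a nested-list matrix reps_rows times along the first dimension
    and reps_cols times along the last dimension."""
    return [list(row) * reps_cols for _ in range(reps_rows) for row in rows]


wide = tile_2d([[1, 2], [3, 4]], 1, 3)
tall = tile_2d([[1, 2], [3, 4]], 3, 1)
print(wide)  # [[1, 2, 1, 2, 1, 2], [3, 4, 3, 4, 3, 4]]
print(tall)  # [[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]]
```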

And of course, ndarray has a first-class well-defined slicing interface.
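
For reference, this is what Python itself hands to __getitem__ for the subscript forms an ND slicing interface would need to interpret. KeyEcho is a hypothetical helper class that simply returns the key it receives; how System.Array would act on these keys is not specified here:

```python
class KeyEcho:
    """Hypothetical helper: returns whatever key the subscript syntax produced."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()
single = echo[0]              # a plain index
row_slice = echo[:1]          # a slice object
mixed = echo[:1, 0]           # a tuple mixing a slice and an index
with_ellipsis = echo[..., 0]  # a tuple starting with Ellipsis

print(single)         # 0
print(row_slice)      # slice(None, 1, None)
print(mixed)          # (slice(None, 1, None), 0)
print(with_ellipsis)  # (Ellipsis, 0)
```

The mixed form is the one memoryview rejects above with "invalid slice key", while ndarray accepts it.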

The Plan

Luckily, the Python-level CLI array support is currently so limited that there is little standing in the way of the ndarray API pattern. The currently "not implemented" cases can simply be implemented along the ndarray lines, rather than the list lines. The only two problematic cases are the operators + and *. Changing their semantics would be a breaking change for existing IronPython code. Personally, I doubt they are being used a lot (it is currently far easier to use, e.g., lists or tuples than CLI arrays); nevertheless, this change has to be managed properly. I see the following steps as a possible way:

  1. Implement concatenate and tile but leave operators + and * unchanged.
  2. Add a new IronPython option (command-line and as an option to the engine, maybe even as an environment variable) that changes the semantics of the operators. The option would have three values: default, legacy, and ndarray. If it is ndarray, the numpy.ndarray semantics are applied. In the remaining cases the old semantics stay in place.
  3. Start generating runtime warnings if the operators in question are used but the runtime option is not explicitly set to legacy or ndarray (the option still defaults to default, which at this stage means legacy with a warning).
  4. Do a release, so that the users that are affected by this have the time to adapt and choose what to do in their code.
  5. Change the meaning of default to perform the ndarray semantics, but still with a warning (which can be silenced by being explicit about the choice).
  6. Do a release (probably a year later).
  7. Remove the warning. By then the users will have had enough time to adapt. For those coming from IronPython 2, a note can be added to the migration document.
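
The option handling in the steps above could be sketched roughly as follows. All names here (resolve_array_operator_semantics, default_means, warn_on_default) are hypothetical and do not correspond to any existing IronPython API; the two keyword parameters stand for what would change between releases:

```python
import warnings

VALID_OPTIONS = ("default", "legacy", "ndarray")


def resolve_array_operator_semantics(option="default",
                                     default_means="legacy",
                                     warn_on_default=True):
    """Return "legacy" or "ndarray" for the + and * operators on CLI arrays.

    Steps 3-4: default_means="legacy",  warn_on_default=True
    Steps 5-6: default_means="ndarray", warn_on_default=True
    Step 7:    default_means="ndarray", warn_on_default=False
    """
    if option not in VALID_OPTIONS:
        raise ValueError("unknown option value: {!r}".format(option))
    if option != "default":
        return option  # an explicit choice never warns
    if warn_on_default:
        warnings.warn(
            "array operator semantics not set explicitly; "
            "using {!r}".format(default_means),
            RuntimeWarning,
            stacklevel=2,
        )
    return default_means
```

Being explicit (legacy or ndarray) silences the warning at every stage, which is the escape hatch the plan relies on.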