Proposal for improvements to the IDTxl Data object #90

@daehrlich

Description

Problem: An IDTxl Data object internally stores the data samples as a three-dimensional numpy array with dimensions (n_processes, n_samples, n_repetitions). Currently, there is no way of representing more complex processes with 1) higher dimensionality and 2) meta-information, such as a continuous/discrete flag, that needs to be passed through to the estimator layer.

Proposal: Augment the representation of samples in the Data class by replacing the internal numpy array with an ndarray subclass that

  • contains additional meta-data such as dimensionality information and continuous/discrete flags for the individual processes
  • implements process-wise slicing along the first dimension, i.e. if the data consists of three processes with two dimensions each, data[1:3] should return a container of shape (4, n_samples, n_repetitions) containing the second and third processes
  • can easily be transformed into a regular numpy array in the estimator layer

Requirements:

  • Full backward compatibility with the existing implementations of the algorithms: if the dimensions of all variables are set to 1, the proposed ndarray subclass should behave like a regular numpy array
  • The numpy subclass objects need to be unpacked to regular ndarrays in the estimators to allow regular indexing and slicing on the first axis.
  • Minimal overhead: the implementation needs to ensure that data is not copied unnecessarily and that memory views are used whenever possible
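
The "minimal overhead" requirement rests on numpy's guarantee that `.view()` and basic slicing return views rather than copies. A quick plain-numpy check (nothing here is IDTxl-specific) of that mechanism:

```python
import numpy as np

# Shaped like the internal Data buffer: (rows, n_samples, n_repetitions).
raw = np.arange(24).reshape(4, 3, 2)
sliced = raw.view(np.ndarray)[1:3]   # a view followed by a basic slice
assert np.shares_memory(raw, sliced)  # no copy was made
sliced[0, 0, 0] = -1                  # writes through to the original buffer
assert raw[1, 0, 0] == -1
```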

Example of expected behaviour

a = NewArray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Process-level indexing. This shows how to get only the second process, which is two-dimensional
a[1]
--> NewArray([[3, 4, 5], [6, 7, 8]], process_dimensions=(2,), continuous=(False,))

# Process-level slicing
a[:2]
--> NewArray([[0, 1, 2], [3, 4, 5], [6, 7, 8]], process_dimensions=(1, 2), continuous=(True, False))

# Regular slicing along the second and possibly the third dimension
a[:, 1:]
--> NewArray([[1, 2], [4, 5], [7, 8], [10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Exposing the underlying numpy array (view) for estimation
a.to_numpy()
--> np.ndarray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
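
The behaviour above could be sketched as an ndarray subclass along the following lines. This is an illustration of one possible design, not IDTxl code: the `__getitem__` dispatch only handles plain integers and contiguous (step-1, non-empty) slices, and leaves everything else to regular numpy indexing.

```python
import numpy as np

class NewArray(np.ndarray):
    """Sketch: ndarray with per-process dimensionality and continuity flags."""

    def __new__(cls, data, process_dimensions=None, continuous=None):
        obj = np.asarray(data).view(cls)
        if process_dimensions is None:
            process_dimensions = (1,) * obj.shape[0]
        if continuous is None:
            continuous = (True,) * len(process_dimensions)
        obj.process_dimensions = tuple(process_dimensions)
        obj.continuous = tuple(continuous)
        return obj

    def __array_finalize__(self, obj):
        # Propagate meta-data to views created by regular numpy indexing.
        if obj is None:
            return
        self.process_dimensions = getattr(obj, 'process_dimensions', None)
        self.continuous = getattr(obj, 'continuous', None)

    def __getitem__(self, key):
        # Process-wise indexing: map a process index or contiguous slice
        # onto the underlying rows. Tuple keys (e.g. a[:, 1:]) fall through
        # to regular numpy indexing.
        if isinstance(key, (int, slice)):
            bounds = np.concatenate(([0], np.cumsum(self.process_dimensions)))
            procs = range(len(self.process_dimensions))[key]
            procs = [procs] if isinstance(key, int) else list(procs)
            rows = slice(int(bounds[procs[0]]), int(bounds[procs[-1] + 1]))
            out = super().__getitem__(rows)
            out.process_dimensions = tuple(self.process_dimensions[p] for p in procs)
            out.continuous = tuple(self.continuous[p] for p in procs)
            return out
        return super().__getitem__(key)

    def to_numpy(self):
        # Returns a view (no copy) of the underlying data.
        return self.view(np.ndarray)

a = NewArray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]],
             process_dimensions=(1, 2, 1), continuous=(True, False, True))
print(a[1].process_dimensions)   # the second, two-dimensional process
print(a[:2].continuous)          # first two processes
print(type(a.to_numpy()))        # plain numpy array for the estimators
```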

Issues/Open Questions

  • Numerical operations might return unexpected results when combined with the overridden __getitem__ operator. Since algorithms are expected to use only slicing/indexing/reordering and no numerical operations, I suggest these raise an error unless the object is unpacked to a regular numpy array in the estimation layer
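
One way to make numerical operations fail loudly (an assumption about how this could be done, not a decided design) is to opt the subclass out of numpy's ufunc protocol by setting `__array_ufunc__ = None`, which makes arithmetic operators raise TypeError until the data is unpacked:

```python
import numpy as np

class GuardedArray(np.ndarray):
    # Opting out of the ufunc protocol: arithmetic operators on this class
    # raise TypeError until the data is unpacked via .view(np.ndarray).
    __array_ufunc__ = None

a = np.arange(6).reshape(2, 3).view(GuardedArray)
try:
    _ = a + 1                # blocked while still wrapped
    blocked = False
except TypeError:
    blocked = True
# After unpacking, numerical operations work as usual.
total = int((a.view(np.ndarray) + 1).sum())
```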

Comments
This suggestion might be split into two: just adding continuous/discrete meta-information to the arrays, without allowing for multi-dimensional processes, should be easily achievable by a transparent plug-in replacement of the internal numpy arrays, with no need for conversion in the estimator layer.
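
A minimal sketch of that metadata-only variant (class and attribute names are illustrative): a transparent ndarray subclass that only carries the flags through views via `__array_finalize__`, leaving all indexing and numerical behaviour untouched.

```python
import numpy as np

class FlaggedArray(np.ndarray):
    """Sketch: ndarray that transparently carries continuous/discrete flags."""

    def __new__(cls, data, continuous=()):
        obj = np.asarray(data).view(cls)
        obj.continuous = tuple(continuous)
        return obj

    def __array_finalize__(self, obj):
        # Runs on every view/slice, so the flags ride along transparently.
        # Note: this minimal version does not re-index the flags when the
        # first axis is sliced.
        self.continuous = getattr(obj, 'continuous', ())

d = FlaggedArray(np.zeros((3, 100, 10)), continuous=(True, False, True))
assert d[:, :50].continuous == (True, False, True)  # survives regular slicing
assert float(d.mean()) == 0.0                       # numerical ops untouched
```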

Note that this proposal is as of yet not final and subject to discussion.
