Client-side flattening of a container #901

genematx · 2025-02-28T15:10:00Z

A third attempt at implementing a flat-namespaced container, this time (mostly) on the client side. The data is stored in a new Composite container, but the flat namespace is implemented by the Python client itself: the client keeps a mapping from table column names to their full paths. The distinct structure family plays a role akin to a spec here, though its functionality is expected to grow in the future.
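
To make the idea concrete, here is a minimal sketch of such a mapping. It is not the code in this PR: `build_flat_keys` and the layout of the `parts` dictionary are hypothetical, and a plain dict stands in for the contents that the real client would fetch from the server.

```python
def build_flat_keys(parts):
    """Map each flat key to its full path inside the composite.

    `parts` is a hypothetical stand-in for the composite's contents:
    a dict mapping a part name to None (an array) or to a list of
    column names (a table).
    """
    mapping = {}
    for name, columns in parts.items():
        if columns is None:
            # Arrays are addressed directly by their own key.
            candidates = {name: name}
        else:
            # Table columns are promoted to the top level of the namespace;
            # the table name itself is not a flat key.
            candidates = {col: f"{name}/{col}" for col in columns}
        for flat_key, path in candidates.items():
            if flat_key in mapping:
                # A flat namespace cannot hold two entries with the same name.
                raise ValueError(f"Key collision on '{flat_key}'")
            mapping[flat_key] = path
    return mapping


parts = {"arr1": None, "arr2": None, "tab1": ["colA", "colB"]}
flat = build_flat_keys(parts)
print(flat)
```

The collision check mirrors the intent of the "Check for column name collisions" commit: once columns are lifted out of their tables, names must be unique across the whole composite.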

Related: #668
Issue: #824

Checklist

Add a Changelog entry
Add the ticket number which this PR closes to the comment section

genematx · 2025-02-28T15:13:15Z

tiled/client/container.py


+class Composite(Container):
+
+    @property


possibly cached?

danielballan

Awesome, nice to see this converging.

I'd like to test-drive it a bit, which I haven't yet done, but I gave a close read.

tiled/client/container.py

tiled/catalog/adapter.py

tiled/structures/core.py

tiled/client/container.py

danielballan · 2025-03-12T16:18:32Z

This will also need a database migration to extend the StructureFamilies enum in PostgreSQL. This command will create a new stub file with a random name.

python -m tiled.catalog revision -m "Add composite structure family"

Fill in the stub, using this example as a reference: https://github.com/bluesky/tiled/blob/main/tiled/catalog/migrations/versions/0b033e7fbe30_add_awkward_to_structurefamily_enum.py

Then update the list of revisions in tiled/catalog/core.py.
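
For orientation, the upgrade step of a migration like this usually comes down to a single PostgreSQL statement. The sketch below only builds that statement and makes no claim about the contents of the actual revision file. The enum type name `structurefamily` is an assumption inferred from the filename of the linked example, and `upgrade_statements` is a hypothetical helper. In an Alembic revision the statement would be passed to `op.execute(...)`; because older PostgreSQL versions reject `ALTER TYPE ... ADD VALUE` inside a transaction block, it is commonly run inside `op.get_context().autocommit_block()`.

```python
def upgrade_statements(dialect_name, enum_name="structurefamily", new_value="composite"):
    """Return the SQL needed to add `new_value` to a PostgreSQL enum type.

    `enum_name` is an assumed type name; check it against the real schema.
    """
    if dialect_name != "postgresql":
        # This sketch only covers PostgreSQL's native enum types.
        return []
    # IF NOT EXISTS makes the statement safe to re-run.
    return [f"ALTER TYPE {enum_name} ADD VALUE IF NOT EXISTS '{new_value}'"]


for statement in upgrade_statements("postgresql"):
    print(statement)
```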

danielballan · 2025-03-13T01:42:57Z

Here is @genematx's test script from our interactive session today, for Future Us:

import numpy as np
import pandas as pd

# Assumes `c` is an already-connected Tiled client with write access.
df = pd.DataFrame({
    "colA": np.random.randn(10),
    "colB": np.random.randint(0, 10, 10),
    "colC": np.random.choice(["a", "b", "c", "d", "e"], 10),
})

# Create the composite node, then write three arrays and one table into it.
Y = c.create_composite("test", metadata={"attrs": {"b": 2}})

Y.write_array(np.random.randn(10), key="arr1", metadata={"attrs": {"c": 3}})
Y.write_array(np.random.randn(10), key="arr2", metadata={"attrs": {"d": 4}})
Y.write_array(np.random.randn(20), key="arr3", metadata={"attrs": {"e": 5}})
Y.write_dataframe(df, key="tab1", metadata={"attrs": {"f": 6}})

I like how this turned out:

In [35]: c['test']
Out[35]: <Composite {'arr1', 'arr2', 'arr3', 'colA', 'colB', 'colC'}>

In [36]: c['test']['tab1']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[36], line 1
----> 1 c['test']['tab1']

File ~/Repos/bnl/tiled/tiled/client/container.py:1149, in Composite.__getitem__(self, key, _ignore_inlined_contents)
   1147     key = self._flat_keys_mapping[key]
   1148 else:
-> 1149     raise KeyError(
   1150         f"Key '{key}' not found. If it refers to a table, use .parts['{key}'] instead."
   1151     )
   1153 return super().__getitem__(key, _ignore_inlined_contents)

KeyError: "Key 'tab1' not found. If it refers to a table, use .parts['tab1'] instead."

In [37]: c['test'].parts['tab1']
Out[37]: <DataFrameClient ['colA', 'colB', 'colC']>

Perhaps we should make that error message even more generous by checking whether 'tab1' is in fact in parts.

Soon we will need c['test'].read() (which I guess returns an xarray.Dataset), but since Composite is usable without this, I would be in favor of making that a separate follow-up PR.

I haven't done line review yet, but conceptually I think this has landed.

genematx · 2025-03-13T14:11:23Z

> Perhaps we should make that error message even more generous by checking whether 'tab1' is in fact in parts.

I was thinking about this too, @danielballan. I'm afraid this would force us to make another API call (since we are not currently caching .parts and don't know what's in there). Do you think it would be worth caching the contents (similarly to how we do this with __len__)? Or I could use a heuristic, e.g. _flat_keys_mapping = {'tab1': None, 'colA': 'tab1/colA'}: if tab1 is not among the keys, it definitely does not exist, and if its value is None, it exists but is not directly addressable.
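
A rough sketch of that heuristic, for illustration only: `resolve_key` is a hypothetical helper, and the mapping layout follows the `{'tab1': None, 'colA': 'tab1/colA'}` example rather than the merged code.

```python
def resolve_key(flat_keys_mapping, key):
    """Translate a flat key to its full path, with a targeted error message."""
    if key not in flat_keys_mapping:
        # Unknown to the mapping: the key does not exist at all.
        raise KeyError(f"Key '{key}' not found.")
    path = flat_keys_mapping[key]
    if path is None:
        # Present but marked None: a table, reachable only through .parts.
        raise KeyError(f"Key '{key}' refers to a table. Use .parts['{key}'] instead.")
    return path


mapping = {"tab1": None, "colA": "tab1/colA", "arr1": "arr1"}
print(resolve_key(mapping, "colA"))
```

Both branches consult only the in-memory mapping, so the more specific message costs no extra API call. The mapping can still go stale if another client writes to the composite in the meantime.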

danielballan · 2025-03-13T15:00:35Z

I can think of other counterarguments, too. Like, maybe it's not in parts yet but it will be a moment later. And the point you raise about caching would make that even more complicated.

Let's leave it as is, at least for now.

danielballan

I took one more detailed read through.

tiled/adapters/sql.py

tiled/adapters/table.py

tiled/catalog/adapter.py

danielballan · 2025-03-13T16:58:04Z

tiled/catalog/migrations/versions/45a702586b2a_add_composite_structure_family.py

+
+
+def downgrade():
+    raise NotImplementedError


As of the beta releases, we have started implementing a downgrade path. I gather it shouldn't be too hard to do in this case?

Oh, nice! Added.
Please double-check, @danielballan.
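
As background on why a downgrade is awkward here: PostgreSQL has no `ALTER TYPE ... DROP VALUE`, so removing an enum value generally means rebuilding the type. The sketch below shows that generic pattern and is not what this PR ended up doing. The helper `downgrade_statements` is hypothetical, and the type, table, and column names passed to it are assumptions.

```python
def downgrade_statements(enum_name, kept_values, table, column):
    """Build SQL that recreates a PostgreSQL enum without a dropped value.

    Rows still using the dropped value must be deleted or remapped first,
    otherwise the cast in the ALTER TABLE statement fails.
    """
    old_name = f"{enum_name}_old"
    values = ", ".join(f"'{value}'" for value in kept_values)
    return [
        # Move the existing type out of the way.
        f"ALTER TYPE {enum_name} RENAME TO {old_name}",
        # Create a fresh type that lacks the dropped value.
        f"CREATE TYPE {enum_name} AS ENUM ({values})",
        # Re-point the column at the new type, casting through text.
        f"ALTER TABLE {table} ALTER COLUMN {column} "
        f"TYPE {enum_name} USING {column}::text::{enum_name}",
        # Remove the old type once nothing references it.
        f"DROP TYPE {old_name}",
    ]


# Illustrative subset only; a real migration must list every remaining value.
statements = downgrade_statements(
    "structurefamily", ["array", "container", "table"], "nodes", "structure_family"
)
for statement in statements:
    print(statement)
```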

tiled/client/container.py


genematx added 3 commits February 27, 2025 15:31

ENH: metadata for table columns

35fd53a

ENH: add composite structure family

65a4063

ENH: sketch of a simplest mapped flat structure

d1eff0a

genematx requested a review from danielballan February 28, 2025 15:10

genematx added 8 commits March 7, 2025 10:57

ENH: zipping 1D arrays

9f751fe

ENH: generate a dataset from composite

00f12dc

Remove virtual datasets WIP

9069911

ENH: Check for column name collisions

76560c2

TST: composite writing tests

afbdc78

MNT: lint

24a9366

TST: test for composite

fe99855

ENH: make _get_contents a method

ff9c9df

danielballan reviewed Mar 11, 2025

genematx added 4 commits March 11, 2025 11:46

FIX: split keys string client-side in the Container

04c57d5

MNT: address review comments

b60b75f

FIX:

82ec950

ENH: a hacky way to directly address composite columns from container

c377c00

genematx added 6 commits March 12, 2025 12:47

MNT: refactor and convert lists to sets

5b09190

Merge branch 'main' into flat-container

c60fa9e

ENH: restrict access to table in composite

b49885e

MNT: DB migration

f73c8f3

ENH: keep metadata for table columns

ce75101

ENH: fix metadata bugs

cea4924

genematx marked this pull request as ready for review March 12, 2025 21:37

genematx added 4 commits March 12, 2025 17:46

DOC: add Composite docs

53ea819

MNT: changelog entry

5718ae5

MNT: linting

643feb8

FIX: serialization of tables

c9d1ea1

danielballan mentioned this pull request Mar 13, 2025

Add composite structure family #668

Closed

FIX: ensure keys are tuple

9712ada

TST: fix paths for Windows tests

93dbe13

danielballan reviewed Mar 13, 2025

genematx and others added 5 commits March 13, 2025 13:23

REV: metadata for table columns

a99cca2

Update tiled/client/container.py

92090c1

Co-authored-by: Dan Allan <[email protected]>

MNT: minor renames

18d3efa

REL: a downgrade path for DB versioning

5d14318

Cannot drop value from enum

c2ddb25

danielballan merged commit d3c5fb9 into bluesky:main Mar 13, 2025
8 checks passed
