Skip to content

Conversation

genematx
Copy link
Contributor

@genematx genematx commented Feb 28, 2025

A third attempt at implementing a flat-namespaced container, this time (mostly) on the client-side. While the data is stored in a new Composite container, the flat namespace is implemented by the python client itself: the client keeps a mapping of table column names to their full paths. The distinct structure family serves a role akin to a spec here, but its functionality is thought to be expanded in the future.

Related: #668
Issue: #824

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@@ -1051,6 +1050,57 @@ def write_dataframe(
return client


class Composite(Container):

@property
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly cached?

Copy link
Member

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, nice to see this converging.

I'd like to test-drive it a bit, which I haven't yet done, but I gave a close read.

@danielballan
Copy link
Member

danielballan commented Mar 12, 2025

This will also need a database migration to extend the StructureFamilies enum in PostgreSQL. This command will create a new stub file with a random name.

python -m tiled.catalog revision -m "Add composite structure family"

Add content with reference to this example: https://github.com/bluesky/tiled/blob/main/tiled/catalog/migrations/versions/0b033e7fbe30_add_awkward_to_structurefamily_enum.py

Then update the list of revisions in tiled.catalog.core.py.

@genematx genematx marked this pull request as ready for review March 12, 2025 21:37
@danielballan
Copy link
Member

Here is @genematx's test script from our interactive session today, for Future Us:

import pandas as pd
import numpy as np
df = pd.DataFrame({"colA": np.random.randn(10), "colB": np.random.randint(0, 10, 10), "colC": np.random.choice(["a", "b", "c", "d", "e"], 10)})

Y = c.create_composite("test", metadata={"attrs": {"b": 2}})

Y.write_array(np.random.randn(10, ), key="arr1", metadata={"attrs": {"c": 3}})
Y.write_array(np.random.randn(10, ), key="arr2", metadata={"attrs": {"d": 4}})
Y.write_array(np.random.randn(20, ), key="arr3", metadata={"attrs": {"e": 5}})
Y.write_dataframe(df, key="tab1", metadata={"attrs": {"f": 6}})

I like how this turned out:

In [35]: c['test']
Out[35]: <Composite {'arr1', 'arr2', 'arr3', 'colA', 'colB', 'colC'}>

In [36]: c['test']['tab1']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[36], line 1
----> 1 c['test']['tab1']

File ~/Repos/bnl/tiled/tiled/client/container.py:1149, in Composite.__getitem__(self, key, _ignore_inlined_contents)
   1147     key = self._flat_keys_mapping[key]
   1148 else:
-> 1149     raise KeyError(
   1150         f"Key '{key}' not found. If it refers to a table, use .parts['{key}'] instead."
   1151     )
   1153 return super().__getitem__(key, _ignore_inlined_contents)

KeyError: "Key 'tab1' not found. If it refers to a table, use .parts['tab1'] instead."

In [37]: c['test'].parts['tab1']
Out[37]: <DataFrameClient ['colA', 'colB', 'colC']>

Perhaps we should make that error message even more generous by checking whether 'tab1' is in fact in parts.


Soon we will need c['test'].read() (which I guess returns an xarray.Dataset) but since Composite is usable without this, I would be in favor making that a separate follow-up PR.


I haven't done line review yet, but conceptually I think this has landed.

@genematx
Copy link
Contributor Author

genematx commented Mar 13, 2025

Perhaps we should make that error message even more generous by checking whether 'tab1' is in fact in parts.

I was thinking about this too, @danielballan. I'm afraid this would force us to make another API call (since we are not currently caching .parts and don't know what's in there). Do you think it would be worth caching the contents (similarly to how we do this with __len__)? Or I can use some heuristic, e.g. _flat_keys_mapping = {'tab1': None, 'colA': 'tab1/colA'}, so if tab1 is not in Keys -- it's definitely not existing, and if the value is None, it's just not addressable.

@danielballan
Copy link
Member

I can think of other counterarguments, too. Like, maybe it's not in parts yet but it will be a moment later. And the point you raise about caching would make that even more complicated.

Let's leave it as is, at least for now.

Copy link
Member

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took one more detailed read through.



def downgrade():
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of the beta releases, we have started implementing a downgrade path. I gather it shouldn't be too hard to do in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nice! Added.
Please, doublecheck, @danielballan

@danielballan danielballan merged commit d3c5fb9 into bluesky:main Mar 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants