NestedFrame.query() should handle mixed base and nested columns without erroring #154

Open
gitosaurus opened this issue Oct 15, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

@gitosaurus (Contributor)

Today, NestedFrame.query checks the expression it's given and raises ValueError if it mixes nested and base columns. Handling such expressions correctly would take three steps: tease out the sub-expressions that refer strictly to nested columns (by traversing the abstract syntax tree of the input expression), apply them and re-pack the result into an intermediate result, and then apply the base column expressions to this intermediate result.
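As a rough illustration of the AST traversal mentioned above, the sketch below uses Python's standard ast module to sort the column references in an expression into base and nested ones. It is not how NestedFrame.query is implemented; the helper name column_levels and the nest_names argument are hypothetical, and a real implementation would have to follow pandas' own expression parsing rather than plain Python syntax.

```python
import ast


def column_levels(expr, nest_names):
    """Return (base_columns, nested_columns) referenced by `expr`.

    Hypothetical helper: `nest_names` is the set of nested column names,
    so `nested.flux` counts as a nested reference when "nested" is in it.
    """
    base, nested = set(), set()

    class Collector(ast.NodeVisitor):
        def visit_Attribute(self, node):
            # `nested.flux` parses as Attribute(value=Name("nested"), attr="flux")
            if isinstance(node.value, ast.Name) and node.value.id in nest_names:
                nested.add(f"{node.value.id}.{node.attr}")
            else:
                self.generic_visit(node)

        def visit_Name(self, node):
            # Any bare name that is not the prefix of a nested reference
            base.add(node.id)

    Collector().visit(ast.parse(expr, mode="eval"))
    return base, nested


base_cols, nested_cols = column_levels("a > 2 & nested.flux > 50", {"nested"})
is_mixed = bool(base_cols) and bool(nested_cols)
```

A check like is_mixed only detects the mixed case; splitting the tree into per-level sub-expressions is the harder part described above.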

In an expression like a > 2 & nested.flux > 50, for example, the user would expect the resulting NestedFrame to have no a values which were <= 2 and no nested.flux values which were <= 50. And in an expression like a > 2 | nested.flux > 50, the user would still expect to retain rows where a <= 2 so long as they had some nested.flux > 50, but within those rows, they wouldn't expect to see any nested.flux <= 50. For those rows where a > 2, though, they'd expect to see all the nested.flux rows. In other words, as soon as there is a mixed-level expression, the nested rows sometimes need to be queried and repacked before continuing, or at least that should be the final effect.
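The expected results for the two expressions can be spelled out on a toy dataset. The sketch below uses plain Python lists rather than nested-pandas, and the function filter_mixed is hypothetical. It also assumes that a base row left with no nested rows is dropped, which the description above implies for the | case but does not state for the & case.

```python
def filter_mixed(rows, combine):
    """Apply `a > 2 <combine> flux > 50` row by row; combine is "and" or "or"."""
    out = []
    for row in rows:
        base_ok = row["a"] > 2
        hits = [f for f in row["flux"] if f > 50]
        if combine == "and":
            # Both conditions must hold: keep only matching flux values
            kept = hits if base_ok else []
        else:
            # Base condition alone keeps every flux value; otherwise
            # only the flux values that satisfy the nested condition
            kept = list(row["flux"]) if base_ok else hits
        if kept:  # assumption: rows with an empty nest are dropped
            out.append({"a": row["a"], "flux": kept})
    return out


rows = [
    {"a": 1, "flux": [10, 60]},
    {"a": 3, "flux": [20, 70]},
    {"a": 1, "flux": [30, 40]},
]

and_result = filter_mixed(rows, "and")
or_result = filter_mixed(rows, "or")
```

Here and_result keeps only the a = 3 row with flux [70], while or_result keeps the first row trimmed to [60] and the a = 3 row with both of its flux values.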

Logically, if there were a method to unpack all nests and broadcast all base columns across them, then we would take the result of self.eval(expr) and do something like self.flatten_all().loc[result].repack_all(), but this would likely not be performant.
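The flatten, filter, and repack idea can be imitated with plain pandas to see both the semantics and the cost. This is only a sketch: flatten_all and repack_all do not exist, so a join and a groupby stand in for them, and the nested column is modeled as a separate DataFrame sharing the base index.

```python
import pandas as pd

# Base table: one row per object
base = pd.DataFrame({"a": [1, 3, 1]}, index=[0, 1, 2])
# Stand-in for a nested column: several flux rows per base index value
nested = pd.DataFrame({"flux": [10, 60, 20, 70, 30, 40]}, index=[0, 0, 1, 1, 2, 2])

# "Flatten": broadcast base columns across the nested rows.
# Every base value is repeated once per nested row.
flat = base.join(nested)

# Evaluate the mixed expression `a > 2 | flux > 50` on the flat table
mask = (flat["a"] > 2) | (flat["flux"] > 50)
kept = flat.loc[mask]

# "Repack": collect the surviving flux values per base row
packed = kept.groupby(level=0)["flux"].agg(list)
result_base = base.loc[packed.index]
```

The intermediate flat table has one row per nested row with all base columns copied onto it, so its size grows with the total number of nested rows times the number of base columns.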

@hombit (Collaborator) commented Oct 16, 2024

I’m against the broadcasting approach: it may cause memory usage to explode, while one of the core ideas of nested-pandas is to never have the “joined” version of the base and nested columns. We either need to find another way to do it or not implement this feature.

@dougbrn dougbrn added the enhancement New feature or request label Oct 17, 2024
@gitosaurus gitosaurus self-assigned this Oct 21, 2024