Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEP 1: Namespaces #14

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from
Draft

DEP 1: Namespaces #14

wants to merge 8 commits into from

Conversation

hmgaudecker
Copy link
Member

No description provided.

@hmgaudecker hmgaudecker marked this pull request as draft February 21, 2023 11:39
tests/test_trees.py Outdated Show resolved Hide resolved
Co-authored-by: Lars Reimann <[email protected]>
@lars-reimann
Copy link
Member

lars-reimann commented Feb 27, 2023

What should the signature of concatenate_functions be after the change?

Concatenate functions currently has the following arguments:

Name Type Default value
functions Union[list[Callable], dict[str, Callable]]
targets Union[str, list[str], NoneType] None
return_type Literal["tuple", "list", "dict"] "tuple"
aggregator Optional[Callable] None
enforce_signature bool True

Based on the tests it seems that these new arguments are needed:

Name Type Default value
functions_mode Literal["tree", "flat"] "tree"
input_mode Literal["tree", "flat"] "tree"
output_mode Literal["tree", "flat"] "tree"
name_clashes Literal["raise", "ignore", "warn"] "raise"

Questions:

  1. Can aggregator and enforce_signature be kept unchanged?
  2. Does output_mode replace return_type? Are the options "tuple", "list", "dict" not needed anymore?
  3. Do we need a targets_mode (values "tree" and "flat") as well?
  4. What should the type of functions be (Possible Containers)?
  5. What should the type of targets be? I guess the same as before with the additional option to provide an arbitrarily nested dict.

Other considerations:

input_mode and output_mode control the inputs and outputs of the returned function. These arguments are necessary. functions_mode and a potential targets_mode, however, only tell the system how other arguments, namely functions and targets are supposed to be interpreted.

Problem 1: It's easy to accidentally pass e.g. functions in one format but forget to set functions_mode properly. At best this error is caught during a validation phase. At worst, functions is silently accepted and the system produces wrong results.

Problem 2: The type of an argument like functions or targets is a very long union type, which keeps growing whenever we add another option. Any client of the API that, say, wraps concatenate_api into another function, has to repeat the union type.

This tight coupling between the arguments is an indicator, that a class for the function registry (Possible Containers) and maybe another for a target specification would make sense. These could then have some static/class methods like from_flat_dict or from_nested_dict to create instances. When we want to add another way to construct them, we add another method.

The function registry would have the additional benefit, that it can immediately check for name clashes as functions get added.

@hmgaudecker
Copy link
Member Author

Thanks!

Questions:

  1. Can aggregator and enforce_signature be kept unchanged?

I would think so. Maybe we'll need to restrict the possible combinations of aggregator and output_mode, but that would be it. We'll need to think about the docstring (Binary reduction function that is used to aggregate the targets into a single target), I would appreciate an example.

  1. Does output_mode replace return_type? Are the options "tuple", "list", "dict" not needed anymore?

No, but if output_mode = "tree", only return_type = "dict" makes sense.

  1. Do we need a targets_mode (values "tree" and "flat") as well?

I think that might just be a better name than output_mode. Sorry for the confusion.

  1. What should the type of functions be (Possible Containers)?

We would keep the current set and add dictionaries with arbitrary nesting, I believe. Question is whether we allow a module as syntactic sugar to add all functions contained in there as candidates.

In the example:

{
    "pensions": pensions
}

instead of:

{
    "pensions": {
        "eligible": pensions.eligible,
        "benefit": pensions.benefit,
    },
}

But this might be opening a can of worms (e.g., what do you do with functions imported into pensions.py?), so I am not sure it would be worth it.

  1. What should the type of targets be? I guess the same as before with the additional option to provide an arbitrarily nested dict.

Exactly, see 2.

I am not sure we want the functions_mode. @janosg, could you check the bottom of test_trees.py and voice your opinion?

The type of an argument like functions or targets is a very long union type, which keeps growing whenever we add another option. Any client of the API that, say, wraps concatenate_api into another function, has to repeat the union type.

This tight coupling between the arguments is an indicator, that a class for the function registry (Possible Containers) and maybe another for a target specification would make sense. These could then have some static/class methods like from_flat_dict or from_nested_dict to create instances. When we want to add another way to construct them, we add another method.

The function registry would have the additional benefit, that it can immediately check for name clashes as functions get added.

I am generally sympathetic to the idea of those classes, your argument makes sense to me. Only restriction would be that the user would never need to specify them but could just pass the Python built-ins we have so far.

@janosg
Copy link
Member

janosg commented Feb 27, 2023

What I write below assumes that we want a very general solution (similar to pytrees in jax and estimagic) with arbitrary levels of nesting and almost arbitrary container types, including lists, dicts and mixes thereof. I am not sure, this is what you need and want for gettsim. Designing something at this generality in a way that every edge case makes sense is vey hard and took us months in estimagic despite the fact that we just needed minor extensions to what JAX does.

I don't think we need functions_mode nor targets_mode. Whether functions are flat or nested can be discovered and targets should always mirror the structure of functions. Examples of such mirroring structures can be found in many places in JAX (e.g. in_axes argument of vmap) and we also copied it in estimagic (e.g. for bounds).

@hmgaudecker In the test file you wanted me to look at I found just one commented out mention of functions_mode and it was not clear what it would do. Maybe a leftover from some early stage brainstorming session?

  1. Can aggregator and enforce_signature be kept unchanged?

Enforce signature should not be a problem, defining what aggregator should do in the very general case is a hard design problem.

  1. Does output_mode replace return_type? Are the options "tuple", "list", "dict" not needed anymore?

This is also a hard design problem as there seems to be some overlap, but the overlap is incomplete. E.g. there could be multiple flat return types (tuple, list, ...). I use this level of control in multiple applications.

  1. Do we need a targets_mode (values "tree" and "flat") as well?

No. targets should mirror functions, see examples above.

  1. What should the type of functions be (Possible Containers)?

Almost anything but that does not actually make it difficult because of how pybaum or jax.tree_utils work. What makes everything difficult is the arbitrary nesting depth.

  1. What should the type of targets be? I guess the same as before with the additional option to provide an arbitrarily nested dict.

Targets should mirror functions, see examples above.

I think the coupling problem does not really exist as it was based on the assumption that we need a functions_mode and targets_mode.

I am against a class for functions. I think the whole reason why pytrees are so amazing from a user perspective is that they are not a new class but a combination of built in containers and you can choose for each application what is best.

What makes them so amazing from a developer perspective is that the small number of operations defined on them (tree_flatten, tree_unflatten, tree_map, ...) let you go very quickly from something extremely flexible to something that is very easy to work with. The constructors you suggest (from_flat_dict, from_nested_dict) look like they are basically tree_flatten and tree_unflatten in disguise.

Regarding the issue in typing: I think Jaxtyping has found a nice way of dealing with this generality.

Above I mention many open design problems. Those would all become easier if we are not talking about general pytrees with arbitrary nesting for functions but let's say a two level nested dict.

@hmgaudecker
Copy link
Member Author

@hmgaudecker In the test file you wanted me to look at I found just one commented out mention of functions_mode and it was not clear what it would do.

Apologies, I guess my thinking was to pass a flat dict and to use the nesting structure based on double underscores nevertheless. However, it makes perfect sense to forget about this, I don't think anybody ever wants to do this and if they want, they should build in a call of tree_unflatten first.

@janosg
Copy link
Member

janosg commented Feb 27, 2023

Brief extension: I think the only reasonable behavior of aggregator is to call it on flattened outputs. Would be backwards compatible.

So the main design problem left is how the current return_type interacts with the new output_mode

@hmgaudecker
Copy link
Member Author

Thanks for all your comments, @janosg ! Yes we should stick as closely as possible to pytrees in Jax and we probably should support only tree as mode with the tree-based nesting. People can call tree_unflatten if they seriously want to work with flat dictionaries outside of dags. The flat stuff mostly made sense to fix behaviour in the tests.

I am not sure I understand why a two-level nested dict would be easier to handle than arbitrary levels? I would think that for dags in tree mode, we could restrict the registry to dictionaries and the leafs to functions?

@janosg
Copy link
Member

janosg commented Feb 27, 2023

I think the hardest step in the implementation will be to extend short function names to function names that are unique in the global namespaces. Everything else should be easy to handle with pybaum functionality (even though sometimes a bit of creativity might be required).

This step would be much easier for a dict with fixed nesting depth. It probably also has an elegant recursive solution, but (at least for me) it's usually not trivial to find these.

@hmgaudecker
Copy link
Member Author

I think the hardest step in the implementation will be to extend short function names to function names that are unique in the global namespaces.

You mean cases like:

{
   "a": {"b": {"c": d},
   "a__b": {"c": d}
}

?

But I am probably just missing something fundamental once more.

@codecov
Copy link

codecov bot commented Feb 28, 2023

Codecov Report

Merging #14 (711f869) into main (ab7f8a2) will decrease coverage by 5.08%.
The diff coverage is 51.72%.

❗ Current head 711f869 differs from pull request most recent head eec3d23. Consider uploading reports for the commit eec3d23 to get more accurate results

@@            Coverage Diff             @@
##             main      #14      +/-   ##
==========================================
- Coverage   96.46%   91.38%   -5.08%     
==========================================
  Files           8       12       +4     
  Lines         453      511      +58     
==========================================
+ Hits          437      467      +30     
- Misses         16       44      +28     
Impacted Files Coverage Δ
tests/unemployment_insurance.py 37.50% <37.50%> (ø)
tests/pensions.py 40.00% <40.00%> (ø)
tests/test_trees_flat.py 41.93% <41.93%> (ø)
tests/test_trees.py 85.71% <85.71%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@hmgaudecker
Copy link
Member Author

So, just spent some time on this, wrote a long post and lost it. Apologies. Bottom line is that I think I/we got carried away when working on this in January. We really should stick to the basics, a tree in dags can be a nested dictionary with leafs = functions.

I also believe that we cannot determine a unique signature based on functions alone. The reason is that if we ask for a variable somewhere lower in the hierarchy than at the very top, we cannot know whether the user will pass it a variable at that level or somewhere further up. See test_trees.py for an example.

Updating @lars-reimann's table above, I would think of the interface as follows:

Name Type Default value Notes
functions Union[list[Callable], dict[str, Callable]] Only dict allowed if mode="tree"
targets Union[str, list[str], dict[str], NoneType] None Only dict or None allowed if mode="tree"
mode Literal["tree", "flat"] "flat"
return_type Literal["tuple", "list", "dict"] "tuple" Only "dict" allowed if mode="tree"
inputs Union[dict, NoneType] None Only dict allowed if mode="tree"
aggregator Optional[Callable] None Only None allowed if mode="tree"
enforce_signature bool True
name_clashes Literal["raise", "ignore", "warn"] "raise" Ignored if mode="flat"

Looking at it suggests to me that we'll want to functions as entry point to dags though:

  • concatenate_functions -- current interface, no changes
  • concatenate_functions_tree -- the tree mode outlined here

Does that makes sense? I'll also be available in the afternoon for chatting.

@lars-reimann
Copy link
Member

Looking at it suggests to me that we'll want to functions as entry point to dags though:

  • concatenate_functions -- current interface, no changes
  • concatenate_functions_tree -- the tree mode outlined here

I'm all for that: It's easier to implement (less validation required) and use (less documentation to understand).

@janosg
Copy link
Member

janosg commented Feb 28, 2023

I think the hardest step in the implementation will be to extend short function names to function names that are unique in the global namespaces.

You mean cases like:

{
   "a": {"b": {"c": d},
   "a__b": {"c": d}
}

?

But I am probably just missing something fundamental once more.

No. Sorry, I should have said argument names instead of function names. From normal tree_flatten and leaf_names you can easily get function names that are globally unique and look like your example above. The hard step is to also rename the arguments of functions that depend on the renamed functions.

@hmgaudecker
Copy link
Member Author

hmgaudecker commented Mar 1, 2023

No. Sorry, I should have said argument names instead of function names. From normal tree_flatten and leaf_names you can easily get function names that are globally unique and look like your example above. The hard step is to also rename the arguments of functions that depend on the renamed functions.

Okay, I think that is precisely the reason why I believe we'll need to include the (structure of) inputs in concatenate_functions_tree.

Minimal example:

def b(c):
    return c ** 2


{
   {"a": b},
}

Based on that and the envisaged structure, it is impossible to tell whether the inputs will be:

  1. Top-level
    {
        "c": 0.5
    }
  2. Bottom-level
    {
       "a": {"c": 0.5}},
    }

With deeper nesting, intermediate versions would of course also be possible.

I'd think the best thing we could do is to provide a template generator with a keyword level, which might be Literal["top", "bottom"].

This would solve the uniqueness problem, too. In the global namespace, for the two cases above:

  1. The argument would be called c
  2. The argument would be called a__c

@janosg @lars-reimann : If you agree I'd continue a little with the DEP along the lines we discussed so that Lars can go ahead. Thanks!

@janosg
Copy link
Member

janosg commented Mar 1, 2023

I thought this is what the input mode is for. If you have input_mode = "tree", the inputs for the concatenated function are like in 2. If it is "flat" the input would be similar to 1 but using long names, i.e.

{"a__c": 0.5}

I am not saying, that this would necessarily be user friendly. Probably most would find it confusing. But it is probably the only natural way to define something that makes sense for working with arbitrarily nested structures.

According to JAX philosophy, there would probably be no input mode and only 2 would work as it is more in line with pytree thinking.

@hmgaudecker
Copy link
Member Author

My point is that in concatenate_functions, we can uniquely predict the names of inputs from the DAG. Note that in the case 1, we could also use c in d here:

def b(c):
    return c ** 2


def d(c):
    return c ** 0.5


{
   {"a": b},
    d,
}

So the two cases represent different behavior, not different ways to specify the inputs (we would have to specify def(a__c) if we wanted to get at c in the second case).

Put differently, we cannot predict unique input names anymore in concatenate_functions_tree without knowing the structure of the inputs because we cannot predict the level at which a parameter will be passed.

flat vs tree is a separate discussion and if we agree on the above, we might be able to discover that automatically.

@janosg
Copy link
Member

janosg commented Mar 1, 2023

Ok, I see the point.

It might be a good idea to start over and think about solutions to the problems in gettsim (mainly long function/variable names) that do not have the side effect that the concatenated taxes and transfers function is much harder to use.

@hmgaudecker
Copy link
Member Author

What do you mean by harder to use? The only thing that is different is that I'll need to know the inputs ex ante (which should always be the case, you know your data).

Other than that, if thinking about flat input mode, all that happens is that your input names become longer and more comprehensible. I am all on that side of the trade-off.

Am I missing anything?

@hmgaudecker
Copy link
Member Author

(The deeper issue in GETTSIM is to get namespaces in there, not long names per se, though related )

@lars-reimann
Copy link
Member

lars-reimann commented Mar 1, 2023

It seems the problem is that we need to map a simple name to a qualified name. In your example we could map the simple name c to the qualified names c or a__c, however. The mapping can be ambiguous. This makes resolution impossible, since we need to either start at the top of the tree and follow the path described by the qualified name or look up the name at the current level.

Option 1: Start at the level that the function is defined at and work your way up. Use the first matching function/input. Problems:

  • We cannot access functions at a higher level if a function at a lower level has the same name.
  • We need to know the (shape of the) input data for the resolution.

Option 2: Mimic Python's module system to some extent. This would mean that simple names can only directly be used to access functions/inputs at the same level as the function. For everything else we could require qualified names. We could also offer a way to import a function/input, i.e. make them available by a simple name, though that seems rarely worth it. Problem:

  • How do we differentiate whether an argument is specified by a simple or qualified name?
    • If it has double underscores, it's a qualified name.
    • If it doesn't have double underscores, it can be either. Then we again don't know whether we describe a top-level function/input or a function/input at the same level again.

@janosg
Copy link
Member

janosg commented Mar 1, 2023

I think getting namespaces into gettsim is not necessarily the same as getting namespaces into dags by allowing nested specifications of functions. This would possibly go more in the direction of a DSL then, but it could basically try to leverage how modules and namespaces are handled in Python instead of implementing a new way of handling this stuff via nested dictionaries.

Any solution on the dags level seems to involve that users (or gettsim authors) have to:

  • learn how to translate short names into long names
  • construct a nested dictionary of functions with arbitrary structure (i.e. no linters, no autocomplete, problems if names contain typos)
  • use very long names for function inputs or construct nested function inputs

But probably @lars-reimann is in a much better position to think about how such an alternative solution could look like.

@janosg
Copy link
Member

janosg commented Mar 1, 2023

@lars-reimann saw it just now: I like the idea to mimic python's module system and I think it goes in the same direction of what I suggested above.

@hmgaudecker
Copy link
Member Author

Great discussion!!!

Option 2 also sounds good to me.

Maybe at deeper levels of the hierarchy, we could require something like an absolute path? Say top__x or abs__x to access a variable from elsewhere?

This said, I still do not see a large downside to requiring knowledge of the input structure upfront, not in applications of GETTSIM and not in anything else I can think of. Could you elaborate a little why you are worried about this?

@janosg
Copy link
Member

janosg commented Mar 1, 2023

This said, I still do not see a large downside to requiring knowledge of the input structure upfront, not in applications of GETTSIM and not in anything else I can think of. Could you elaborate a little why you are worried about this?

I agree that this is not really a problem.

@hmgaudecker
Copy link
Member Author

I just updated the example and DEP a little bit, largely within Option 1 still.

If we found a good way to use absolute paths / imports (aka qualified names, I guess) I'd be fine with Option 2. I might be stuck in my own thinking too much.

It would be great to see a prototype to play with as the next step, the limits of my imagination have been pretty much reached...

@lars-reimann lars-reimann mentioned this pull request Mar 10, 2023
5 tasks
@lars-reimann lars-reimann mentioned this pull request Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants