GEP 7 and updates to GEPs 1-5 necessitated by GEP 6 #855

Merged
hmgaudecker merged 353 commits into collect-components-of-namespaces from gep-07
Jul 23, 2025

Conversation

hmgaudecker
Collaborator

@hmgaudecker hmgaudecker commented Apr 7, 2025

What problem do you want to solve?

  • Add a GEP for the revamped interface
  • Update earlier GEPs to reflect the changes that have become necessary after GEP 6 (since our documentation is small, it does not make sense to keep outdated things around).
  • Add the finalised schema from Validate params files #880 as an appendix to GEP 3

@hmgaudecker hmgaudecker changed the base branch from main to collect-components-of-namespaces April 7, 2025 09:57

codecov bot commented Apr 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.57%. Comparing base (8267dbd) to head (3525917).

Additional details and impacted files
@@                        Coverage Diff                        @@
##           collect-components-of-namespaces     #855   +/-   ##
=================================================================
  Coverage                             77.57%   77.57%           
=================================================================
  Files                                   175      175           
  Lines                                  7563     7563           
=================================================================
  Hits                                   5867     5867           
  Misses                                 1696     1696           


MImmesberger and others added 27 commits May 12, 2025 17:35
Title says it all. Better to be explicit in the structure and allow for nulls than to leave things out accidentally.

---------

Co-authored-by: Marvin Immesberger <[email protected]>
…m:iza-institute-of-labor-economics/gettsim into rename-gettsim-params-fix-yaml-validation
Next set of to-dos from #897.

- Rename the parameters in GETTSIM's yaml files
- Restructure where useful (often, moving from scalars to dicts does
wonders for readability)
- Add now-required unit, reference_period, type keywords

---------

Co-authored-by: Marvin Immesberger <[email protected]>
@MImmesberger MImmesberger added this to the v1.0 milestone Jul 9, 2025
Member

@ChristianZimpelmann ChristianZimpelmann left a comment

Overall, it looks cool, but I feel like the entry barrier is still quite high for beginners. I made some comments (especially on aspects that beginners might find complicated).


```{raw} html
---
file: ./interface_dag.html
Member

The dag does not fit on the page and it is unclear how to scroll to the right.

Collaborator Author

I know I know... You can scroll by clicking at the bottom of the graph and dragging the pointer.

Collaborator Author

@hmgaudecker hmgaudecker left a comment

> Overall, it looks cool, but I feel like the entry barrier is still quite high for beginners. I made some comments (especially on aspects that beginners might find complicated).

Thanks! Implemented all of those except for one: there is no default for main_target(s), because it is too important to understand conceptually that you have to request a particular target. Any further concrete suggestions for lowering entry barriers are very welcome!

@ChristianZimpelmann
Member

Some other observations from playing around in the notebook:

result = main(
    date_str="2025-01-01",
    main_target=MainTarget.results.df_with_mapper,
)

Leads to

ValueError: The following arguments to `main` are missing for computing the desired output:

[
    "('input_data', 'flat')",
]

"flat" seems wrong here.

result = main(
    date_str="2025-01-01",
    main_target=MainTarget.specialized_environment.tax_transfer_dag,
)

Leads to

ValueError: The following data columns are missing.

[ ....

Probably better to respond that the argument input_data is missing completely

  • Why does the original_policy_environment obtained from MainTarget.policy_environment contain keys like anzahl_erwachsene_hh or p_id? It is not clear to me why these are part of the policy environment.

@hmgaudecker
Collaborator Author

> Some other observations from playing around in the notebook:

Thanks!!!

> result = main(
>     date_str="2025-01-01",
>     main_target=MainTarget.results.df_with_mapper,
> )
>
> Leads to
>
> ValueError: The following arguments to `main` are missing for computing the desired output:
>
> [
>     "('input_data', 'flat')",
> ]
>
> "flat" seems wrong here.

It is not wrong, but the message should be improved -- see #1005.

> result = main(
>     date_str="2025-01-01",
>     main_target=MainTarget.specialized_environment.tax_transfer_dag,
> )
>
> Leads to
>
> ValueError: The following data columns are missing.
>
> [ ....
>
> Probably better to respond that the argument input_data is missing completely

Agreed, see #1006

>   • Why does the original_policy_environment obtained from MainTarget.policy_environment contain keys like anzahl_erwachsene_hh or p_id? It is not clear to me why these are part of the policy environment.

As has always been the case, it includes all available functions that operate on data, such as anzahl_erwachsene_hh. In addition, it now contains the possible input columns, too (essentially a dynamic version of TYPES_INPUT_VARIABLES). Of course, "policy environment" is too narrow a term for some of these elements, but that has always been the case and I don't have a better term in store. Suggestions are welcome!

hmgaudecker and others added 14 commits July 15, 2025 14:08
### What problem do you want to solve?

`processed_data` uses an $O(n^2)$ approach to link original and internal
IDs. This PR implements an $O(n\cdot \log(n))$ approach.
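
For readers unfamiliar with the idea, the difference between the two complexity classes can be sketched as follows. This is only an illustration of the general technique (sort once, then binary-search), not the code in this PR. The function names are hypothetical, and the sketch assumes that original IDs are unique, that the internal ID is simply the row position, and that every reference points to an existing ID.

```python
import numpy as np


def link_ids_quadratic(original_ids: np.ndarray, references: np.ndarray) -> np.ndarray:
    """O(n^2) baseline: scan all original IDs once per reference."""
    out = np.empty(len(references), dtype=np.int64)
    for i, ref in enumerate(references):
        # Full linear scan for every single reference.
        out[i] = np.flatnonzero(original_ids == ref)[0]
    return out


def link_ids_sorted(original_ids: np.ndarray, references: np.ndarray) -> np.ndarray:
    """O(n log n): sort the IDs once, then binary-search each reference."""
    order = np.argsort(original_ids, kind="stable")  # O(n log n)
    sorted_ids = original_ids[order]
    # Binary search per reference: O(log n) each.
    positions = np.searchsorted(sorted_ids, references)
    # Map positions in the sorted array back to row positions (internal IDs).
    return order[positions]


original_ids = np.array([40, 10, 30, 20])
references = np.array([10, 40, 20, 20])  # e.g. a column pointing at other rows
print(link_ids_sorted(original_ids, references))
```

Both functions return the same mapping; only the amount of work differs.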

## Benchmarks

### On `gep-07` (3525917):

```cmd
====================================================================
SUMMARY TABLE
====================================================================
Dataset             numpy_time  numpy_hash  jax_time    jax_hash
--------------------------------------------------------------------
df_5000.parquet     1.2681      13106402    15.5897     bf85cb3d
df_10000.parquet    4.6791      308ca129    30.7932     57ba7579
df_20000.parquet    15.7451     51e8d0b4    62.4070     21636ea4
df_40000.parquet    54.0340     6ae704d8    137.1975    30bbf3ea
```

### This PR:

**[EDIT: updated results after cf37b75]**
```cmd
====================================================================
SUMMARY TABLE
====================================================================
Dataset             numpy_time  numpy_hash  jax_time    jax_hash
--------------------------------------------------------------------
df_5000.parquet     0.0378      13106402    0.8950      bf85cb3d
df_10000.parquet    0.0402      308ca129    0.8108      57ba7579
df_20000.parquet    0.1107      51e8d0b4    1.1354      21636ea4
df_40000.parquet    0.0853      6ae704d8    1.8208      30bbf3ea

```
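
As a rough sanity check on the complexity claim, the `numpy_time` columns of the two tables above can be turned into an empirical scaling exponent and a speedup per dataset. The numbers are copied from the tables; the variable names are illustrative, and the calculation assumes that each dataset is twice the size of the previous one, as the file names suggest.

```python
import math

# numpy_time column (seconds), copied from the two summary tables above.
sizes = [5_000, 10_000, 20_000, 40_000]
before = [1.2681, 4.6791, 15.7451, 54.0340]  # gep-07 (3525917)
after = [0.0378, 0.0402, 0.1107, 0.0853]  # this PR

# Each dataset doubles in size, so log2 of the ratio of consecutive timings
# estimates the exponent k in time ~ n**k.
exponents = [math.log2(t1 / t0) for t0, t1 in zip(before, before[1:])]
speedups = [b / a for b, a in zip(before, after)]

print("exponents before:", [round(k, 2) for k in exponents])
for n, s in zip(sizes, speedups):
    print(f"df_{n}.parquet: {s:.1f}x faster")
```

The exponents for the old timings come out somewhat below 2, which is consistent with a dominant quadratic step plus other overhead. The new timings are too small and too noisy (the 40000 run is faster than the 20000 run) to fit an exponent in the same way.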

The benchmark essentially runs

```python
result = main(
    date_str=None,
    input_data=InputData.df_and_mapper(
        df=data,
        mapper=MAPPER,
    ),
    main_targets=[MainTarget.processed_data],
    tt_targets=TTTargets(tree=TT_TARGETS),
    backend=backend,
)
```

on the targets defined in `interface_playground.ipynb` with differently
sized datasets that replicate the example household from the same
notebook `N` times (i.e., `N*3` persons in each dataset). The hashes
demonstrate that this PR creates `result` objects that are identical to
the ones created with the $O(n^2)$ approach.
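
The hashing code used for the tables lives in the attached scripts and is not shown in this thread. As a minimal sketch of how such a comparison could work, assuming the result can be represented as a dict of NumPy arrays (the helper `short_hash` is hypothetical):

```python
import hashlib

import numpy as np


def short_hash(columns: dict[str, np.ndarray]) -> str:
    """Return an 8-character hex digest over column names, dtypes and values."""
    digest = hashlib.sha256()
    # Sort the names so that dict insertion order does not affect the hash.
    for name in sorted(columns):
        arr = np.ascontiguousarray(columns[name])
        digest.update(name.encode())
        digest.update(str(arr.dtype).encode())
        digest.update(arr.tobytes())
    return digest.hexdigest()[:8]


old = {"p_id": np.array([0, 1, 2]), "value": np.array([0.5, 1.5, 2.5])}
new = {"value": np.array([0.5, 1.5, 2.5]), "p_id": np.array([0, 1, 2])}
print(short_hash(old) == short_hash(new))  # same content, different key order
```

Equal hashes across both implementations then indicate byte-identical results for the hashed columns.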

To reproduce the benchmarks:
- Run `make_data.py` (see the attached .zip) to create the example datasets
- Run `benchmark_comparison.py` to create the tables above


[benchmark.zip](https://github.com/user-attachments/files/21327575/benchmark.zip)

---------

Co-authored-by: Hans-Martin von Gaudecker <[email protected]>
Co-authored-by: mj023 <[email protected]>
@hmgaudecker hmgaudecker merged commit 5fe956f into collect-components-of-namespaces Jul 23, 2025
14 of 15 checks passed
@hmgaudecker hmgaudecker deleted the gep-07 branch July 23, 2025 16:42
Development

Successfully merging this pull request may close these issues.

DOC: Update GEP 01 to reflect namespaces
5 participants