# GEP 7 and updates to GEPs 1-5 necessitated by GEP 6 #855
## Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff                           @@
##    collect-components-of-namespaces    #855  +/-  ##
=================================================================
  Coverage   77.57%   77.57%
=================================================================
  Files         175      175
  Lines        7563     7563
=================================================================
  Hits         5867     5867
  Misses       1696     1696
=================================================================
```
…ithub.com/iza-institute-of-labor-economics/gettsim into rename-gettsim-params-fix-yaml-validation
Title says it all. Better to be explicit in the structure and allow for nulls than to leave things out accidentally.

Co-authored-by: Marvin Immesberger <[email protected]>
…m:iza-institute-of-labor-economics/gettsim into rename-gettsim-params-fix-yaml-validation
Next set of to-dos from #897.

- Rename the parameters in GETTSIM's yaml files
- Restructure where useful (often, moving from scalars to dicts does wonders for readability)
- Add now-required unit, reference_period, type keywords

Co-authored-by: Marvin Immesberger <[email protected]>
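To make the second and third bullets concrete, here is a hedged sketch of the restructuring, written as the Python dicts the yaml files parse into. The parameter name and values are invented, and GETTSIM's actual schema may differ; only the `unit`, `reference_period`, and `type` keywords come from the commit message above.

```python
# Hypothetical before/after of a single parameter (name and values invented).

# Before: a bare scalar; unit and reference period are implicit.
before = {"kindergeld_satz": 250}

# After: the same parameter as a dict with the now-required
# unit, reference_period, and type keywords made explicit.
after = {
    "kindergeld_satz": {
        "type": "scalar",
        "unit": "Euros",
        "reference_period": "Month",
        "value": 250,
    },
}

# The value is unchanged; only the structure around it becomes explicit.
assert after["kindergeld_satz"]["value"] == before["kindergeld_satz"]
```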
```{raw} html
---
file: ./interface_dag.html
```
The DAG does not fit on the page, and it is unclear how to scroll to the right.
I know, I know... You can scroll by clicking at the bottom of the graph and dragging the pointer.
Overall, it looks cool, but I feel like the entry barrier is still quite high for beginners. I made some comments (especially on aspects that beginners might find complicated).
Thanks! Implemented all of those except for one (no default for `main_target(s)`; it is too important to understand conceptually that you have to request a particular target). Any further concrete suggestions for lowering entry barriers are very welcome!
Some other observations from playing around in the notebook:

```python
result = main(
    date_str="2025-01-01",
    main_target=MainTarget.results.df_with_mapper,
)
```

leads to a message in which "flat" seems wrong.

```python
result = main(
    date_str="2025-01-01",
    main_target=MainTarget.specialized_environment.tax_transfer_dag,
)
```

leads to another message. Probably better to respond that the argument …
Thanks!!!
It is not wrong, but the message should be improved -- see #1005.
Agreed, see #1006
As has always been the case, it includes all functions operating on data that are around, like …
### What problem do you want to solve?

`processed_data` uses an $O(n^2)$ approach to link original and internal IDs. This PR implements an $O(n \cdot \log(n))$ approach.

## Benchmarks

### On `gep-07` (3525917):

```cmd
====================================================================
                          SUMMARY TABLE
====================================================================
Dataset             numpy_time  numpy_hash   jax_time  jax_hash
--------------------------------------------------------------------
df_5000.parquet         1.2681    13106402    15.5897  bf85cb3d
df_10000.parquet        4.6791    308ca129    30.7932  57ba7579
df_20000.parquet       15.7451    51e8d0b4    62.4070  21636ea4
df_40000.parquet       54.0340    6ae704d8   137.1975  30bbf3ea
```

### This PR:

**[EDIT: updated results after cf37b75]**

```cmd
====================================================================
                          SUMMARY TABLE
====================================================================
Dataset             numpy_time  numpy_hash   jax_time  jax_hash
--------------------------------------------------------------------
df_5000.parquet         0.0378    13106402     0.8950  bf85cb3d
df_10000.parquet        0.0402    308ca129     0.8108  57ba7579
df_20000.parquet        0.1107    51e8d0b4     1.1354  21636ea4
df_40000.parquet        0.0853    6ae704d8     1.8208  30bbf3ea
```

The benchmark essentially runs

```python
result = main(
    date_str=None,
    input_data=InputData.df_and_mapper(
        df=data,
        mapper=MAPPER,
    ),
    main_targets=[MainTarget.processed_data],
    tt_targets=TTTargets(tree=TT_TARGETS),
    backend=backend,
)
```

on the targets defined in `interface_playground.ipynb` with differently sized datasets that replicate the example household from the same notebook `N` times (i.e., `N*3` persons in each dataset). The hashes demonstrate that this PR creates `result` objects that are identical to the ones created with the $O(n^2)$ approach.

To reproduce the benchmarks:

- Run `make_data.py` (see attached .zip) to create the example datasets
- Run `benchmark_comparison.py` to create the tables above

[benchmark.zip](https://github.com/user-attachments/files/21327575/benchmark.zip)

Co-authored-by: Hans-Martin von Gaudecker <[email protected]>
Co-authored-by: mj023 <[email protected]>
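For intuition about the algorithmic change, here is a minimal sketch under the assumption that linking original and internal IDs amounts to mapping arbitrary IDs to consecutive ranks. The function names and example IDs are invented; this is a sketch of the general technique, not GETTSIM's actual implementation.

```python
import numpy as np

# Invented example of original person IDs.
original_ids = np.array([1001, 7, 523, 42])

def link_ids_quadratic(ids):
    """O(n^2): for each ID, scan the whole array to compute its rank."""
    return np.array([int(np.sum(ids < i)) for i in ids])

def link_ids_sorted(ids):
    """O(n log n): sort once, then binary-search each ID's position."""
    return np.searchsorted(np.sort(ids), ids)

# Both approaches yield the same internal IDs, mirroring the identical
# hashes in the benchmark tables above.
assert (link_ids_quadratic(original_ids) == link_ids_sorted(original_ids)).all()
```

The sort-plus-`searchsorted` pattern replaces the per-element linear scan, so a one-off $O(n \cdot \log(n))$ sort dominates the cost, which is consistent with the much flatter timings in the second table.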
Merged commit 5fe956f into collect-components-of-namespaces.