Finalize/clean up ingest models, add additional preprocessing #1166
fvankrieken merged 13 commits into main from
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1166      +/-   ##
==========================================
- Coverage   67.89%   67.84%   -0.05%
==========================================
  Files         109      110       +1
  Lines        5762     5847      +85
  Branches      848      641     -207
==========================================
+ Hits         3912     3967      +55
- Misses       1721     1754      +33
+ Partials      129      126       -3

☔ View full report in Codecov by Sentry.
description: str | None = None

class Template(BaseModel, extra="forbid"):
this one will eventually be phased out, correct?
No - this is the class that maps to the dataset template definitions
Okay, I see. I'm a bit confused by the presence of both `Config` and this class... They are somewhat the same, but not exactly.
`Template` exists simply to read in and validate the dataset definition.
`Config` is the "computed" config of a dataset. It has the template definition (attributes, instructions to ingest, etc.) as well as the details of archival (where to find the archived raw file, timestamp, etc.) and any other fields that need to be determined at runtime.
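For illustration, a rough sketch of that split. The fields below are simplified and partly hypothetical, not the actual dcpy models:

```python
# Rough sketch only: simplified, partly hypothetical fields, not the actual dcpy models.
from datetime import datetime

import yaml
from pydantic import BaseModel


class Template(BaseModel, extra="forbid"):
    """Read straight from a dataset definition yaml and validated, nothing more."""

    id: str
    acl: str
    attributes: dict | None = None   # name, description, url, ...
    ingestion: dict                  # source, file_format, processing_steps, ...


class ArchivalMetadata(BaseModel):
    raw_filename: str
    archival_timestamp: datetime


class Config(BaseModel):
    """Computed at runtime: the template definition plus archival/run details."""

    id: str
    version: str
    attributes: dict | None = None
    ingestion: dict
    archival: ArchivalMetadata       # only known once the raw file is archived


def read_template(path: str) -> Template:
    with open(path) as f:
        return Template(**yaml.safe_load(f))
```

The yaml never contains archival or run details, which is why `Config` is computed rather than written by hand.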
I thought it's the `Config` class that's computed, not `Template`?
Edit: ignore my comment here. lol, thought I was responding to the other thread
file_format: file.Format
processing_mode: str | None = None
processing_steps: list[PreprocessingStep] = []
crs: str | None = None
is this attribute here to accommodate the old template?
Nothing in this PR is really to accommodate anything old. These config objects are meant for reading data about an archived dataset, so with that in mind I thought `crs` was worthy of being a top-level field, since it's quite relevant to how we use the data. Given that it's also stored in the parquet metadata, this is maybe up for debate.
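For context, a small sketch of the two places a consumer could read the CRS from, assuming a geoparquet output plus the computed config.json next to it (file names are hypothetical):

```python
# Sketch: two places a consumer could read the CRS from. Paths are hypothetical.
import json

import geopandas as gpd

# from the archived geoparquet itself
gdf = gpd.read_parquet("dcp_commercialoverlay.parquet")
print(gdf.crs)                      # e.g. EPSG:4326

# from the computed config written alongside it
with open("config.json") as f:
    print(json.load(f)["crs"])      # the top-level field discussed here
```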
I see. My thought is that `crs` should "live" in one place, under ingestion, to avoid 1) duplication and 2) any confusion about whether `crs` should or shouldn't be specified in both places.
I should clarify again that this model is just "computed" - it's never going to be defined manually
Got it. Then I will leave this to you and the group. I thought it was in the Template model - my bad
dcpy/models/lifecycle/ingest.py (outdated diff)
acl: recipes.ValidAclValues

target_crs: str | None = None
attributes: DatasetAttributes | None = None
Maybe make it a required attribute? To avoid creating new templates without relevant dataset info.
That seems like the right choice
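A tiny sketch of what that change amounts to in pydantic terms (model shapes simplified and hypothetical):

```python
# Sketch of the suggestion: dropping the "| None = None" default makes the field
# required, so a template without dataset info fails validation. Models simplified.
from pydantic import BaseModel, ValidationError


class DatasetAttributes(BaseModel):
    name: str
    description: str | None = None
    url: str | None = None


class Template(BaseModel, extra="forbid"):
    id: str
    attributes: DatasetAttributes   # required: no default


try:
    Template(id="dcp_commercialoverlay")
except ValidationError as e:
    print(e)   # reports that "attributes" is a required field
```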
I haven't seen this file. Nice!
Ok, finished my review! Love how the new dataset templates look, and the models representing them make so much more sense. Great work. The first commit is a bit obscure to me since I missed the previous related PR, so I can't provide feedback on it.
Questions:
@fvankrieken I spent some time on this Thursday and will continue today. Might prod you for a walkthrough later.
@fvankrieken One immediate thought: would it make sense to publish JSON schemas for these models? (similar to
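For reference, a minimal sketch of what exporting a JSON schema from one of these pydantic models could look like (pydantic v2 assumed; the output path is made up and the import path is assumed from the file under review):

```python
# Sketch of publishing a JSON schema from one of these models with pydantic v2.
# Output path is hypothetical; import path assumed from dcpy/models/lifecycle/ingest.py.
import json

from dcpy.models.lifecycle.ingest import Template

with open("schemas/ingest_template.schema.json", "w") as f:
    json.dump(Template.model_json_schema(), f, indent=2)
```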
config = configure.get_config(
    dataset_id, version=version, mode=mode, template_dir=template_dir
)
transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
Is this redundant, since `transform.validate_processing_steps` is called during `transform.preprocess`?
And if we do wanna validate early, would it make sense to do it during `configure.get_config` or maybe `read_template`?
Talked about it and we'll keep it here. It's nicer to assemble these pieces at a high level rather than have ingest modules importing each other.
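A sketch of that high-level assembly, roughly as described; the module layout and the `run()` signature are assumptions, only the two calls are taken from the snippet above:

```python
# Sketch: the run entrypoint wires configure and transform together so those
# modules never import each other. Layout and signature are assumed, not from the PR.
from dcpy.lifecycle.ingest import configure, transform


def run(dataset_id: str, version: str | None = None, mode: str | None = None,
        template_dir: str | None = None) -> None:
    config = configure.get_config(
        dataset_id, version=version, mode=mode, template_dir=template_dir
    )
    # fail fast on unknown/invalid steps before downloading or archiving anything
    transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
    ...  # then: archive the raw file, transform.preprocess(...), etc.
```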
sf-dcp left a comment:
Addressed my question during a code walk-through. Great work!
Notes from group chat
For follow-up PR
I'd say the biggest thing for review here is still the models. See changes here https://github.com/NYCPlanning/data-engineering/pull/1166/files#diff-6fffbad770f67b00cdc414260a7922b1de4258230b2dab380fefbb7843401fd0
For model changes, it's easier to look at that file with the overall changes than to go commit by commit. Other than that, it's easier to go by commit (commit summaries are below the yml/json).
The new templates look like this
And the computed config then looks like this
{ "id": "dcp_commercialoverlay", "version": "20240831", "crs": "EPSG:4326", "attributes": { "name": "DCP NYC Commercial Overlay Districts", "description": "Polygon features representing the within-tax-block limits for commercial overlay districts,\nas shown on the DCP zoning maps. Commercial overlay district designations are indicated in the OVERLAY attribute.\n", "url": "https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-gis-zoning.page#metadata" }, "archival": { "acl": "public-read", "raw_filename": "dcp_commercial_overlays.zip", "archival_timestamp": "2024-09-30T16:23:51.693603-04:00" }, "ingestion": { "target_crs": "EPSG:4326", "file_format": { "crs": "EPSG:2263", "encoding": "utf-8", "type": "shapefile" }, "processing_steps": [ { "name": "reproject", "args": { "target_crs": "EPSG:4326" } }, { "name": "rename_columns", "args": { "map": { "geom": "wkb_geometry" } } } ], "source": { "type": "edm_publishing_gis_dataset", "name": "dcp_commercial_overlays" } }, "columns": [ { "id": "overlay", "data_type": "text" }, { "id": "shape_leng", "data_type": "decimal" }, { "id": "shape_area", "data_type": "decimal" }, { "id": "wkb_geometry", "data_type": "geometry" } ], "run_details": { "type": "manual", "runner": { "username": "Finn van Krieken" }, "timestamp": "2024-09-30T16:23:51.693603-04:00", "logged": false } }Other than that, it's helpful to go by commit
Other than that, it's helpful to go by commit:

- `Attributes` section to `Template` and `Config` models, falling in line a little bit with our product metadata
- `SortedBase` across the models that get dumped to `config.json` for nice serialization. Make a few more submodels of config to simplify what's at the top level
- `run.run` to make it easier to test (and aim it somewhere other than the "production" template folder)
- `ST_MULTI` in postgis (see the sketch after this list). See test cases for the utility for expected behavior. This seems like maybe a necessary evil for now - I'd prefer to keep geometry as they are, but this keeps things more consistent with library outputs (and makes sure things are consistent when we import to postgis, which can be more finicky/specific with types)
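A sketch of what an `ST_Multi`-style coercion step does, written here in geopandas/shapely terms for illustration; the actual utility in this PR runs in postgis and may differ:

```python
# Sketch of ST_Multi-style coercion: promote single-part geometries so a column
# shares one multi type (illustrative only; the PR's utility lives in postgis).
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon


def to_multi(geom):
    """Wrap single polygons as MultiPolygon; leave everything else as-is."""
    if isinstance(geom, Polygon):
        return MultiPolygon([geom])
    return geom


gdf = gpd.GeoDataFrame(geometry=[Polygon([(0, 0), (1, 0), (1, 1)])], crs="EPSG:2263")
gdf["geometry"] = gdf.geometry.apply(to_multi)
print(gdf.geom_type.unique())   # ['MultiPolygon']
```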