Finalize/clean up ingest models, add additional preprocessing #1166
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1166      +/-   ##
==========================================
- Coverage   67.89%   67.84%   -0.05%
==========================================
  Files         109      110       +1
  Lines        5762     5847      +85
  Branches      848      641     -207
==========================================
+ Hits         3912     3967      +55
- Misses       1721     1754      +33
+ Partials      129      126       -3
```

☔ View full report in Codecov by Sentry.
```python
description: str | None = None


class Template(BaseModel, extra="forbid"):
```
This one will eventually be phased out, correct?
No - this is the class that maps to the dataset template definitions.
Okay, I see. I'm a bit confused by the presence of both `Config` and this class... they are somewhat the same, but not exactly.
`Template` exists simply to read in and validate the dataset definition.
`Config` is the "computed" config of a dataset. It has the template definition (attributes, instructions to ingest, etc.) as well as the details of archival (where to find the archived raw file, timestamp, etc.) and any other fields that need to be determined at runtime.
I thought it's the `Config` class that's computed, not `Template`?
Edit: ignore my comment here. lol, thought I was responding to the other thread
```python
file_format: file.Format
processing_mode: str | None = None
processing_steps: list[PreprocessingStep] = []
crs: str | None = None
```
is this attribute here to accommodate the old template?
Nothing in this PR is really to accommodate anything old. These config objects are meant for reading data about an archived dataset, so with that in mind I thought `crs` was worthy of being a top-level field, since it's quite relevant to how we use the data. Given that it's also stored in the parquet metadata, this is maybe up for debate.
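For reference, a CRS stored in parquet metadata lives under the `geo` key of the file's schema metadata (per the GeoParquet spec). A hedged sketch of pulling it back out, where `crs_from_geo_metadata` is a hypothetical helper, not code from this PR:

```python
import json


def crs_from_geo_metadata(schema_metadata: dict[bytes, bytes]):
    """Extract the primary geometry column's CRS from GeoParquet metadata.

    `schema_metadata` is the raw key/value metadata, as returned by e.g.
    pyarrow.parquet.read_schema(path).metadata. Key layout follows the
    GeoParquet spec; returns None if no CRS is recorded.
    """
    geo = json.loads(schema_metadata.get(b"geo", b"{}"))
    primary = geo.get("primary_column")
    return geo.get("columns", {}).get(primary, {}).get("crs")
```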
I see. My thought is that `crs` should "live" in one place under ingestion, to avoid 1) duplication and 2) any confusion over whether `crs` should or shouldn't be specified in both places.
I should clarify again that this model is just "computed" - it's never going to be defined manually
Got it. Then I will leave this to you and the group. I thought it was in the `Template` model - my bad.
`dcpy/models/lifecycle/ingest.py` (Outdated)
```python
class Template(BaseModel, extra="forbid", arbitrary_types_allowed=True):
    """Definition of a dataset for ingestion/processing/archiving in edm-recipes"""

    id: str
    acl: recipes.ValidAclValues

    target_crs: str | None = None
    attributes: DatasetAttributes | None = None
```
Maybe make it a required attribute, to avoid creation of new templates without relevant dataset info?
That seems like the right choice
I haven't seen this file. Nice!
Ok, finished my review! Love how the new dataset templates look, and the models representing them make so much more sense. Great work. The first commit is a bit obscure to me since I missed the previous related PR, so I can't provide feedback there. Questions:
@fvankrieken I spent some time on this Thursday, and will continue today. Might prod you for a walkthrough later.
@fvankrieken One immediate thought: would it make sense to publish JSON schemas for these models? (similar to …)
```python
config = configure.get_config(
    dataset_id, version=version, mode=mode, template_dir=template_dir
)
transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
```
Is this redundant, since `transform.validate_processing_steps` is called during `transform.preprocess`?
And if we do wanna validate early, would it make sense to do it during `configure.get_config`, or maybe `read_template`?
Talked about it and we'll keep it here. It's nicer to assemble these pieces at a high level rather than have ingest modules importing each other.
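That composition pattern can be sketched with minimal stand-ins for `configure.get_config` and `transform.validate_processing_steps` (the step names, signatures, and stub bodies here are assumptions, not the project's code):

```python
from types import SimpleNamespace


def get_config(dataset_id: str) -> SimpleNamespace:
    # Stand-in for configure.get_config
    return SimpleNamespace(
        id=dataset_id,
        ingestion=SimpleNamespace(processing_steps=["clean_column_names"]),
    )


KNOWN_STEPS = {"clean_column_names", "multi"}


def validate_processing_steps(dataset_id: str, steps: list[str]) -> None:
    # Stand-in for transform.validate_processing_steps
    unknown = [s for s in steps if s not in KNOWN_STEPS]
    if unknown:
        raise ValueError(f"{dataset_id}: unknown processing steps {unknown}")


def run(dataset_id: str) -> SimpleNamespace:
    config = get_config(dataset_id)
    # Validation happens here at the top level, failing fast before any
    # download/transform work, and without ingest modules importing each other
    validate_processing_steps(config.id, config.ingestion.processing_steps)
    return config
```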
Addressed my question during a code walk-through. Great work!
Notes from group chat, for follow-up PR:
🎉
I'd say the biggest thing for review here is still the models. See changes here: https://github.com/NYCPlanning/data-engineering/pull/1166/files#diff-6fffbad770f67b00cdc414260a7922b1de4258230b2dab380fefbb7843401fd0
For the model changes, it's easier to look at that file with the overall changes than to go by commit. Other than that, it's easier to go commit by commit (see the commit summaries below the yml/json).
The new templates look like this:
And the computed config then looks like this:
Other than that, it's helpful to go by commit:
- `Attributes` section to `Template` and `Config` models, falling in line a little bit with our product metadata
- `SortedBase` across the models that get dumped to `config.json` for nice serialization. Make a few more submodels of config to simplify what's at the top level
- `run.run` to make it easier to test (and aim it somewhere other than the "production" template folder)
- `ST_MULTI` in postgis. See test cases for the utility for expected behavior. This seems like maybe a necessary evil for now - I'd prefer to keep geometries as they are, but this keeps things more consistent with library outputs (and makes sure things are consistent when we import to postgis, which can be more finicky/specific with types)
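A hedged sketch of what such a "force multi" utility can look like using shapely (not the PR's actual implementation; it mirrors postgis's `ST_Multi` by promoting singlepart geometries and leaving already-multi ones untouched):

```python
from shapely.geometry import (
    LineString,
    MultiLineString,
    MultiPoint,
    MultiPolygon,
    Point,
    Polygon,
)
from shapely.geometry.base import BaseGeometry

# Map each singlepart type to its multipart wrapper
_MULTI = {Point: MultiPoint, LineString: MultiLineString, Polygon: MultiPolygon}


def to_multi(geom: BaseGeometry) -> BaseGeometry:
    """Promote a singlepart geometry to its multipart equivalent.

    Multi* and other geometry types pass through unchanged.
    """
    wrapper = _MULTI.get(type(geom))
    return wrapper([geom]) if wrapper else geom
```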