Finalize/clean up ingest models, add additional preprocessing #1166

Merged
fvankrieken merged 13 commits into main from fvk-ingest on Oct 16, 2024

Conversation

fvankrieken
Contributor

@fvankrieken fvankrieken commented Sep 30, 2024

I'd say the biggest thing for review here is still the models. See changes here https://github.com/NYCPlanning/data-engineering/pull/1166/files#diff-6fffbad770f67b00cdc414260a7922b1de4258230b2dab380fefbb7843401fd0

For model changes, easier to look at that file with overall changes than by commit. Other than that, easier to go by commit (see below the yml/json for commit summaries)

The new templates look like this

id: dcp_commercialoverlay
acl: public-read

attributes:
  name: DCP NYC Commercial Overlay Districts
  description: |
    Polygon features representing the within-tax-block limits for commercial overlay districts,
    as shown on the DCP zoning maps. Commercial overlay district designations are indicated in the OVERLAY attribute.
  url: https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-gis-zoning.page#metadata

ingestion:
  target_crs: EPSG:4326
  source:
    type: edm_publishing_gis_dataset
    name: dcp_commercial_overlays
  file_format:
    type: shapefile
    crs: EPSG:2263
  processing_steps:
  - name: rename_columns
    args:
      map: {"geom": "wkb_geometry"}

columns:
- id: overlay
  data_type: text
- id: shape_leng
  data_type: decimal
- id: shape_area
  data_type: decimal
- id: wkb_geometry
  data_type: geometry

And the computed config then looks like this

{
    "id": "dcp_commercialoverlay",
    "version": "20240831",
    "crs": "EPSG:4326",
    "attributes": {
        "name": "DCP NYC Commercial Overlay Districts",
        "description": "Polygon features representing the within-tax-block limits for commercial overlay districts,\nas shown on the DCP zoning maps. Commercial overlay district designations are indicated in the OVERLAY attribute.\n",
        "url": "https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-gis-zoning.page#metadata"
    },
    "archival": {
        "acl": "public-read",
        "raw_filename": "dcp_commercial_overlays.zip",
        "archival_timestamp": "2024-09-30T16:23:51.693603-04:00"
    },
    "ingestion": {
        "target_crs": "EPSG:4326",
        "file_format": {
            "crs": "EPSG:2263",
            "encoding": "utf-8",
            "type": "shapefile"
        },
        "processing_steps": [
            {
                "name": "reproject",
                "args": {
                    "target_crs": "EPSG:4326"
                }
            },
            {
                "name": "rename_columns",
                "args": {
                    "map": {
                        "geom": "wkb_geometry"
                    }
                }
            }
        ],
        "source": {
            "type": "edm_publishing_gis_dataset",
            "name": "dcp_commercial_overlays"
        }
    },
    "columns": [
        {
            "id": "overlay",
            "data_type": "text"
        },
        {
            "id": "shape_leng",
            "data_type": "decimal"
        },
        {
            "id": "shape_area",
            "data_type": "decimal"
        },
        {
            "id": "wkb_geometry",
            "data_type": "geometry"
        }
    ],
    "run_details": {
        "type": "manual",
        "runner": {
            "username": "Finn van Krieken"
        },
        "timestamp": "2024-09-30T16:23:51.693603-04:00",
        "logged": false
    }
}
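
The difference between the template's steps and the computed config's steps can be sketched roughly as follows. This is a minimal illustration, not the code in dcpy: the function name `compute_processing_steps` and the plain-dict representation are assumptions. The only behavior taken from the example above is that a `reproject` step targeting `target_crs` appears ahead of the steps declared in the template.

```python
from copy import deepcopy


def compute_processing_steps(ingestion: dict) -> list[dict]:
    """Return the template's steps with a default reproject step prepended.

    Hypothetical helper: mirrors the example above, where `target_crs` in the
    template results in a leading `reproject` step in the computed config.
    """
    # Copy so the template definition itself is never mutated.
    steps = deepcopy(ingestion.get("processing_steps", []))
    target_crs = ingestion.get("target_crs")
    if target_crs is not None:
        steps.insert(0, {"name": "reproject", "args": {"target_crs": target_crs}})
    return steps


template_ingestion = {
    "target_crs": "EPSG:4326",
    "file_format": {"type": "shapefile", "crs": "EPSG:2263"},
    "processing_steps": [
        {"name": "rename_columns", "args": {"map": {"geom": "wkb_geometry"}}}
    ],
}

computed_steps = compute_processing_steps(template_ingestion)
print([step["name"] for step in computed_steps])
```

Keeping the defaults out of the yml means the template only states intent (`target_crs`), while the computed config records every step that was actually run.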

Other than that, it's helpful to go by commit

  1. Some cleanup from "Tools to compare datasets, random ingest fixes" (#1155): getting test coverage up and fixing a random literal
  2. New ingest template that's been my guinea pig
  3. Move the current "ingest" templates to a "dev" folder. In the next PR, I want to add logic so that our cli calls use any ingest template that's found and otherwise fall back to library. I don't want to throw away these templates, which are sort of "in-progress", and I didn't want to update them all right now without validating the data
  4. add Attributes section to Template and Config models, falling in line a little bit with our product metadata
  5. use SortedBase across the models that get dumped to config.json for nice serialization. Make a few more submodels of config to simplify what's at the top level
  6. adjustments for tests from commit 5 (just seemed easier to have these separate). Also tweaked run.run to make it easier to test (and aim it somewhere other than the "production" template folder)
  7. For the four datasets that I've actually archived with ingest, move them to the production folder (and adjust to new structure)
  8. Add columns to ingest templates. Includes the most basic validation step: that all defined columns exist in the dataset. This is also where eventually column-specific checks could be specified
  9. Fix a bug - gdfs get unhappy if you try to naively rename the geometry column (as in, just treating it as any other pandas column)
  10. Add a default step that any dataset with geometry converts its geom to multi. Effectively same as calling ST_MULTI in postgis. See test cases for the utility for expected behavior. This seems like maybe a necessary evil for now - I'd prefer to keep geometry as they are, but this keeps things more consistent with library outputs (and makes sure things are consistent when we import to postgis, which can be more finicky/specific with types)
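
The column check from commit 8, that every column defined in the template exists in the dataset, could look something like the sketch below. The function name `validate_columns` and the error message are illustrative assumptions, not the code in this PR, and it takes plain column names so that it runs without pandas.

```python
def validate_columns(defined: list[str], actual: list[str]) -> None:
    """Raise if any column declared in the template is absent from the data."""
    actual_set = set(actual)
    missing = [col for col in defined if col not in actual_set]
    if missing:
        raise ValueError(
            f"Columns defined in template but not found in dataset: {missing}"
        )


template_columns = ["overlay", "shape_leng", "shape_area", "wkb_geometry"]

# Passes: in this sketch, extra columns in the data are allowed.
validate_columns(template_columns, template_columns + ["extra_col"])

# Fails: without the rename step, the geometry column is still called "geom".
try:
    validate_columns(template_columns, ["overlay", "shape_leng", "shape_area", "geom"])
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The second call also shows why step order matters: a check like this only makes sense after `rename_columns` has run.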

@fvankrieken fvankrieken force-pushed the fvk-ingest branch 2 times, most recently from 63262f4 to 4353246 Compare October 2, 2024 21:30
@fvankrieken fvankrieken changed the base branch from main to fvk-compare October 2, 2024 21:30
@fvankrieken fvankrieken force-pushed the fvk-ingest branch 3 times, most recently from 1a3fd42 to fffae94 Compare October 2, 2024 21:37
@fvankrieken fvankrieken force-pushed the fvk-compare branch 3 times, most recently from 67266f6 to 2091fc5 Compare October 8, 2024 14:03
Base automatically changed from fvk-compare to main October 8, 2024 14:19
@fvankrieken fvankrieken force-pushed the fvk-ingest branch 5 times, most recently from 8b162d1 to fab67bd Compare October 9, 2024 15:24

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 99.06542% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.84%. Comparing base (420a92a) to head (2b856c3).
Report is 2 commits behind head on main.

Files with missing lines          Patch %   Lines
dcpy/models/lifecycle/ingest.py   97.50%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1166      +/-   ##
==========================================
- Coverage   67.89%   67.84%   -0.05%     
==========================================
  Files         109      110       +1     
  Lines        5762     5847      +85     
  Branches      848      641     -207     
==========================================
+ Hits         3912     3967      +55     
- Misses       1721     1754      +33     
+ Partials      129      126       -3     


@fvankrieken fvankrieken force-pushed the fvk-ingest branch 6 times, most recently from 55f09f1 to f5dd475 Compare October 9, 2024 20:05
@fvankrieken fvankrieken changed the title from "Finalize/clean up ingest models" to "Finalize/clean up ingest models, add additional preprocessing" Oct 10, 2024
@fvankrieken fvankrieken marked this pull request as ready for review October 10, 2024 13:38
@fvankrieken fvankrieken force-pushed the fvk-ingest branch 3 times, most recently from 48e842a to 8858670 Compare October 10, 2024 14:07
description: str | None = None


class Template(BaseModel, extra="forbid"):
Contributor

this one will be eventually phased out, correct?

Contributor Author

No - this is the class that maps to the dataset template definitions

Contributor

Okay, I see. I'm a bit confused with the presence of both Config and this class... They are somewhat the same but not exactly

Contributor Author

Template exists to simply read in and validate the dataset definition

Config is the "computed" config of a dataset. It has the template definition (attributes, instructions to ingest, etc) as well as the details of archival (where to find the archived raw file, timestamp, etc) and any other fields that need to be determined at runtime.
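
A rough way to picture the split described here: one model for what is written by hand, one for what is computed at run time. The sketch below uses stdlib dataclasses instead of the pydantic models in this PR so that it is self-contained. The fields are abridged, and `compute_config` and its parameters are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Template:
    """What a person writes by hand in the yml definition (abridged)."""

    id: str
    acl: str
    target_crs: str | None = None


@dataclass
class Config:
    """Computed at run time from a Template; never written by hand."""

    id: str
    version: str
    crs: str | None
    acl: str
    raw_filename: str
    archival_timestamp: str


def compute_config(
    template: Template, version: str, raw_filename: str, now: datetime
) -> Config:
    # Simplification: crs is taken straight from the template's target_crs.
    return Config(
        id=template.id,
        version=version,
        crs=template.target_crs,
        acl=template.acl,
        raw_filename=raw_filename,
        archival_timestamp=now.isoformat(),
    )


template = Template(id="dcp_commercialoverlay", acl="public-read", target_crs="EPSG:4326")
config = compute_config(
    template,
    version="20240831",
    raw_filename="dcp_commercial_overlays.zip",
    now=datetime.now(timezone.utc),
)
print(asdict(config))
```

Nothing in `Config` needs to be validated as user input, since every value comes either from an already validated `Template` or from the run itself.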

Contributor

@sf-dcp sf-dcp Oct 15, 2024

I thought it's the Config class that's computed, not Template?

Edit: ignore my comment here. lol thought i was responding to the other thread

file_format: file.Format
processing_mode: str | None = None
processing_steps: list[PreprocessingStep] = []
crs: str | None = None
Contributor

is this attribute here to accommodate the old template?

Contributor Author

Nothing in this PR is really to accommodate anything old. These config objects are meant for reading data about an archived dataset, so with that I thought that crs was worthy of being a top-level field since it's quite relevant to how we use the data. Given that it's also stored in the parquet metadata, this is maybe up for debate

Contributor

I see. My thought is that crs should "live" in one place under ingestion to avoid 1) duplication and 2) any confusion about whether crs should or shouldn't be specified in both places.

Contributor Author

I should clarify again that this model is just "computed" - it's never going to be defined manually

Contributor

Got it. Then I will leave this to you and the group. I thought it was in the Template model - my bad

class Template(BaseModel, extra="forbid", arbitrary_types_allowed=True):
    """Definition of a dataset for ingestion/processing/archiving in edm-recipes"""

    id: str
    acl: recipes.ValidAclValues

    target_crs: str | None = None
    attributes: DatasetAttributes | None = None
Contributor

maybe make it a required attribute? To avoid creation of new templates without relevant dataset info

Contributor Author

That seems like the right choice

Contributor

@sf-dcp sf-dcp Oct 15, 2024

I haven't seen this file. Nice!

@sf-dcp
Contributor

sf-dcp commented Oct 15, 2024

Ok, finished my review! Love how the new dataset templates look, and the models representing them make so much more sense. Great work.

The first commit is a bit obscure to me since I missed the previous related PR - can't provide feedback

Questions:

  • ongoing convo above about Template vs Config
  • Could you please explain the need for the last commit creating a multi geometry? What is the benefit
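
On the multi-geometry question: commit 10 describes the step as effectively the same as calling ST_MULTI in postgis, which wraps single-part geometries in their multi-part type and leaves multi-part ones unchanged. The sketch below shows that behavior on GeoJSON-style dicts using only the standard library. It is an illustration under those assumptions, not the utility added in this PR, and `to_multi` is a hypothetical name.

```python
from __future__ import annotations

# Single-part GeoJSON types and their multi-part counterparts.
SINGLE_TO_MULTI = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}


def to_multi(geometry: dict | None) -> dict | None:
    """Wrap a single-part geometry in its multi-part type; pass others through."""
    if geometry is None:
        return None
    multi_type = SINGLE_TO_MULTI.get(geometry["type"])
    if multi_type is None:
        # Already multi-part (or a GeometryCollection): leave unchanged.
        return geometry
    # A multi-part geometry is a list of single-part coordinate arrays.
    return {"type": multi_type, "coordinates": [geometry["coordinates"]]}


square = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
}
multi_square = to_multi(square)
print(multi_square["type"])
```

After a step like this, a column that mixed Polygon and MultiPolygon rows holds a single geometry type, which is the consistency commit 10 mentions for imports into postgis.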

@alexrichey
Contributor

@fvankrieken I'd spent some time on this Thursday, and will continue today. Might prod you for a walkthrough later.

@alexrichey
Contributor

@fvankrieken One immediate thought: would it make sense to publish JSON schemas for these models? (similar to admin.ops.update_product_metadata_schema)

config = configure.get_config(
    dataset_id, version=version, mode=mode, template_dir=template_dir
)
transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
Member

@damonmcc damonmcc Oct 15, 2024

is this redundant since transform.validate_processing_steps is called during transform.preprocess?

and if we do wanna validate early, would it make sense to do it during configure.get_config or maybe read_template?

Member

talked about it and we'll keep it here. it's nicer to assemble these pieces at a high-level rather than have ingest modules importing each other
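
For context on what validating steps up front can buy, here is a minimal sketch of a validator that checks step names and arguments before any data is read. It is not the implementation in dcpy's transform module: the `STEPS` registry, the toy `rename_columns` step, and the return value are all assumptions made for illustration.

```python
import functools
import inspect
from typing import Callable


def rename_columns(rows: list[dict], map: dict[str, str]) -> list[dict]:
    """Toy step: rename keys in each row according to `map`."""
    return [{map.get(key, key): value for key, value in row.items()} for row in rows]


# Hypothetical registry of available step functions, keyed by step name.
STEPS: dict[str, Callable] = {"rename_columns": rename_columns}


def validate_processing_steps(dataset_id: str, steps: list[dict]) -> list[Callable]:
    """Check each step names a known function and its args fit the signature.

    All problems are collected and raised together, before any data is touched.
    """
    errors: list[str] = []
    compiled: list[Callable] = []
    for step in steps:
        args = step.get("args", {})
        func = STEPS.get(step["name"])
        if func is None:
            errors.append(f"unknown step '{step['name']}'")
            continue
        try:
            # None stands in for the data argument; only the args are checked.
            inspect.signature(func).bind(None, **args)
        except TypeError as err:
            errors.append(f"step '{step['name']}': {err}")
            continue
        compiled.append(functools.partial(func, **args))
    if errors:
        raise ValueError(f"Invalid processing steps for {dataset_id}: {errors}")
    return compiled


valid_steps = validate_processing_steps(
    "dcp_commercialoverlay",
    [{"name": "rename_columns", "args": {"map": {"geom": "wkb_geometry"}}}],
)
print(valid_steps[0]([{"geom": "POINT (0 0)"}]))
```

Calling a check like this from the top-level run function means a typo in a template fails before anything is downloaded, without the ingest modules needing to import each other.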

Contributor

@sf-dcp sf-dcp left a comment

addressed my question during a code walk-through. Great work!

@fvankrieken
Contributor Author

fvankrieken commented Oct 16, 2024

Notes from group chat

  • type signature for reprojection func
  • make attributes required

For follow-up pr

  • separate out default processing step logic into helper
  • clean up language "processing" vs "preprocessing" (make consistent)
  • raise not implemented for custom script
  • ability for steps to log information about themselves
  • publish json schema

Contributor

@alexrichey alexrichey left a comment

🎉

@fvankrieken fvankrieken merged commit 90497a2 into main Oct 16, 2024
20 checks passed
@fvankrieken fvankrieken deleted the fvk-ingest branch October 16, 2024 14:41