Finalize/clean up ingest models, add additional preprocessing #1166
fvankrieken merged 13 commits into main from
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1166      +/-   ##
==========================================
- Coverage   67.89%   67.84%   -0.05%
==========================================
  Files         109      110       +1
  Lines        5762     5847      +85
  Branches      848      641     -207
==========================================
+ Hits         3912     3967      +55
- Misses       1721     1754      +33
+ Partials      129      126       -3

☔ View full report in Codecov by Sentry.
description: str | None = None

class Template(BaseModel, extra="forbid"):
this one will eventually be phased out, correct?
No - this is the class that maps to the dataset template definitions
Okay, I see. I'm a bit confused by the presence of both `Config` and this class... They are somewhat the same, but not exactly.
`Template` exists simply to read in and validate the dataset definition.
`Config` is the "computed" config of a dataset. It has the template definition (attributes, instructions to ingest, etc.) as well as the details of archival (where to find the archived raw file, timestamp, etc.) and any other fields that need to be determined at runtime.
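For illustration, a rough sketch of that split. The fields below are simplified and partly hypothetical, not the actual dcpy models:

```python
# Rough sketch only: simplified, partly hypothetical fields, not the actual dcpy models.
from datetime import datetime

import yaml
from pydantic import BaseModel


class Template(BaseModel, extra="forbid"):
    """Read straight from a dataset definition yaml and validated, nothing more."""

    id: str
    acl: str
    attributes: dict | None = None   # name, description, url, ...
    ingestion: dict                  # source, file_format, processing_steps, ...


class ArchivalMetadata(BaseModel):
    raw_filename: str
    archival_timestamp: datetime


class Config(BaseModel):
    """Computed at runtime: the template definition plus archival/run details."""

    id: str
    version: str
    attributes: dict | None = None
    ingestion: dict
    archival: ArchivalMetadata       # only known once the raw file is archived


def read_template(path: str) -> Template:
    with open(path) as f:
        return Template(**yaml.safe_load(f))
```

The yaml never contains archival or run details, which is why `Config` is computed rather than written by hand.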
I thought it's the `Config` class that's computed, not `Template`?
Edit: ignore my comment here. lol, thought I was responding to the other thread
file_format: file.Format
processing_mode: str | None = None
processing_steps: list[PreprocessingStep] = []
crs: str | None = None
is this attribute here to accommodate the old template?
Nothing in this PR is really to accommodate anything old. These config objects are meant for reading data about an archived dataset, so with that in mind I thought `crs` was worthy of being a top-level field, since it's quite relevant to how we use the data. Given that it's also stored in the parquet metadata, this is maybe up for debate.
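For context, a small sketch of the two places a consumer could read the CRS from, assuming a geoparquet output plus the computed config.json next to it (file names are hypothetical):

```python
# Sketch: two places a consumer could read the CRS from. Paths are hypothetical.
import json

import geopandas as gpd

# from the archived geoparquet itself
gdf = gpd.read_parquet("dcp_commercialoverlay.parquet")
print(gdf.crs)                      # e.g. EPSG:4326

# from the computed config written alongside it
with open("config.json") as f:
    print(json.load(f)["crs"])      # the top-level field discussed here
```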
I see. My thought is that `crs` should "live" in one place, under ingestion, to avoid 1) duplication and 2) any confusion about whether `crs` should or shouldn't be specified in both places.
I should clarify again that this model is just "computed" - it's never going to be defined manually
Got it. Then I will leave this to you and the group. I thought it was in the Template model - my bad
dcpy/models/lifecycle/ingest.py (outdated diff)
acl: recipes.ValidAclValues

target_crs: str | None = None
attributes: DatasetAttributes | None = None
Maybe make it a required attribute? To avoid creating new templates without relevant dataset info.
That seems like the right choice
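A tiny sketch of what that change amounts to in pydantic terms (model shapes simplified and hypothetical):

```python
# Sketch of the suggestion: dropping the "| None = None" default makes the field
# required, so a template without dataset info fails validation. Models simplified.
from pydantic import BaseModel, ValidationError


class DatasetAttributes(BaseModel):
    name: str
    description: str | None = None
    url: str | None = None


class Template(BaseModel, extra="forbid"):
    id: str
    attributes: DatasetAttributes   # required: no default


try:
    Template(id="dcp_commercialoverlay")
except ValidationError as e:
    print(e)   # reports that "attributes" is a required field
```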
I haven't seen this file. Nice!
Ok, finished my review! Love how the new dataset templates look, and the models representing them make so much more sense. Great work. The first commit is a bit obscure to me since I missed the previous related PR, so I can't provide feedback on it.
Questions:
@fvankrieken I spent some time on this Thursday and will continue today. Might prod you for a walkthrough later.
@fvankrieken One immediate thought: would it make sense to publish JSON schemas for these models? (similar to
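For reference, a minimal sketch of what exporting a JSON schema from one of these pydantic models could look like (pydantic v2 assumed; the output path is made up and the import path is assumed from the file under review):

```python
# Sketch of publishing a JSON schema from one of these models with pydantic v2.
# Output path is hypothetical; import path assumed from dcpy/models/lifecycle/ingest.py.
import json

from dcpy.models.lifecycle.ingest import Template

with open("schemas/ingest_template.schema.json", "w") as f:
    json.dump(Template.model_json_schema(), f, indent=2)
```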
config = configure.get_config(
    dataset_id, version=version, mode=mode, template_dir=template_dir
)
transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
Is this redundant, since `transform.validate_processing_steps` is called during `transform.preprocess`?
And if we do wanna validate early, would it make sense to do it during `configure.get_config` or maybe `read_template`?
Talked about it and we'll keep it here. It's nicer to assemble these pieces at a high level rather than have ingest modules importing each other.
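A sketch of that high-level assembly, roughly as described; the module layout and the `run()` signature are assumptions, only the two calls are taken from the snippet above:

```python
# Sketch: the run entrypoint wires configure and transform together so those
# modules never import each other. Layout and signature are assumed, not from the PR.
from dcpy.lifecycle.ingest import configure, transform


def run(dataset_id: str, version: str | None = None, mode: str | None = None,
        template_dir: str | None = None) -> None:
    config = configure.get_config(
        dataset_id, version=version, mode=mode, template_dir=template_dir
    )
    # fail fast on unknown/invalid steps before downloading or archiving anything
    transform.validate_processing_steps(config.id, config.ingestion.processing_steps)
    ...  # then: archive the raw file, transform.preprocess(...), etc.
```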
sf-dcp left a comment:
Addressed my question during a code walk-through. Great work!
Notes from group chat
For follow-up PR
I'd say the biggest thing for review here is still the models. See changes here https://github.com/NYCPlanning/data-engineering/pull/1166/files#diff-6fffbad770f67b00cdc414260a7922b1de4258230b2dab380fefbb7843401fd0
For model changes, it's easier to look at that file with the overall changes than to go commit by commit. Other than that, it's easier to go by commit (commit summaries are below the yml/json).
The new templates look like this
And the computed config then looks like this
{ "id": "dcp_commercialoverlay", "version": "20240831", "crs": "EPSG:4326", "attributes": { "name": "DCP NYC Commercial Overlay Districts", "description": "Polygon features representing the within-tax-block limits for commercial overlay districts,\nas shown on the DCP zoning maps. Commercial overlay district designations are indicated in the OVERLAY attribute.\n", "url": "https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-gis-zoning.page#metadata" }, "archival": { "acl": "public-read", "raw_filename": "dcp_commercial_overlays.zip", "archival_timestamp": "2024-09-30T16:23:51.693603-04:00" }, "ingestion": { "target_crs": "EPSG:4326", "file_format": { "crs": "EPSG:2263", "encoding": "utf-8", "type": "shapefile" }, "processing_steps": [ { "name": "reproject", "args": { "target_crs": "EPSG:4326" } }, { "name": "rename_columns", "args": { "map": { "geom": "wkb_geometry" } } } ], "source": { "type": "edm_publishing_gis_dataset", "name": "dcp_commercial_overlays" } }, "columns": [ { "id": "overlay", "data_type": "text" }, { "id": "shape_leng", "data_type": "decimal" }, { "id": "shape_area", "data_type": "decimal" }, { "id": "wkb_geometry", "data_type": "geometry" } ], "run_details": { "type": "manual", "runner": { "username": "Finn van Krieken" }, "timestamp": "2024-09-30T16:23:51.693603-04:00", "logged": false } }Other than that, it's helpful to go by commit
Other than that, it's helpful to go by commit:

- `Attributes` section to `Template` and `Config` models, falling in line a little bit with our product metadata
- `SortedBase` across the models that get dumped to `config.json` for nice serialization. Make a few more submodels of config to simplify what's at the top level
- `run.run` to make it easier to test (and aim it somewhere other than the "production" template folder)
- `ST_MULTI` in postgis (see the sketch after this list). See test cases for the utility for expected behavior. This seems like maybe a necessary evil for now - I'd prefer to keep geometry as they are, but this keeps things more consistent with library outputs (and makes sure things are consistent when we import to postgis, which can be more finicky/specific with types)
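A sketch of what an `ST_Multi`-style coercion step does, written here in geopandas/shapely terms for illustration; the actual utility in this PR runs in postgis and may differ:

```python
# Sketch of ST_Multi-style coercion: promote single-part geometries so a column
# shares one multi type (illustrative only; the PR's utility lives in postgis).
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon


def to_multi(geom):
    """Wrap single polygons as MultiPolygon; leave everything else as-is."""
    if isinstance(geom, Polygon):
        return MultiPolygon([geom])
    return geom


gdf = gpd.GeoDataFrame(geometry=[Polygon([(0, 0), (1, 0), (1, 1)])], crs="EPSG:2263")
gdf["geometry"] = gdf.geometry.apply(to_multi)
print(gdf.geom_type.unique())   # ['MultiPolygon']
```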