Add support for pyarc2stac based ingest#361

Merged
kyle-lesinger merged 25 commits into dev from disasterDag on Aug 22, 2025
Conversation

@kyle-lesinger (Member) commented Apr 25, 2025

Summary:

1.) Create a DAG that creates a collection using pyarc2stac
2.) Make this DAG schedulable
3.) Install pyarc2stac within airflow_worker and airflow_services inside a virtual environment (because of a conflict with a library used by the noaa-hrr stac tools dag). We should do this for all the dags: libraries needed only by a specific dag belong in that dag's virtual environment.

PR Checklist

  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

@slesaad (Member) left a comment

@kyle-lesinger thanks for starting the work on this.

Ideally, we want a generic DAG (that has nothing to do with disasters) that isn't always running on a schedule.

Scenario: I want to ingest an ArcGIS dataset for the air quality (or some other) portal and run it only once. The ideal solution should also cover this scenario. It's kinda getting there with you reading the json from an s3 bucket, but not completely.

If you don't know this already, our Airflow instance comes with an S3 bucket that saves the dags in the /dags folder; the bucket is available through the env var MWAA_STAC_CONF["EVENT_BUCKET"]. The same bucket holds scheduled datasets in the /collections folder: Airflow knows to look in that folder, read the JSON from there, and, based on the schedule provided, run the discover dag for the dataset on schedule. This feature was added in #155; you can check it out there.

Here's what I'd do:

  1. remove all references to "disasters" - we want a generic solution that works for everything
  2. update the dag so that any config can be run manually
  3. handle the scheduled workflow for pyarc2stac by updating the generate_dags function

i'd accept an input config that looks something like this:

{
  "url": "https://arcgis.url/no2/imageserver",
  "id": "no2",
  "description": "description of no2 data",
  ....(any other stac collection compatible properties that you'd like to overwrite in the collection created by pyarc2stac)
}
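For illustration, applying such a config on top of a pyarc2stac-generated collection could look like this (a minimal sketch; the helper name, the RESERVED_KEYS set, and the sample dicts are assumptions, not the PR's actual code):

```python
# Hypothetical helper: any STAC-compatible property in the input config
# overrides the corresponding field of the generated collection.
RESERVED_KEYS = {"url"}  # consumed by the DAG itself, not a collection field

def apply_config_overrides(collection: dict, config: dict) -> dict:
    merged = dict(collection)
    for key, value in config.items():
        if key not in RESERVED_KEYS:
            merged[key] = value
    return merged

generated = {"id": "generated-id", "description": "auto", "license": "proprietary"}
config = {
    "url": "https://arcgis.url/no2/imageserver",
    "id": "no2",
    "description": "description of no2 data",
}
merged = apply_config_overrides(generated, config)
# merged["id"] == "no2"; "url" is not copied into the collection
```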

let me know if you wanna meet in person to go over the comments.


# veda_pyarc2stac_ingest
scheduled_pyarcstac_configs = filter_configs_by_dag(collection_configs, "veda_pyarc2stac_ingest")
Member:

where is this name veda_pyarc2stac_ingest defined? 🤔

Member Author:

It's located within the veda-tf-state-shared/collections/veda-pyarc2stac.json as a key/value pair

Member:

Ah, I thought the mapping to a dag was automatic based on the dag ID rather than manual. Maybe we can update this, though: instead of having three different for loops, have one and just map the configs to the respective dag based on the dag ID.

I was wondering if you're missing something like this for the pyarc2stac dag

get_discover_dag(id="veda_discover", event={})

Member Author:

@slesaad I've now created a mapping function for each pipeline type. Additionally, I've added get_ingest_pyarc2stac_dag(id="veda_pyarc2stac_ingest", event={}) per recommendation to account for manual runs.

Comment on lines +105 to +116
# Default from the AWS .json file. If the value is an empty string, it will
# default to the pyarc2stac-generated value.
keys_to_overwrite = ["id", "title", "description", "license"]
for key in keys_to_overwrite:
    event_value = event.get(key)
    if isinstance(event_value, str) and event_value.strip():
        collection[key] = event_value


# Overwrite collection values with template configuration values. This only
# overwrites if the value is a non-empty string within the template_conf file.
for key, value in template_conf.items():
    if isinstance(value, str) and value.strip():
        collection[key] = value
Member:

i'm unsure why two of these are needed

Member Author (kyle-lesinger, May 2, 2025):

To increase the flexibility of the code, these cover the options to a.) use only information from the AWS .json file, and b.) use information from a manual trigger. pyarc2stac generates its own information, but having only the .json file content overwrite the pyarc2stac outputs would not account for manual triggering and changes to the configuration file.

Just for clarity: the "event" comes from the .json file, while the "template_conf" comes from SM2A Airflow.

Member:

not sure if that's needed. i think the veda_discover_pipeline does the same thing without needing to account for the two different options, can you confirm if this is absolutely needed?

Member Author (kyle-lesinger, May 8, 2025):

I've added new functionality to allow for a specific precedence order of collection values:

  • Manual config triggering = precedence 1
  • AWS .json config = precedence 2
  • pyarc2stac outputs = precedence 3

    for key in collection.keys():
        collection[key] = (template_conf.get(key) or event.get(key) or collection[key])
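The precedence above can be sketched end to end (the sample dicts are illustrative; empty strings are falsy, so `or` falls through to the next source):

```python
# Precedence: manual trigger conf (1) > AWS .json event (2) > pyarc2stac output (3).
collection = {"id": "auto-id", "title": "auto title", "license": "proprietary"}
event = {"id": "no2", "title": "", "license": ""}                 # AWS .json config
template_conf = {"id": "", "title": "Manual NO2", "license": ""}  # manual trigger

for key in collection.keys():
    collection[key] = template_conf.get(key) or event.get(key) or collection[key]

# collection == {"id": "no2", "title": "Manual NO2", "license": "proprietary"}
```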

Comment on lines +41 to +42
if prefix == "pyarc2stac":
    id = f"{prefix}-{collection['id']}"
Member:

is this really necessary? why not follow the pattern that's already there? (then you won't need the dag_config values to be a tuple) and if you feel strongly, update all of them to use the collection id?

Member Author (kyle-lesinger, May 6, 2025):

It's only necessary if we want the names of the collections being ingested to be understandable within Airflow. If we only set it to id = f"{prefix}-{file_name}", the dags will be created as pyarc2stac-veda-pyarcstac-1, pyarc2stac-veda-pyarcstac-2, pyarc2stac-veda-pyarcstac-3, etc.

It only helps reduce debugging overhead if one of the ingests fails, but it's not strictly needed. So I'll let you decide what works best; just let me know.
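For illustration, the naming choice being discussed could be factored into a helper like this (hypothetical function, not the PR's exact code):

```python
def dag_id_for(prefix: str, file_name: str, collection: dict) -> str:
    # Use the collection id for pyarc2stac ingests so the Airflow UI shows
    # e.g. "pyarc2stac-no2" instead of "pyarc2stac-veda-pyarcstac-1".
    if prefix == "pyarc2stac":
        return f"{prefix}-{collection['id']}"
    return f"{prefix}-{file_name}"

# dag_id_for("pyarc2stac", "veda-pyarcstac-1", {"id": "no2"}) -> "pyarc2stac-no2"
```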

Member Author:

[screenshot: resulting DAG names in the Airflow UI]

Member:

that makes sense, let's use this pattern for the others too?


dag_configs = {
    "veda_discover": (get_discover_dag, "discover"),
Member:

you don't really need the values to be a tuple. my_function.__name__ gives you the name of the function as a string, and you should be able to do any conditional based on that

Member Author:

Yes, that makes sense. I was trying to allow for different naming conventions to be created, but I'll simply draw the name from the string.

Member:

you can also use the dictionary key itself (dag_key) - has the same information

Member Author (kyle-lesinger, May 8, 2025):

Complete. The final mapping is:

        "veda_discover":          get_discover_dag,
        "veda_ingest_vector":     get_ingest_vector_dag,
        "veda_pyarc2stac_ingest": get_ingest_pyarc2stac_dag,
    }

And the name comes directly from the function's __name__.
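The resulting dispatch can be sketched with stub factories (the stubs stand in for the real Airflow DAG factory functions, which are not shown in this thread):

```python
# Stub factories standing in for the real DAG factory functions.
def get_discover_dag(id, event): return ("discover", id, event)
def get_ingest_vector_dag(id, event): return ("vector", id, event)
def get_ingest_pyarc2stac_dag(id, event): return ("pyarc2stac", id, event)

dag_configs = {
    "veda_discover":          get_discover_dag,
    "veda_ingest_vector":     get_ingest_vector_dag,
    "veda_pyarc2stac_ingest": get_ingest_pyarc2stac_dag,
}

# One loop replaces the three per-pipeline loops: the dict key (dag_key)
# identifies the pipeline, and factory.__name__ is available for conditionals.
for dag_key, factory in dag_configs.items():
    dag = factory(id=dag_key, event={})
```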

@slesaad slesaad changed the title Disaster dag ingest - addition of pyarc2stac Add support for pyarc2stac based ingest May 6, 2025
"stac_version": "1.0.0",
"description": "collection description",
"data_type": "",
"is_periodic": true,
Contributor:

Note: We may need to add a prefix here and in the template. The dashboard-prefixed properties dashboard:is_periodic and dashboard:time_density have been handled in different ways: in our deprecated workflows API, is_periodic is an alias for dashboard:is_periodic, but I don't think we support that alias right now.

I will raise this in tag-up and follow up here; I think we may need to make adjustments to all of our DAGs that generate collection metadata (not just this new DAG).
cc: @slesaad

Member Author:

Added the dashboard prefix to the config file:

template_conf = {
    "url": "",
    "id": "",
    "title": "",
    "stac_version": "",
    "description": "",
    "data_type": "",
    "license": "",
    "dashboard:is_periodic": "",
    "dashboard:time_density": ""
}
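If the bare alias ever needs supporting, normalization could look like this (a hypothetical helper; the alias behaviour of the deprecated workflows API described above is the assumption here):

```python
# Map bare aliases onto the dashboard-prefixed property names.
DASHBOARD_ALIASES = {
    "is_periodic": "dashboard:is_periodic",
    "time_density": "dashboard:time_density",
}

def normalize_dashboard_keys(conf: dict) -> dict:
    # Rename aliased keys; leave already-prefixed and unrelated keys untouched.
    out = {}
    for key, value in conf.items():
        out[DASHBOARD_ALIASES.get(key, key)] = value
    return out

normalized = normalize_dashboard_keys({"is_periodic": True, "license": "MIT"})
# {"dashboard:is_periodic": True, "license": "MIT"}
```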

@slesaad (Member) commented May 16, 2025

@kyle-lesinger looks pretty good to me, let's resolve the conflicts and then we'll merge

@kyle-lesinger kyle-lesinger requested a review from slesaad July 28, 2025 16:17
@kyle-lesinger (Member Author):

@slesaad the conflicts have been resolved. Please re-review at your earliest convenience.


# Handle temporal extent separately if it exists in configs.
# This is useful for items with no temporal extent in the initial pyarc2stac item creation
if "temporal" in filtered_event:
Member Author:

For this part, we need to replace the temporal extent for those collections that don't have a date. Otherwise they produce a dashboard:is_timeless property, which overrides dashboard:time_density. Some of the NASA Disasters data has this issue.
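A sketch of that replacement on a STAC collection dict (the extent/interval key names follow the STAC collection spec; the helper name and the shape of the "temporal" config value are assumptions):

```python
def apply_temporal_override(collection: dict, filtered_event: dict) -> dict:
    # Replace the temporal extent when the config supplies one, e.g. for
    # collections where pyarc2stac could not derive any dates.
    if "temporal" in filtered_event:
        collection.setdefault("extent", {})["temporal"] = {
            "interval": [filtered_event["temporal"]]
        }
    return collection

coll = {"extent": {"spatial": {"bbox": [[-180, -90, 180, 90]]}}}
coll = apply_temporal_override(coll, {"temporal": ["2020-01-01T00:00:00Z", None]})
```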

@kyle-lesinger kyle-lesinger merged commit 314a0e6 into dev Aug 22, 2025
4 checks passed
@kyle-lesinger kyle-lesinger deleted the disasterDag branch August 22, 2025 14:00