Preprocessing pipeline for TDX Hydro Files #8

Open
8 tasks done
ptomasula opened this issue Jul 23, 2024 · 5 comments
Assignees
Labels
enhancement New feature or request

Comments

@ptomasula
Contributor

ptomasula commented Jul 23, 2024

Summary

Much of the initial groundwork for processing the TDX Hydro files has been laid under issues #2, #3, #4 and with PRs #5 and #6. It's time to stitch that work together into a processing pipeline that modifies the raw TDX Hydro files by dropping and renaming fields, creating global LINKNO/streamID, and adding the modified nested set index information.

Closure Criteria

  • Processing pipeline developed to do the following:
    • Convert LINKNO/streamID to a globally unique variety
    • Extraneous fields have been dropped from the streamnet and basins layers (see this method)
    • Modified nested set index information has been added to the basin layers
    • Streams with no basin geometries identified
  • Pipeline run on all downloaded TDX Hydro files
  • Processed TDX Hydro data have been output in compressed geoparquet, as recommended in Develop Write Example for pyogroio #4
  • Processed files uploaded to file exchange location (AWS S3, sharefile, etc.)
@aufdenkampe
Member

aufdenkampe commented Jul 29, 2024

@rajadain, we have finalized our processing pipeline for the TDX Hydro stream network ('streamnet') and corresponding basins ('streamreach_basins') files. We are presently running the full set of files for the globe, which should be completed this afternoon.

In the meantime, here is an example set of three GeoParquet files that will be produced for each of the 62 TDX Hydro Regions (the regions are provided in tdx_regions.parquet):

These supersede the files we shared with you two weeks ago under #4 (comment).

These files have been substantially compressed vs NGA's GeoPackage files.

@rajadain, could you work with your team to:

  • Ingest the 'streamnet' files into your vector tiling pipeline.
  • Explore approaches to find/select a LINKNO from a Lat/Lon (user clicking on map) and the 'streamreach_basins' geometries. This will likely first require determining the
    TDX Hydro Region from tdx_regions.parquet.
  • Provide us a way to get these 3x62 files to you.

We are getting close to delivering this to you:

All of the above will likely benefit from using a parallel set of simplified geometries, which we are also exploring.

For now, read these files using the gpd.read_parquet(geoparquet_path) method, but we can speed up reading the geometry fields 2x by reading them as pyarrow.Tables, as described in #1 (comment). Note that either way, you need to have GDAL 3.9 installed, as described in that comment.

@aufdenkampe
Member

aufdenkampe commented Oct 14, 2024

from @ptomasula's Oct 4 email to @rajadain:

We have uploaded parquet files with the modified nested set index (MNSI) information for 61 of the 62 TDXHydro regions to that S3 bucket. The missing files (5020054880) are for a region in Australia and failed during our initial run of the processing pipeline. We still wanted to get the bulk of the data over to you, since it will likely take some time to download and integrate into the system. We’ll investigate that last file next week and get it over to you soon.

Anthony outlined a fair bit of this under this issue when he provided you with an example set of files, but I think it’s worth repeating here. For each TDXHydro region there are 3 files:

  • ‘TDX_streamnet_mnsi’ contains the stream reach polylines and is indexed by the LINKNO field.
  • ‘TDX_streamreach_basins_mnsi’ contains the full-resolution basin polygons. This is also indexed by the LINKNO field (renamed from ‘streamID’ in the original dataset to match naming convention).
  • ‘TDX_streams_no_basin’ contains the streamnet rows for which the LINKNO does not have a corresponding basin geometry.

In addition to the TDXHydro data fields, these files also each contain the MNSI fields. We’ll send a follow-up email with additional information and instructions on how to leverage the fields for delineation algorithms, but here is a brief explanation of the fields we have added.

  • ROOT_ID: identifies the most downstream stream reach or point of confluence for the watershed. This is useful in differentiating the watersheds when interpreting the rest of the MNSI fields.
  • DISCOVER_TIME: indicates the number of iterations in a depth-first search to reach the stream reach.
  • FINISH_TIME: indicates the number of iterations to revisit the stream reach.
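
To make the discover/finish idea concrete, here is a minimal sketch of a textbook depth-first search that assigns these three fields. It is not the project's actual MNSI code: the function name `nested_set_index`, the `-1` outlet sentinel, the sorted visiting order, and the choice to restart the counter for each watershed are all assumptions made for illustration, and the "modified" index may count iterations differently.

```python
from collections import defaultdict

NO_LINK = -1  # assumed sentinel: DSLINKNO of a watershed outlet


def nested_set_index(dslinkno):
    """Return {LINKNO: {ROOT_ID, DISCOVER_TIME, FINISH_TIME}}.

    dslinkno maps each LINKNO to the LINKNO immediately downstream of it.
    """
    upstream = defaultdict(list)  # downstream link -> links draining into it
    roots = []
    for link, ds in dslinkno.items():
        if ds == NO_LINK:
            roots.append(link)
        else:
            upstream[ds].append(link)

    index = {}
    for root in sorted(roots):
        clock = 1  # assumption: the counter restarts for every watershed
        index[root] = {"ROOT_ID": root, "DISCOVER_TIME": clock}
        # Iterative DFS; each stack entry is (link, iterator over upstream links).
        stack = [(root, iter(sorted(upstream[root])))]
        while stack:
            link, children = stack[-1]
            child = next(children, None)
            clock += 1
            if child is None:
                # All upstream reaches visited: we are back at this reach.
                index[link]["FINISH_TIME"] = clock
                stack.pop()
            else:
                index[child] = {"ROOT_ID": root, "DISCOVER_TIME": clock}
                stack.append((child, iter(sorted(upstream[child]))))
    return index


# Toy network: reaches 4 and 5 drain to 2; reaches 2 and 3 drain to outlet 1.
toy_index = nested_set_index({1: -1, 2: 1, 3: 1, 4: 2, 5: 2})
for linkno in sorted(toy_index):
    print(linkno, toy_index[linkno])
```

Under this convention, every reach upstream of a given reach has a DISCOVER_TIME that falls between that reach's DISCOVER_TIME and FINISH_TIME, which is what lets delineation be done with a range comparison instead of a network traversal.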

For the basin files, there are also two additional fields to support pre-dissolving basin geometries and improve delineation performance.

  • DISSOLVE_ROOT_ID: identifies the most downstream elements of a subshed (grouping of basins to pre-dissolve).
  • ELEMENT_COUNT: indicates the number of upstream basins for a stream reach

Lastly, we have converted the index values in LINKNO, DSLINKNO, USLINKNO1, and USLINKNO2 into a globally unique version. You may recall that the index as provided by TDXHydro is only unique for a given region; however, we need a globally unique identifier for the entire dataset. We have applied logic based on the Geoglows V2 approach, using the following equation: LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000).
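
As a rough illustration of that equation, the sketch below applies it to the four link fields of a single record. The helper names (`to_global_linkno`, `globalize_row`) are hypothetical, and two things are assumptions rather than facts from this thread: that `-1` marks a missing upstream/downstream link and should be left untouched, and that the region ID 5020054880 mentioned above can stand in for TDX_HEADER_NUMBER.

```python
OFFSET = 10_000_000  # multiplier from the equation above
NO_LINK = -1  # assumed sentinel for "no upstream/downstream link"
LINK_FIELDS = ("LINKNO", "DSLINKNO", "USLINKNO1", "USLINKNO2")


def to_global_linkno(linkno, tdx_header_number):
    """LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)."""
    if linkno == NO_LINK:
        return NO_LINK  # assumption: keep the sentinel instead of offsetting it
    return linkno + tdx_header_number * OFFSET


def globalize_row(row, tdx_header_number):
    """Return a copy of one record with all four link fields made global."""
    out = dict(row)
    for field in LINK_FIELDS:
        out[field] = to_global_linkno(row[field], tdx_header_number)
    return out


# Illustrative headwater reach 123 draining to reach 120 in region 5020054880.
regional = {"LINKNO": 123, "DSLINKNO": 120, "USLINKNO1": -1, "USLINKNO2": -1}
global_row = globalize_row(regional, 5020054880)
print(global_row)
```

With a ten-digit header number the resulting IDs are on the order of 5e16, which still fits in a signed 64-bit integer column. The equation only yields unique IDs if every regional LINKNO is below 10,000,000.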

@rajadain
Member

rajadain commented Oct 15, 2024

@ptomasula @aufdenkampe

Thanks for the info. I was able to ingest the GeoParquet files into PostGIS after some trial and error.

I ingested the TDX_streamnet_mnsi files to a tdxstreams table which will be used for analyzing streams, and for visualizing blue lines (still working on styling updates recommended in WikiWatershed/model-my-watershed#3625 (comment)). I've added an index on stream_order (renamed from strmorder for consistency with NHD tables) to help with the visualization.

I ingested the TDX_streamreach_basins_mnsi to a tdxbasins table, which I imagine will be used for Global RWD based on a forthcoming algorithm. We may potentially also use these basins as Global HUC equivalents, perhaps.

Here's a couple questions I had:

  1. What should I do with the TDX_streams_no_basin dataset? I have not yet ingested it. Should I add these to the tdxstreams table?
  2. What additional fields (e.g., ROOT_ID, LINKNO, etc.) should we add indexes to?

@aufdenkampe
Member

@rajadain, that's great news.

LINKNO serves as the primary key for all tables, so it should definitely be indexed or possibly even get set to the Feature ID (if that is a thing in PostGIS).

ROOT_ID is used for quickly subsetting the dataset for delineation (i.e. find nearest LINKNO and then select all records that share the same ROOT_ID). So it should probably also be indexed (although I'm not as familiar with PostgreSQL indexing).
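
Here is a minimal sketch of that two-step selection on plain Python records, to show why both ROOT_ID and the time fields matter. The function name `upstream_linknos` and the toy values are hypothetical, and it assumes the classic nested-set convention in which DISCOVER_TIME restarts for each ROOT_ID and every upstream reach has a DISCOVER_TIME between the outlet's DISCOVER_TIME and FINISH_TIME; the notebook for this project is the authority on the real comparison.

```python
def upstream_linknos(records, outlet_linkno):
    """Return the LINKNOs at or upstream of outlet_linkno, sorted."""
    by_link = {r["LINKNO"]: r for r in records}
    outlet = by_link[outlet_linkno]

    # Step 1: cheap subset to the outlet's watershed via ROOT_ID.
    watershed = [r for r in records if r["ROOT_ID"] == outlet["ROOT_ID"]]

    # Step 2: nested-set range test within that watershed (assumed convention).
    lo, hi = outlet["DISCOVER_TIME"], outlet["FINISH_TIME"]
    return sorted(r["LINKNO"] for r in watershed if lo <= r["DISCOVER_TIME"] <= hi)


# Two toy watersheds (ROOT_ID 1 and ROOT_ID 10) with overlapping time values.
records = [
    {"LINKNO": 1, "ROOT_ID": 1, "DISCOVER_TIME": 1, "FINISH_TIME": 10},
    {"LINKNO": 2, "ROOT_ID": 1, "DISCOVER_TIME": 2, "FINISH_TIME": 7},
    {"LINKNO": 4, "ROOT_ID": 1, "DISCOVER_TIME": 3, "FINISH_TIME": 4},
    {"LINKNO": 5, "ROOT_ID": 1, "DISCOVER_TIME": 5, "FINISH_TIME": 6},
    {"LINKNO": 3, "ROOT_ID": 1, "DISCOVER_TIME": 8, "FINISH_TIME": 9},
    {"LINKNO": 10, "ROOT_ID": 10, "DISCOVER_TIME": 1, "FINISH_TIME": 4},
    {"LINKNO": 11, "ROOT_ID": 10, "DISCOVER_TIME": 2, "FINISH_TIME": 3},
]
selected = upstream_linknos(records, 2)
print(selected)
```

Reach 11 has a DISCOVER_TIME inside the range for reach 2 but belongs to a different watershed, so without the ROOT_ID filter it would be selected by mistake. In a database, the same pattern would be an equality filter on ROOT_ID plus a range filter on DISCOVER_TIME, which is why indexing both is worth considering.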

The geometries in the TDX_streamreach_basins_mnsi.parquet are reach-level, so more equivalent to NHDplus catchments.

We developed the DISSOLVE_ROOT_ID to serve a similar purpose as a HUC. There are typically 200 LINKNO records for every unique DISSOLVE_ROOT_ID. So DISSOLVE_ROOT_ID should also be indexed. Our plan is to create a new set of simplified geometries for these, but I think we wanted to explore performance with the raw data first to decide if this was necessary.

rajadain added a commit to WikiWatershed/model-my-watershed that referenced this issue Oct 15, 2024
This is a new delivery from LimnoTech, from
WikiWatershed/global-hydrography#8
aufdenkampe added a commit that referenced this issue Oct 22, 2024
@aufdenkampe
Member

@rajadain, please see our new example notebook, examples/5_DelineateWatershed.ipynb, for a walk-through on how to use our new fields for watershed delineation.

In my last commit, 3d7c0c2, I also demonstrated how to use the DISSOLVE_ROOT_ID and TopoSimplify to create HUC-like boundaries that could be used as an intermediate for rapid unions of basin polygons into a watershed boundary, if necessary.

Also, when using the gdf.dissolve() function, I found an 18.5x speedup with the method="coverage" option, optimized for non-overlapping polygons. I confirmed that this is appropriate for our dataset as it does not produce any invalid geometries.

rajadain added a commit to WikiWatershed/model-my-watershed that referenced this issue Oct 28, 2024
This is a new delivery from LimnoTech, from
WikiWatershed/global-hydrography#8
aufdenkampe added a commit that referenced this issue Oct 29, 2024
To for rapid delineation of large watersheds. Last step is to wrap into a package function. #8
aufdenkampe added a commit that referenced this issue Nov 5, 2024
For #8 & #9, working toward a new function that automatically uses predissolved Hydro Units