Validate API results for grid cell area weighted zonal statistics #251
Comments
I started working on this here: https://github.com/NASA-IMPACT/veda-docs/tree/ab/zonal-stats-validation. I would like to add a few datasets to test from the original VEDA, perhaps from the boreal data. Since there are already some differences, it may be necessary to dig deeper into the internals of each method. I am also interested in a comparison with exactextract.
I started to look at datasets in VEDA which are not in EPSG:4326. The notebook as-is requires the 4326 projection. I looked into clipping the dataset in its own projection (such as 3857, by reprojecting the GeoJSON, which is in 4326, to the same projection as the original dataset) and using rad2deg as a way to get the grid-cell weights for different conformal projections. I'm not sure about the results (see the branch above, notebook veda-docs/notebooks/tutorials/veda-zonal-statistics-validation.ipynb) but will take another look tomorrow. Any pointers are welcome.
I have been comparing the results of the rioxarray clip method, titiler, rio_tiler, and the methods @j08lue built against simply calculating an unweighted mean, in this notebook: https://github.com/NASA-IMPACT/veda-docs/blob/ab/zonal-stats-validation/notebooks/tutorials/different-stats-methods.ipynb. To assess some edge cases, I included two datasets with a CRS other than 4326 (3857 and 3395). I have also included a notebook to inspect the CRS of datasets in the VEDA STAC: https://github.com/NASA-IMPACT/veda-docs/blob/ab/zonal-stats-validation/notebooks/tutorials/get-epsgs.ipynb. At the bottom, you can see the counts: most datasets are 4326, and 17 are 3395, a projection used for all of the datasets associated with Houston. I have a few questions about the results, so I would appreciate it if @j08lue and @vincentsarago took a look at the notebook, and then we could meet to assess whether any of it is useful. I still think finding another expert in raster calculations would be helpful to assess what other testing or documentation might be useful.
@j08lue @vincentsarago thanks for meeting today so late! As I mentioned on the call, I am going to pause investigating some of the open questions I have in those notebooks, but it sounds like @vincentsarago is willing to investigate why the coverage array is all 1s when both the dataset and the GeoJSON bounding box are in the 4326 projection and the bounding box only partially covers some of the data grid cells. @vincentsarago, I will send you the file that goes along with that notebook.
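For reference, this is the kind of coverage array one would expect from a partial overlap. The sketch below is pure numpy on a hypothetical unit grid, not rio-tiler's actual implementation: the coverage of a pixel is the fraction of its area inside the box, so boundary pixels should get values strictly between 0 and 1 rather than all 1s.

```python
import numpy as np

def coverage(x_edges, y_edges, box):
    """Fraction of each pixel's area covered by an axis-aligned box.
    x_edges / y_edges are pixel-boundary coordinates (length n+1)."""
    xmin, ymin, xmax, ymax = box
    # Per-axis overlap length of each pixel with the box, clipped at 0.
    ox = np.clip(np.minimum(x_edges[1:], xmax)
                 - np.maximum(x_edges[:-1], xmin), 0.0, None)
    oy = np.clip(np.minimum(y_edges[1:], ymax)
                 - np.maximum(y_edges[:-1], ymin), 0.0, None)
    # Fractional coverage per axis, combined into a 2-D array.
    return np.outer(oy / np.diff(y_edges), ox / np.diff(x_edges))

edges = np.arange(0.0, 5.0)  # a 4x4 grid of unit pixels
cov = coverage(edges, edges, (0.5, 0.5, 3.5, 3.5))

assert cov[1, 1] == 1.0   # interior pixel: fully covered
assert cov[0, 1] == 0.5   # edge pixel: half covered
assert cov[0, 0] == 0.25  # corner pixel: quarter covered
```

If the API returns all 1s for a geometry like this, the partial-pixel weighting is effectively being skipped.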
Status update:
Another status update:
Statistics calculations were fixed. We successfully completed a comparison between a known method for calculating area-weighted statistics (for datasets in geographic coordinates) and the latest rio-tiler, showing that rio-tiler produces results that agree well enough with the known method.

Here is a gist with a notebook doing the comparison: https://gist.github.com/j08lue/eb65d3d816878e9bcc53376e42c0bae3#file-zonal-statistics-validation-south-america-total-fluxes-ipynb

There is a little gotcha in my implementation of the rio-tiler code: I use the area calculated from geographic coordinates to convert the rio-tiler-calculated averages (averages are all we get) to total fluxes, so they are comparable and in familiar units. I also tried changing the geographic-coordinates formula to calculate averages instead; the result was the same: rio-tiler's result is slightly lower (0.1%).

I'll move forward and implement this method in the GHG Center Exploration & Analysis interface, and then we can repeat the test there. What remains to be done is to repeat the same benchmark test in the UI, to validate the end-user result.
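For context, the mean-to-total-flux conversion described above amounts to multiplying the returned mean by the region's geographic area. A minimal sketch with made-up numbers (the flux value and bounding box below are hypothetical, not taken from the gist):

```python
import numpy as np

R = 6_371_000.0  # mean Earth radius in meters (spherical assumption)

def box_area(lat_s, lat_n, lon_w, lon_e):
    """Spherical area (m^2) of a lat/lon bounding box."""
    return R**2 * np.radians(lon_e - lon_w) * (
        np.sin(np.radians(lat_n)) - np.sin(np.radians(lat_s)))

# Hypothetical area-weighted mean flux in kg m^-2 s^-1, as rio-tiler
# would return it (rio-tiler gives us averages, not totals).
mean_flux = 2.5e-9

# Rough South America bounding box (illustrative only).
area = box_area(-55.0, 12.0, -82.0, -34.0)

# Total flux in kg s^-1: mean flux times the region's area.
total_flux = mean_flux * area
assert total_flux > 0
assert area < 4 * np.pi * R**2  # sanity check: smaller than the whole Earth
```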
Context
The VEDA data services APIs include a service to calculate zonal statistics (e.g. average, median, standard deviation, min, max) over (subsets of) a raster / gridded dataset.
Some of these datasets span large geographic areas and use projections in which grid cells / pixels represent varying real-world areas. A common case is global datasets of climate variables on a regular lat/lon grid.
Some of the statistics we calculate are sensitive to the represented area: simply averaging data points despite their varying grid-cell / pixel areas gives a result that over-represents the data points with small areas, which on lat/lon grids are those at higher latitudes. For example, for a field of global sea surface temperatures that are lower towards the poles, the plain, unweighted average would be lower than the accurate, weighted one.
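The effect is easy to demonstrate with a small numpy sketch on idealized data (not an actual VEDA dataset): on a regular lat/lon grid, grid-cell area is proportional to cos(latitude), so that factor serves as the weight.

```python
import numpy as np

# Idealized zonal-mean sea surface temperature on a regular 1-degree
# lat/lon grid: warm at the equator, colder towards the poles.
lats = np.arange(-89.5, 90.0, 1.0)       # grid-cell center latitudes
sst = 30.0 * np.cos(np.radians(lats))    # hypothetical values, degrees C

# Plain mean treats every grid cell equally.
plain_mean = sst.mean()

# Area-aware mean weights each row by cos(latitude).
weights = np.cos(np.radians(lats))
weighted_mean = np.average(sst, weights=weights)

# The unweighted mean comes out lower because it over-represents
# the small, cold polar cells.
assert plain_mean < weighted_mean
```

With these idealized values the difference is several degrees, which is why the API needs the weighted calculation.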
Method
To mitigate this inaccuracy, we implemented a method that reprojects the source data to an equal-area projection before calculating the statistics. For a lat/lon grid, we could also have calculated the grid-cell / pixel area weights directly, since the formula is simple. However, because the API handles data in arbitrary source projections, we chose the reprojection approach, which should work for any source projection.
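For reference, the "simple formula" alternative mentioned above can be sketched as follows. This is a hedged illustration, not the API's actual code: the exact area of a lat/lon cell on a sphere is R² · Δλ · (sin φ_north − sin φ_south).

```python
import numpy as np

R = 6_371_000.0  # mean Earth radius in meters (spherical assumption)

def cell_area(lat_south, lat_north, dlon_deg):
    """Exact area (m^2) of a lat/lon grid cell on a sphere:
    A = R^2 * dlon * (sin(lat_north) - sin(lat_south))."""
    dlon = np.radians(dlon_deg)
    sin_n = np.sin(np.radians(lat_north))
    sin_s = np.sin(np.radians(lat_south))
    return R**2 * dlon * (sin_n - sin_s)

# A 1x1-degree cell shrinks dramatically towards the poles.
equator_cell = cell_area(0.0, 1.0, 1.0)
polar_cell = cell_area(89.0, 90.0, 1.0)
assert equator_cell > 100 * polar_cell

# Sanity check: summing all cells recovers the sphere's surface area.
total = 360 * sum(cell_area(lat, lat + 1, 1.0) for lat in range(-90, 90))
assert np.isclose(total, 4 * np.pi * R**2)
```

Normalizing these areas gives the same weights as the cos(latitude) shortcut, but only for lat/lon grids, which is why the reprojection approach was preferred.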
In the API code, what happens is basically this: #209 (comment). I am happy to give more pointers to the exact code, if need be.
The ask
We need to test the robustness of the API implementation and build confidence that it is ready for a production release.
Approach
Whatever works: we have a live API for testing that has access to all the datasets in the GHG Center and VEDA STAC catalogs.
The results should be documented in an executable Jupyter Notebook, for future reference.
A notebook that runs a validation for a standard case is already in the VEDA docs and can be downloaded from the docs repo. You can launch this notebook directly in the GHG Center JupyterLab with this link: https://hub.ghg.center/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FNASA-IMPACT%2Fveda-docs%2F&urlpath=lab%2Ftree%2F%2Fnotebooks%2Ftutorials%2Fzonal-statistics-validation.ipynb&branch=main
Acceptance criteria