Automatic threshold for peaks-over-threshold #268

cmichelenstrofer · 2023-10-03T17:00:51Z

replaces #214.

MHKiT already has a peaks-over-threshold method, which takes in threshold as one of the inputs. This method here is to automatically calculate the "best" threshold. It is from a paper by Vince and Seongho:

Neary, V. S., S. Ahn, B. E. Seng, M. N. Allahdadi, T. Wang, Z. Yang and R. He (2020). "Characterization of Extreme Wave Conditions for Wave Energy Converter Design and Project Risk Assessment.” J. Mar. Sci. Eng. 2020, 8(4), 289; https://doi.org/10.3390/jmse8040289.

They had it in Matlab and I converted it to Python here.

Sync `develop` & `master`

This reverts commit 4477580.

cmichelenstrofer · 2023-10-03T21:37:24Z

@ssolson I cleaned this up quite a bit and made a Matlab minimal example form the code I got from Seongho. He unfortunately left which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.

The failing errors don't seem to be related to this PR.

ssolson · 2023-10-04T17:17:52Z

@ssolson I cleaned this up quite a bit and made a Matlab minimal example form the code I got from Seongho. He unfortunately left which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.

The failing errors don't seem to be related to this PR.

Carlos thank you.

The errors are unrelated to this PR and are fixed in #241 which should be reviewed and approved soon.

I took a bit of a look at his today. Running the tests I see that I get the same answer as you have specified but I was wondering how big is the difference from the Matlab version?

I started looking into the generalized pareto distribution in scipy vs matlab via the docs but I was ultimately hindered by not being able to run the Matlab version without getting Matlab installed on my machine. It seems like the idea would be to step-by-step compare the python vs Matlab output to see where the difference arises.

@akeeste are you setup to run Matlab and do you have time to take a look into this further?

akeeste · 2023-10-04T17:34:52Z

@ssolson I'm set-up for this. I will review this and debug next week.

cmichelenstrofer · 2023-10-04T20:24:21Z

@ssolson @akeeste Thanks! The difference I'm getting is this:

Matlab: 99.53%, 1.041 m
Python: 98.94%, 1.028 m

@akeeste LMK if you want to go over it together.

akeeste · 2023-10-12T19:25:28Z

@cmichelenstrofer I made some progress today and noting where I'm at for reference. I noticed a couple things, but I'm still trying to figure out how much they affect the end result, given the input Hs used in your working example. I'll refer to Seongho's original code as the 'matlab' version and your refactored code as the 'python version'

xarray's stats.scoreatpercentile gives slightly different values than MATLAB's prctile function. The differences are very small but may become more significant as the threshold range is refined. Also, they recommend moving to numpy.percentile
- Percentile, xarray threshold, matlab threshold
- 0.99, 1.0294, 1.0295
- 0.991, 1.0311, 1.0313
- 0.992, 1.0334, 1.0334
- 0.995, 1.0406, 1.0506
the refinement logic is working a little differently in the python version. the condition rate_per_year < 2 is checked for every threshold and refinement iteration in Python, but only checked every refinement iteration in MATLAB. So all 6 initial thresholds are used in MATLAB. But in the python version only the first 3 thresholds pass the check for 2+ peaks/year.
In python for the first refinement iteration, the thresholds 0.99, 0.991, 0.992 all give an identical correlation. This seems suspicious and is not the behavior seen in the MATLAB code. The _pearks_over_threshold method gives different output for thresholds of e.g. 0.99 and 0.991, but the distribution fit returns identical distribution_parameters.

I'll continue debugging next week

akeeste · 2023-10-18T20:51:39Z

I'm still digging into the settings of various functions, but I can confirm that the functions for percentile, Generalized Pareto distribution fitting, and probability plotting functions all give slightly different results between their respective MATLAB and SciPy versions. I'll see if there are optional settings we can use to converge the results

Updates:

Percentile

MATLAB's prctile function (both 'exact/linear' and 'approximate/T-digest' methods) gives identical results to NumPy's np.percentile with the Hazen method

The Hazen method requires NumPy>1.22.0. For older versions (1.22.0 > Numpy > 1.9), the 'higher' linear method is very close, but not identical to MATLAB's percentile.

Generalized Pareto distribution fitting

MATLAB and SciPy are giving different distribution parameters when the excess is fit to a Generalized Pareto Distribution.

Both MATLAB's GeneralizedParetoDistribution (description) and SciPy's stats.genpareto have an identical definition. MATLAB's fit and SciPy's fit both estimate the distribution parameters using Maximum Likelihood Estimation and assume that the location/threshold is zero. A zero threshold is correct given that both versions of the code already use the excess (peaks-threshold) to estimate the tail distribution.

MATLAB fitting does not take any other options. SciPy takes initial guesses for shape, scale, etc but that outcome seems unpredictable. I'm not sure what else can be done to make the distribution fitting converge. There are not many points in the tail. In this example, even the 90th percentile only has 9 points. This will only decrease as the percentile is iteratively increased.

Example with excess defined using a 90th percentile threshold:

   excess = [0.016299, 0.025899, 0.006499, 0.024699, 0.005099, 0.017399, 0.004299, 0.013899, 0.000199]
   shape, scale, threshold:
   MATLAB: -1.21434, 0.0314501, 0.0
   SciPy: -1.62408, 0.0420621, 0.0
   SciPy (using MATLAB results as initial guess): -1.26645, 0.032800, 0.0

Probability plotting

My stats background begins to fail here. I'm not confident in how the correlations are each defined, but when given the same distribution parameters the results are nearly identical.

The MATLAB code's result is defined as the correlation between the quantile of the input (excess) and the quantile of the fitted distribution (qqplot)
The Python code's result is defined as the r value of the least-squares fit to the excess vs the quantile of the fitted distribution (probplot)

Example, given the resultant MATLAB distribution above:

MATLAB correlation: 0.978299509175096
SciPy correlation: 0.97872960601213 (0.044% error)

mhkit/loads/extreme.py

akeeste

@cmichelenstrofer Suggestions to improve a couple minor issues with the storm peak calculation. Otherwise, the storm peak calculation seems to pull all of the same independent storm peaks that the original matlab code does.

The comparison is closer. I think the remaining discrepancy is from the distribution fitting described above. Comparison after these changes:
Matlab: 99.53%, 1.041 m
Python: 99.13%, 1.032 m

mhkit/loads/extreme.py

Co-authored-by: Adam Keester <[email protected]>

cmichelenstrofer · 2023-10-19T19:11:51Z

@akeeste this is great! Thanks for looking into this in so much detail! I think the results are basically identical. The very tiny difference is from the techniques to fit the distribution. Do you think it is ready merge?

akeeste · 2023-10-19T19:42:11Z

@cmichelenstrofer Sounds good. Can you first remove the MATLAB working example and updated the test result to 99.13% and 1.032m. After that we can merge

mhkit/tests/loads/test_loads.py

akeeste · 2023-10-19T20:59:44Z

Seems like I could commit the suggestion directly to the branch. I mistyped last time, the percentile still returns as a fraction. I updated the test to compare to 0.9913. The test should pass now

mhkit/tests/loads/test_loads.py

Update threshold test value one more time

mhkit/loads/extreme.py

ssolson · 2023-10-20T17:40:23Z

Overall, I'm happy with this PR as is. I have a few suggestions, neither required but I think would round out the PR nicely.

Consider consolidating all the scipy imports onto a single line.
I recommend adding a docstring at the beginning of the file. While we haven't formally adopted this in MHKiT's style guidelines, it's something I've been introducing in files I've worked on.

Lastly, I noticed the use of a nested function. Historically, we've kept these functions separate. While nesting does highlight that the inner function is only needed by the parent function, it can also make the code appear more cluttered. I'm not opposed to the current structure, but I wanted to initiate a brief discussion on this topic. What are your thoughts?

akeeste · 2023-10-30T19:29:13Z

To wrap up this PR I

broke out the nested function
consolidated the scipy imports

I leave the submodule-level docstring and any other optional developments for future PRs. All tests have passed. Merging this PR

# MHKiT v0.8.0 We're excited to announce the release of MHKiT v0.8.0, which brings a host of new features, enhancements, and bug fixes across various modules, ensuring compatibility with Python 3.10 and 3.11, and introducing full xarray support for more flexible data handling. Significant updates in the Wave and DOLfYN modules improve functionality and extend capabilities. ## Python 3.10 & 3.11 Support MHKiT now supports python 3.10 and 3.11. Support for 3.12 will follow in the next minor update. - #240 ## Wave Module ### Enhancements: **Automatic Threshold Calculation for Peaks-Over-Threshold**: We've introduced a new feature that automatically calculates the "best" threshold for identifying significant wave events. This method, originally developed by Neary, V. S., et al. in their 2020 study, has now been translated from Matlab to Python, enhancing our existing peaks-over-threshold functionality. **Wave Heights Analysis**: A new function, `wave_heights`, has been added to extract the heights of individual waves from a time series. This function uses zero up-crossing analysis to accurately measure wave heights, improving upon our previous methods which only provided the maximum value between up-crossings. **Enhanced Zero Crossing Analysis**: Building on the above, the zero crossing code previously embedded in `global_peaks` has been isolated into a helper function. This modular approach not only refines the codebase but also supports new functionalities such as calculating wave heights, zero crossing periods, and identifying crests. ### Bug Fixes: **Contour Sampling Error in Wave Contours**: A bug identified in `mhkit.wave.contours.samples_contour` has been resolved. The issue occurred when period samples defined using the maximum period resulted in values outside the interpolation range of the contour data. This was corrected by ensuring that all sampling points are within the interpolation range and adjusting the contour data selection process accordingly. - #268 - #252 - #278 ## Xarray Support MHKiT functions now fully support the use of xarray for passing and returning data. - #279 - #282 - #285 - #302 - #310 ## DOLfYN Thanks to the many user contributions and users of MHKiT the DOLFYN module include a significant number of enhancements and bug fixes. ### Enhancements: **Altimeter Support**: Enhanced the Nortek Signature Reader to add capability for reading ADCP dual profile configurations. **Data Handling Improvements**: Introduced logic to skip messy header data that can accumulate during measurements collected via Nortek software on PCs and Macs. **Instrument Noise Subtraction**: Added a function to subtract instrument noise from turbulence intensity estimation using RMS calculations, providing results that differ by approximately 1% from the existing standard deviation-based "I" property. **Improved File Handling**: Updates for RDI files to handle changing "number of cells" and variable "cell sizes," which are now bin-averaged into the largest cell size. ### Bug Fixes: **Power Spectra Calculation**: Fixed a bug where a given noise value was not being subtracted from the power spectra, and noise was inadvertently added as an input to dissipation rate calculations. **Improved Header Handling**: Allowed RDI reader to skip junk headers effectively. **Nortek Reader C Types Update**: Adjusted C types in the Nortek reader to handle below-zero water temperatures and to allow pitch and roll values to go negative. - #280 - #289 - #290 - #292 - #293 - #294 - #299 ## River & Tidal: D3D Added limits to `variable_interpolation` and added 3 array input capability to `create_points` - #271 ## Developer Experience ### Black formatting Black formatting is now enforced on all MHKiT files. This ensures consistent formatting across the MHKiT package. - #281 ### Linting & Type Hints MHKiT is in the process of enforcing pylint and adding type hints to all functions. Currently this has been achieved and is enforced in the Loads and Power modules. - #288 - #296 ### CI/CD This release introduces significant reduction in testing time for development. This is achieved by reducing the number of tests for pulls against the develop branch and only running hindcast test when changes are made to it. A bug in the hindcast CI was fixed which only ran on changes to the hindcast tests instead of the hindcast module. Additionally the wave and wind hindcast needed to be separated in 2 jobs due to the excessive time taken to run a wind cache. This created a number of follow on PRs around solidifying the logic of these job. A special case for Python 3.8, pip, and Mac OS was added to use homebrew to install NetCDF and HDF5 to get tests to pass. - #241 - #270 - #306 - #311 - #317 - #318 - #319 - #320 - #324 ### Clean Up MHKiT fixed an implementation error where functions used assert instead of built in errors for type and value checking. Additionally these PRs removed unused files, fixed typos, and created an argument which allows users to run CDIP API calls silently. - #276 - #272 - #273

ssolson and others added 4 commits August 31, 2023 07:11

Merge pull request MHKiT-Software#267 from MHKiT-Software/develop

2ccc286

Sync `develop` & `master`

automatic Hs threshold

4477580

Revert "automatic Hs threshold"

dc56001

This reverts commit 4477580.

automatic Hs threshold

6462315

cmichelenstrofer self-assigned this Oct 3, 2023

simplify & include MATLAB example for debugging

c3efa6d

ssolson changed the base branch from master to develop October 4, 2023 16:28

akeeste self-requested a review October 4, 2023 17:35

fix independent storms

8f5dedd

ssolson mentioned this pull request Oct 12, 2023

Run hindcast tests only on change to hindcast module #270

Merged

akeeste reviewed Oct 19, 2023

View reviewed changes

mhkit/loads/extreme.py Outdated Show resolved Hide resolved

akeeste reviewed Oct 19, 2023

View reviewed changes

mhkit/loads/extreme.py Outdated Show resolved Hide resolved

mhkit/loads/extreme.py Outdated Show resolved Hide resolved

mhkit/loads/extreme.py Outdated Show resolved Hide resolved

cmichelenstrofer and others added 4 commits October 19, 2023 12:55

Update mhkit/loads/extreme.py

89bbcba

Co-authored-by: Adam Keester <[email protected]>

Update mhkit/loads/extreme.py

a9f94e3

Co-authored-by: Adam Keester <[email protected]>

Update mhkit/loads/extreme.py

24bc89a

Co-authored-by: Adam Keester <[email protected]>

Update mhkit/loads/extreme.py

3835f7a

Co-authored-by: Adam Keester <[email protected]>

cmichelenstrofer added 2 commits October 19, 2023 14:05

cleanup

fa27382

Merge branch 'DLC' of github.com:cmichelenstrofer/MHKiT-Python into DLC

0954816

akeeste reviewed Oct 19, 2023

View reviewed changes

mhkit/tests/loads/test_loads.py Outdated Show resolved Hide resolved

Update mhkit/tests/loads/test_loads.py

c0d858d

akeeste reviewed Oct 20, 2023

View reviewed changes

mhkit/tests/loads/test_loads.py Outdated Show resolved Hide resolved

Update mhkit/tests/loads/test_loads.py

d0a4ddd

Update threshold test value one more time

ssolson reviewed Oct 20, 2023

View reviewed changes

mhkit/loads/extreme.py Outdated Show resolved Hide resolved

Update extreme.py

80a431e

break out nested function, consolidate scipy imports

a9c58b1

akeeste merged commit abc4395 into MHKiT-Software:develop Oct 30, 2023

cmichelenstrofer deleted the DLC branch October 30, 2023 20:44

ssolson mentioned this pull request Apr 24, 2024

MHKiT v0.8.0 #307

Closed

ssolson mentioned this pull request May 6, 2024

MHKiT v0.8.0 #321

Merged

Automatic threshold for peaks-over-threshold #268

Automatic threshold for peaks-over-threshold #268

Uh oh!

Conversation

cmichelenstrofer commented Oct 3, 2023

Uh oh!

cmichelenstrofer commented Oct 3, 2023

Uh oh!

ssolson commented Oct 4, 2023

Uh oh!

akeeste commented Oct 4, 2023

Uh oh!

cmichelenstrofer commented Oct 4, 2023

Uh oh!

akeeste commented Oct 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akeeste commented Oct 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Percentile

Generalized Pareto distribution fitting

Probability plotting

Uh oh!

Uh oh!

akeeste left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cmichelenstrofer commented Oct 19, 2023

Uh oh!

akeeste commented Oct 19, 2023

Uh oh!

Uh oh!

akeeste commented Oct 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ssolson commented Oct 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akeeste commented Oct 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

akeeste commented Oct 12, 2023 •

edited

Loading

akeeste commented Oct 18, 2023 •

edited

Loading

akeeste left a comment •

edited

Loading

akeeste commented Oct 19, 2023 •

edited

Loading

ssolson commented Oct 20, 2023 •

edited

Loading

akeeste commented Oct 30, 2023 •

edited

Loading