Code and data for implementing a k nearest neighbor (k-NN) bootstrap resampling approach for generating influent time series for water treatment.
Raseman, W.J., Rajagopalan, B., Kasprzyk, J.R., Kleiber, W., 2020. Nearest neighbor time series bootstrap for generating influent water quality scenarios. Stochastic Environmental Research and Risk Assessment. DOI: 10.1007/s00477-019-01762-3
All dependencies are freely and openly available:
- R (version 3.5.0)
- RStudio
- R packages: all R packages contained in the .R files must be installed before running the scripts.
- Download or clone this GitHub repository. If you've downloaded the repo, unzip the directory.
- Navigate to the repository, and open the .Rproj file.
- Open run_all_scripts.Rin RStudio and click "Source".
- Wait for simulations to run: it may take several hours.
To reduce computation time, you can edit the number of simulations (default is 2500) by altering nsims before running run_all_scripts.R
There are five different scripts that make up the analysis in this repository:
- 01_import_clean.R: import and clean observed water quality data
- 02_create_ts.R: interpolate between missing data points and create complete time series dataset
- 03_visualize_ts.R: plot complete, monthly time series
- 04_simulate_kNN.R: generate synthetic influent water quality data using k-NN resampling algorithm
- 05_visualize_statistics.R: visualize statistics of both observed and simulated datasets
Each script creates a function that is saved to ./lib and is loaded be loaded by run_all_scripts.R. If any changes are made to the above scripts, they need to be run and reloaded by run_all_scripts.R to redo the analysis.
Two datasets are included in the analysis. The first is a water quality dataset of the Cache la Poudre River from the City of Fort Collins Utility. This dataset has been cleaned (as described in 01_import_clean.R) and missing values have been interpolated (as described in 02_create_ts.R). The second dataset is not a water quality dataset, rather it is a record of temperature and precipitation, but is used as a reference because it is a long multivariate dataset.