Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/implement daily bcsd #28

Merged
merged 39 commits into from
Nov 5, 2020
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
2a35885
first few steps of daily bcsd - custom grouper class
dgergel Oct 29, 2019
c1c31b8
add test notebook for daily bcsd
dgergel Oct 29, 2019
18068b0
fixing merge conflict after pulling in upstream master
dgergel Apr 2, 2020
1eb3b39
updating daily bcsd implementation to include monthly resampling to d…
dgergel Apr 2, 2020
990ddb2
remove old test notebook
dgergel Apr 2, 2020
9400f97
add updated notebook for testing daily bias correction
dgergel Apr 2, 2020
e281e8a
fix merge conflict
dgergel Jun 8, 2020
8c65c36
add old helper function to check datetime indices
dgergel Jun 9, 2020
4a3dfeb
fix naming in previous commit
dgergel Jun 9, 2020
eb090b2
add check_datetime_index import from .utils
dgergel Jun 9, 2020
cdd0838
DOY grouper fixes, still needs cleanup
dgergel Jun 23, 2020
f6a12da
cleanup bcsd temperature and bcsd class helper functions
dgergel Jun 24, 2020
84d3aa2
add bc model doy updates for precip
dgergel Jun 24, 2020
48f5bc9
clean up bcsdprecipitation model
dgergel Jun 24, 2020
6daa1ba
Merge branch 'master' of github.com:jhamman/scikit-downscale into fea…
dgergel Jun 24, 2020
8b6ef5a
remove test notebook
dgergel Jun 24, 2020
1ae1a6d
remove commented out code
dgergel Jun 24, 2020
1d9cdc5
fix formatting check error
dgergel Jun 25, 2020
1f58985
fix isort error
dgergel Jun 25, 2020
9036b9d
remove trailing whitespace
dgergel Jun 25, 2020
65fe1d6
fix persistent formatting check error
dgergel Jun 25, 2020
1f2948e
fix merge conflict
dgergel Sep 25, 2020
bfc051c
move DAY_GROUPER and MONTH_GROUPER to groupers module
dgergel Sep 25, 2020
3621ce4
fix tests/accidental mod from merge conflict
dgergel Sep 25, 2020
7a5455f
fix assertion error in failing tests
dgergel Sep 25, 2020
d41ebe7
fix tests, comment out validate_data fctn that isn't working atm
dgergel Sep 25, 2020
56162a6
updates to PaddedDOYGrouper from review suggestions
dgergel Sep 27, 2020
38ac4ab
add groupers classes to api docs
dgergel Sep 27, 2020
e75909d
add support for leap days into PaddedDOYGrouper
dgergel Sep 29, 2020
3ada0eb
further updates to leap year handling in PaddedDOYGrouper
dgergel Sep 29, 2020
ed932a7
attempt to fix failing CI tests
Sep 30, 2020
740befc
fix lint CI test
dgergel Sep 30, 2020
7e65248
fix lint CI test with pre-commit
dgergel Sep 30, 2020
acce768
quick bug fix to paddeddoygrouper
dgergel Oct 14, 2020
65db340
add unit tests for nasa-nex option and PaddedDOYGrouper
dgergel Oct 14, 2020
fc85b9d
fix linter errors
dgergel Oct 14, 2020
0e43bd0
fix flake8
dgergel Oct 14, 2020
b046af1
remove check_datetime_index fctn
dgergel Nov 3, 2020
f9ee410
remove fctn import
dgergel Nov 3, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions skdownscale/pointwise_models/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
from .bcsd import BcsdPrecipitation, BcsdTemperature
from .core import PointWiseDownscaler
from .gard import AnalogRegression, PureAnalog
from .groupers import DAY_GROUPER, MONTH_GROUPER, PaddedDOYGrouper
from .utils import LinearTrendTransformer, QuantileMapper
from .zscore import ZScoreRegressor
110 changes: 79 additions & 31 deletions skdownscale/pointwise_models/bcsd.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,31 +5,56 @@
from sklearn.utils.validation import check_is_fitted

from .base import TimeSynchronousDownscaler
from .utils import QuantileMapper


def MONTH_GROUPER(x):
return x.month
from .groupers import DAY_GROUPER, MONTH_GROUPER, PaddedDOYGrouper
from .utils import QuantileMapper, check_datetime_index, ensure_samples_features


class BcsdBase(TimeSynchronousDownscaler):
""" Base class for BCSD model.
"""

_fit_attributes = ['y_climo_', 'quantile_mappers_']
_fit_attributes = ["y_climo_", "quantile_mappers_"]
_timestep = 'M'

def __init__(self, time_grouper=MONTH_GROUPER, return_anoms=True, qm_kwargs={}):
def __init__(
self,
time_grouper=MONTH_GROUPER,
climate_trend_grouper=DAY_GROUPER,
return_anoms=True,
**qm_kwargs):

self.time_grouper = time_grouper
self.climate_trend_grouper = climate_trend_grouper
self.climate_trend = MONTH_GROUPER
self.return_anoms = return_anoms
self.qm_kwargs = qm_kwargs
jhamman marked this conversation as resolved.
Show resolved Hide resolved

def _pre_fit(self):
if isinstance(self.time_grouper, str):
self.time_grouper_ = pd.Grouper(freq=self.time_grouper)
if time_grouper == "daily_nasa-nex":
self.time_grouper = PaddedDOYGrouper
self.climate_trend_grouper = climate_trend_grouper
self.timestep = "daily"
else:
self.time_grouper_ = pd.Grouper(freq=self.time_grouper)
else:
self.time_grouper_ = self.time_grouper
self.timestep = "monthly"

def _create_groups(self, df, climate_trend=False):
""" helper function to create groups by either daily or month
"""
if self.timestep == "monthly":
return df.groupby(self.time_grouper)
elif self.timestep == "daily":
if climate_trend:
# group by day only rather than also +/- offset days
return df.groupby(self.climate_trend_grouper)
else:
return self.time_grouper(df)
else:
raise TypeError("unexpected time grouper type %s" % self.time_grouper)

def _qm_fit_by_group(self, groups):
""" helper function to fit quantile mappers by group

Expand All @@ -51,6 +76,21 @@ def _qm_transform_by_group(self, groups):
dfs.append(pd.DataFrame(qmapped, index=group.index, columns=group.columns))
return pd.concat(dfs).sort_index()

def _remove_climatology(self, obj, climatology, climate_trend=False):
"""helper function to remove climatologies

"""
dfs = []
for key, group in self._create_groups(obj, climate_trend):
if self.timestep == "monthly":
dfs.append(group - climatology.loc[key].values)
elif self.timestep == "daily":
dfs.append(group - climatology.loc[key])

result = pd.concat(dfs).sort_index()
assert obj.shape == result.shape
return result


class BcsdPrecipitation(BcsdBase):
""" Classic BCSD model for Precipitation
Expand Down Expand Up @@ -91,13 +131,15 @@ def fit(self, X, y):
if self.n_features_in_ != 1:
raise ValueError(f'BCSD only supports 1 feature, found {self.n_features_in_}')

y_groups = y.groupby(self.time_grouper)
y_groups = self._create_groups(y)
# calculate the climatologies
self.y_climo_ = y_groups.mean()

if self.y_climo_.values.min() <= 0:
raise ValueError('Invalid value in target climatology')

# fit the quantile mappers
# TO-DO: do we need to detrend the data before fitting the quantile mappers??
self._qm_fit_by_group(y_groups)

return self
Expand All @@ -119,23 +161,29 @@ def predict(self, X):
X = self._validate_data(X)

# Bias correction
# apply quantile mapping by month
Xqm = self._qm_transform_by_group(X.groupby(self.time_grouper))
# apply quantile mapping by month or day
Xqm = self._qm_transform_by_group(self._create_groups(X, climate_trend=True))

# calculate the anomalies as a ratio of the training data
if self.return_anoms:
return self._calc_ratio_anoms(Xqm, self.y_climo_)
else:
return Xqm

def _calc_ratio_anoms(self, obj, climatology):
def _calc_ratio_anoms(self, obj, climatology, climate_trend=False):
""" helper function for dividing day groups by climatology
"""
dfs = []
for key, group in obj.groupby(self.time_grouper):
dfs.append(group / climatology.loc[key].values)
for key, group in self._create_groups(obj, climate_trend):
if self.timestep == "monthly":
dfs.append(group / climatology.loc[key].values)
else:
dfs.append(group / climatology.loc[key])

out = pd.concat(dfs).sort_index()
assert obj.shape == out.shape
return out
result = pd.concat(dfs).sort_index()
assert obj.shape == result.shape

return result

def _more_tags(self):
return {
Expand Down Expand Up @@ -175,14 +223,17 @@ def fit(self, X, y):
-------
self : returns an instance of self.
"""

self._pre_fit()
X, y = self._validate_data(X, y, y_numeric=True)
if self.n_features_in_ != 1:
raise ValueError(f'BCSD only supports 1 feature, found {self.n_features_in_}')

# make groups for day or month
y_groups = self._create_groups(y)

# calculate the climatologies
self._x_climo = X.groupby(self.time_grouper).mean()
y_groups = y.groupby(self.time_grouper)
self._x_climo = self._create_groups(X).mean()
self.y_climo_ = y_groups.mean()

# fit the quantile mappers
Expand All @@ -206,29 +257,26 @@ def predict(self, X):
check_is_fitted(self)
X = self._check_array(X)

# X = ensure_samples_features(X) # don't need????

# Calculate the 9-year running mean for each month
def rolling_func(x):
return x.rolling(9, center=True, min_periods=1).mean()

X_rolling_mean = X.groupby(self.time_grouper, group_keys=False).apply(rolling_func)
X_rolling_mean = X.groupby(self.climate_trend, group_keys=False).apply(rolling_func)

# calc shift
# why isn't this working??
# X_shift = X_rolling_mean.groupby(self.time_grouper) - self._x_climo
X_shift = self._remove_climatology(X_rolling_mean, self._x_climo)
# remove climatology from 9-year monthly mean climate trend
X_shift = self._remove_climatology(X_rolling_mean, self._x_climo, climate_trend=True)

# remove shift
# remove shift from model data
X_no_shift = X - X_shift

# Bias correction
# apply quantile mapping by month
Xqm = self._qm_transform_by_group(X_no_shift.groupby(self.time_grouper))
# apply quantile mapping by month or day
Xqm = self._qm_transform_by_group(self._create_groups(X_no_shift, climate_trend=True))

# restore the shift
# restore the climate trend
X_qm_with_shift = X_shift + Xqm
# calculate the anomalies

# return bias corrected absolute values or calculate the anomalies
if self.return_anoms:
return self._remove_climatology(X_qm_with_shift, self.y_climo_)
else:
Expand Down
55 changes: 55 additions & 0 deletions skdownscale/pointwise_models/groupers.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
import numpy as np
import pandas as pd


class SkdownscaleGroupGeneratorBase:
pass


def MONTH_GROUPER(x):
return x.month

def DAY_GROUPER(x):
return x.day

class PaddedDOYGrouper(SkdownscaleGroupGeneratorBase):
def __init__(self, df, offset=15):
self.df = df
self.offset = offset
self.max = 365
self.days_of_year = np.arange(1, 366)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which calendars will this support? I'm thinking it will be useful to check that the calendar of df.index is valid for this grouper method.

self.days_of_year_wrapped = np.pad(self.days_of_year, 15, mode="wrap")
self.n = 1

def __iter__(self):
self.n = 1
return self

def __next__(self):
# n as day of year
if self.n > self.max:
raise StopIteration

i = self.n - 1
total_days = (2 * self.offset) + 1

# create day groups with +/- days
# number of days defined by offset
first_half = self.days_of_year_wrapped[i : i + self.offset]
sec_half = self.days_of_year_wrapped[self.n + self.offset : i + total_days]
all_days = np.concatenate((first_half, np.array([self.n]), sec_half), axis=0)

assert len(set(all_days)) == total_days, all_days
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's convert this to a proper ValueError:

if len(set(all_days)) != total_days:
    raise ValueError('...say something meaningful...')


result = self.df[self.df.index.dayofyear.isin(all_days)]

self.n += 1

return self.n - 1, result

def mean(self):
list_result = []
for key, group in self:
list_result.append(group.mean().values[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be slightly faster if you allocate a numpy array via np.full and populate the values as result[key] = group.mean().values[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

result = pd.Series(list_result, index=self.days_of_year)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be a DataFrame. In Pandas, df.groupby(lambda x: x.month).mean() -> pd.DataFrame if df is a DataFrame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should actually be a Series? It seems like if I make it a DataFrame, the time index gets wonky

Copy link
Contributor Author

@dgergel dgergel Sep 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh nm, got this to work along with some other updates for your comment below

return result
28 changes: 28 additions & 0 deletions skdownscale/pointwise_models/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -168,3 +168,31 @@ def transform(self, X):

def _more_tags(self):
return {'_xfail_checks': {'check_methods_subset_invariance': 'because'}}

def ensure_samples_features(obj):
""" helper function to ensure sammples conform to sklearn format
requirements
"""
if isinstance(obj, pd.DataFrame):
return obj
if isinstance(obj, pd.Series):
return obj.to_frame()
if isinstance(obj, np.ndarray):
if obj.ndim == 2:
return obj
if obj.ndim == 1:
return obj.reshape(-1, 1)
return obj # hope for the best, probably better to raise an error here

def check_datetime_index(obj, timestep):
""" helper function to check datetime index for compatibility
"""
if isinstance(obj, pd.DataFrame):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should happen when obj is not a DataFrame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually don't think we really need this function - I didn't end up using it in the NASA-NEX daily implementation and I think it's redundant with the testing that is in place now. seem reasonable @jhamman?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I don't think we need this anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - just took this out.

if timestep == "daily":
obj.index = obj.index.values.astype("datetime64[D]")
return obj
elif timestep == "monthly":
obj.index = obj.index.values.astype("datetime64[M]")
return obj
else:
raise ValueError("this frequency has not yet been implemented in scikit-downscale")