Story time.
Previously, @dulte raised the issue of "Missing model data is interpolated when colocate_time=True " (#736) and tried to address it through a masking approach in #755.
I offered to help with this, which led me to `_colocate_site_data_helper_timecol` (`pyaerocom/pyaerocom/colocation.py`, line 453 at commit 693def7).
- In order to colocate in time, pyaerocom resamples the time series to the lowest (coarsest) temporal resolution, which may also be offset between the model and the observations.
- The DataFrames resulting from this resampling are then concatenated using the keys from both `ref` (aka observations) and `data` (aka model).
- The concatenated DataFrame, called `merged`, then has its values interpolated according to the observation's index, `obs_index`.
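The steps above can be sketched with toy pandas data. This is a minimal illustration of the mechanism only, not the actual pyaerocom code; the values and the one missing observation are made up:

```python
import numpy as np
import pandas as pd

# Observations every 13 days starting 2010-01-01, with one value
# deliberately missing (mimicking a gap in the measurements).
obs = pd.Series(10.0, index=pd.date_range("2010-01-01", periods=5, freq="13D"))
obs.iloc[2] = np.nan  # 2010-01-27 is missing

# Model resampled to the same 13-day resolution, but offset by two days.
model = pd.Series(10.0, index=pd.date_range("2010-01-03", periods=5, freq="13D"))

# Concatenate on the union of both (offset) time indices.
merged = pd.concat([obs, model], axis=1, keys=["ref", "data"])

# Interpolate along the time index, then select the observation times.
merged = merged.interpolate(method="index").reindex(obs.index)
print(merged)
```

Note that the deliberately missing observation at 2010-01-27 comes back filled in by the interpolation, which is exactly the behaviour reported in #736.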
Here is the practical meaning of this, based on the test `test__colocate_site_data_helper_timecol` in `pyaerocom/tests/test_colocation.py` (line 86 at commit 693def7), using cases `S4` and `S3` from the parameterization. If you are interested in recreating these results, you can do so by running pytest on this specific function.
In this example, we are considering `concpm10`. The model output `stat_data` is at a daily resolution starting 2010-01-03:

```
stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-04T00:00:00.000000000',
       '2010-01-05T00:00:00.000000000', '2010-01-06T00:00:00.000000000',
       '2010-01-07T00:00:00.000000000', '2010-01-08T00:00:00.000000000',
       '2010-01-09T00:00:00.000000000', '2010-01-10T00:00:00.000000000',
       '2010-01-11T00:00:00.000000000', '2010-01-12T00:00:00.000000000',
       '2010-01-13T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
       ...
```
The observations, `stat_data_ref`, are at a somewhat unusual (but by no means out of the question) time resolution of `"ts_type": "13daily"`, starting 2010-01-01:

```
stat_data_ref
StationData: {'dtime': array(['2010-01-01T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
       '2010-01-27T00:00:00.000000000', '2010-02-09T00:00:00.000000000',
       '2010-02-22T00:00:00.000000000', '2010-03-07T00:00:00.000000000',
       '2010-03-20T00:00:00.000000000', '2010-04-02T00:00:00.000000000',
       '2010-04-15T00:00:00.000000000', '2010-04-28T00:00:00.000000000',
       ...
```
After resampling to the lowest (coarsest) resolution, the model output `stat_data`, which was available at a daily frequency, is now at the coarser "13daily" (roughly biweekly) frequency:

```
stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-16T00:00:00.000000000',
       '2010-01-29T00:00:00.000000000', '2010-02-11T00:00:00.000000000',
       '2010-02-24T00:00:00.000000000', '2010-03-09T00:00:00.000000000',
       '2010-03-22T00:00:00.000000000', '2010-04-04T00:00:00.000000000',
       '2010-04-17T00:00:00.000000000', '2010-04-30T00:00:00.000000000',
       '2010-05-13T00:00:00.000000000', '2010-05-26T00:00:00.000000000',
       ...
```
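The two-day offset between the two 13-day grids follows from how pandas anchors the resampling bins at the start of each series. A minimal sketch with made-up daily values (only the resulting time grid matters here):

```python
import pandas as pd

# Daily model values starting 2010-01-03, as in the example above
# (constant dummy values; we only care about the time grid).
daily = pd.Series(10.0, index=pd.date_range("2010-01-03", periods=60, freq="D"))

# Resampling to 13-day bins anchors the grid at the series' own start,
# yielding 01-03, 01-16, 01-29, ... -- offset from the observation grid
# of 01-01, 01-14, 01-27, ...
coarse = daily.resample("13D").mean()
print(coarse.index[:4])
```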
The indices where the observations are null are stored in `obs_isnan`, and the observation index `obs_idx` is also stored. The two DataFrames are then concatenated into `merged`. Note the differences in the time steps:

```
merged
             ref       data
2010-01-01  10.0        NaN
2010-01-03   NaN   6.923077
2010-01-14   NaN        NaN
2010-01-16   NaN  10.000000
2010-01-27  10.0        NaN
...          ...        ...
2011-10-28   NaN  10.000000
2011-11-10   NaN  10.000000
2011-11-23   NaN  10.000000
2011-12-06   NaN  10.000000
2011-12-19   NaN  10.000000
```
The `merged` DataFrame is then interpolated using the `"index"` method from pandas and reindexed according to the observation's index:

```
merged
             ref       data
2010-01-01  10.0        NaN
2010-01-14  10.0   9.526627
2010-01-27  10.0  10.000000
2010-02-09  10.0  10.000000
2010-02-22  10.0  10.000000
2010-03-07  10.0  10.000000
2010-03-20  10.0  10.000000
2010-04-02  10.0  10.000000
...
```
Note that the model output `data` still has a `NaN` in it (the point of the original issue). Also, due to resampling the model to the lowest resolution, places where we previously had data are now filled with interpolated values. This is the so-called "phantom model data" of #736.
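For reference, the masking idea can be sketched as below. This is my paraphrase of the general approach attempted in #755, not the PR's actual code: record `obs_isnan` before merging, and re-apply it after the interpolation so that genuinely missing observations stay missing:

```python
import numpy as np
import pandas as pd

# Toy observations (one genuinely missing) and offset model values.
obs = pd.Series([10.0, np.nan, 10.0],
                index=pd.to_datetime(["2010-01-01", "2010-01-14", "2010-01-27"]))
model = pd.Series([6.9, 10.0, 10.0],
                  index=pd.to_datetime(["2010-01-03", "2010-01-16", "2010-01-29"]))

obs_isnan = obs.isna()  # mask recorded before merging

merged = pd.concat([obs, model], axis=1, keys=["ref", "data"])
merged = merged.interpolate(method="index").reindex(obs.index)

# Without this re-masking step, the missing observation at 2010-01-14
# would have been silently filled in by the interpolation.
merged.loc[obs_isnan[obs_isnan].index, "ref"] = np.nan
print(merged)
```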
Moreover, the interpolation scheme based on the index fails to consider what the values actually represent. For example, data reported at a monthly resolution might represent the mean `concpm10` over that month, but the index-based interpolation scheme will simply return a value from a linear interpolation of the neighboring valid entries. When dealing with daily, weekly, and monthly data, this is less of an issue. However, for data collected over non-standard intervals and reported at non-standard frequencies (e.g., EBAS VOCs), this approach can be highly sub-optimal.
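To make this concrete, here is a small sketch with hypothetical monthly means: index-based interpolation treats the two values as point samples at their timestamps and draws a straight line in time between them, ignoring that each value represents an average over a reporting period:

```python
import pandas as pd

# Hypothetical monthly means: January averaged 5, March averaged 15,
# with a mid-February entry missing.
means = pd.Series(
    [5.0, None, 15.0],
    index=pd.to_datetime(["2010-01-01", "2010-02-14", "2010-03-01"]),
)

# "index" interpolation is purely linear in elapsed time between the two
# neighbouring valid entries; the averaging periods play no role.
filled = means.interpolate(method="index")
print(filled)
```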
@dulte and I brought this up with @jgriesfeller, and he suggested that the best way to proceed is to resample down to the finest temporal resolution, thereby reducing but not eliminating the errors. A scientific discussion should probably motivate the programming solution in this case, and implementing that solution will likely be challenging.