Story time.
Previously, @dulte raised the issue of "Missing model data is interpolated when colocate_time=True " (#736) and tried to address it through a masking approach in #755.
I offered to help with this, which led me to `_colocate_site_data_helper_timecol` (`pyaerocom/pyaerocom/colocation.py`, line 453 at commit 693def7).
- In order to colocate in time, pyaerocom resamples the time series to the lowest (coarsest) temporal resolution, which may also be offset between the model and the observations.
- The DataFrames resulting from this resampling are then concatenated using the keys from both `ref` (aka observations) and `data` (aka model).
- The concatenated DataFrame, called `merged`, then has its values interpolated according to the observation's index, `obs_index`.
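The steps above can be sketched with toy pandas data. This is a minimal illustration of the mechanism only, not the actual pyaerocom code; the values and the one missing observation are made up:

```python
import numpy as np
import pandas as pd

# Observations every 13 days starting 2010-01-01, with one value
# deliberately missing (mimicking a gap in the measurements).
obs = pd.Series(10.0, index=pd.date_range("2010-01-01", periods=5, freq="13D"))
obs.iloc[2] = np.nan  # 2010-01-27 is missing

# Model resampled to the same 13-day resolution, but offset by two days.
model = pd.Series(10.0, index=pd.date_range("2010-01-03", periods=5, freq="13D"))

# Concatenate on the union of both (offset) time indices.
merged = pd.concat([obs, model], axis=1, keys=["ref", "data"])

# Interpolate along the time index, then select the observation times.
merged = merged.interpolate(method="index").reindex(obs.index)
print(merged)
```

Note that the deliberately missing observation at 2010-01-27 comes back filled in by the interpolation, which is exactly the behaviour reported in #736.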
Here is the practical meaning of this, based on the test `test__colocate_site_data_helper_timecol` in `pyaerocom/tests/test_colocation.py` (line 86 at commit 693def7), using cases `S4` and `S3` from the parameterization. If you are interested in recreating these results, you can do so by running pytest on this specific function.
In this example, we are considering `concpm10`. The model output `stat_data` is at a daily resolution starting 2010-01-03:

```
stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-04T00:00:00.000000000',
       '2010-01-05T00:00:00.000000000', '2010-01-06T00:00:00.000000000',
       '2010-01-07T00:00:00.000000000', '2010-01-08T00:00:00.000000000',
       '2010-01-09T00:00:00.000000000', '2010-01-10T00:00:00.000000000',
       '2010-01-11T00:00:00.000000000', '2010-01-12T00:00:00.000000000',
       '2010-01-13T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
       ...
```
The observations, `stat_data_ref`, are at a somewhat unusual (but by no means out of the question) time resolution of `"ts_type": "13daily"`, starting 2010-01-01:

```
stat_data_ref
StationData: {'dtime': array(['2010-01-01T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
       '2010-01-27T00:00:00.000000000', '2010-02-09T00:00:00.000000000',
       '2010-02-22T00:00:00.000000000', '2010-03-07T00:00:00.000000000',
       '2010-03-20T00:00:00.000000000', '2010-04-02T00:00:00.000000000',
       '2010-04-15T00:00:00.000000000', '2010-04-28T00:00:00.000000000',
       ...
```
After resampling to the lowest (coarsest) resolution, the model output `stat_data`, which was available at a daily frequency, is now at the coarser "13daily" (roughly biweekly) frequency:

```
stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-16T00:00:00.000000000',
       '2010-01-29T00:00:00.000000000', '2010-02-11T00:00:00.000000000',
       '2010-02-24T00:00:00.000000000', '2010-03-09T00:00:00.000000000',
       '2010-03-22T00:00:00.000000000', '2010-04-04T00:00:00.000000000',
       '2010-04-17T00:00:00.000000000', '2010-04-30T00:00:00.000000000',
       '2010-05-13T00:00:00.000000000', '2010-05-26T00:00:00.000000000',
       ...
```
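The two-day offset between the two 13-day grids follows from how pandas anchors the resampling bins at the start of each series. A minimal sketch with made-up daily values (only the resulting time grid matters here):

```python
import pandas as pd

# Daily model values starting 2010-01-03, as in the example above
# (constant dummy values; we only care about the time grid).
daily = pd.Series(10.0, index=pd.date_range("2010-01-03", periods=60, freq="D"))

# Resampling to 13-day bins anchors the grid at the series' own start,
# yielding 01-03, 01-16, 01-29, ... -- offset from the observation grid
# of 01-01, 01-14, 01-27, ...
coarse = daily.resample("13D").mean()
print(coarse.index[:4])
```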
The indices where the observations are null are stored in `obs_isnan`, and the observation index `obs_idx` is also stored. The two DataFrames are then concatenated into `merged`. Note the differences in the time steps:

```
merged
             ref       data
2010-01-01  10.0        NaN
2010-01-03   NaN   6.923077
2010-01-14   NaN        NaN
2010-01-16   NaN  10.000000
2010-01-27  10.0        NaN
...          ...        ...
2011-10-28   NaN  10.000000
2011-11-10   NaN  10.000000
2011-11-23   NaN  10.000000
2011-12-06   NaN  10.000000
2011-12-19   NaN  10.000000
```
The `merged` DataFrame is then interpolated using the `"index"` method from pandas and reindexed according to the observation's index:

```
merged
             ref       data
2010-01-01  10.0        NaN
2010-01-14  10.0   9.526627
2010-01-27  10.0  10.000000
2010-02-09  10.0  10.000000
2010-02-22  10.0  10.000000
2010-03-07  10.0  10.000000
2010-03-20  10.0  10.000000
2010-04-02  10.0  10.000000
...
```
Note that the model output `data` still has a `NaN` in it (the point of the original issue). Also, due to resampling the model to the lowest resolution, places where we previously had data are now filled with interpolated values. This is the so-called "phantom model data" of #736.
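For reference, the masking idea can be sketched as below. This is my paraphrase of the general approach attempted in #755, not the PR's actual code: record `obs_isnan` before merging, and re-apply it after the interpolation so that genuinely missing observations stay missing:

```python
import numpy as np
import pandas as pd

# Toy observations (one genuinely missing) and offset model values.
obs = pd.Series([10.0, np.nan, 10.0],
                index=pd.to_datetime(["2010-01-01", "2010-01-14", "2010-01-27"]))
model = pd.Series([6.9, 10.0, 10.0],
                  index=pd.to_datetime(["2010-01-03", "2010-01-16", "2010-01-29"]))

obs_isnan = obs.isna()  # mask recorded before merging

merged = pd.concat([obs, model], axis=1, keys=["ref", "data"])
merged = merged.interpolate(method="index").reindex(obs.index)

# Without this re-masking step, the missing observation at 2010-01-14
# would have been silently filled in by the interpolation.
merged.loc[obs_isnan[obs_isnan].index, "ref"] = np.nan
print(merged)
```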
Moreover, the interpolation scheme based on the index fails to consider what the values actually represent. For example, data reported at a monthly resolution might represent the mean `concpm10` over that month, but the index-based interpolation scheme will simply return a value from a linear interpolation of the neighboring valid entries. When dealing with daily, weekly, and monthly data, this is less of an issue. However, for data collected over non-standard intervals and reported at non-standard frequencies (e.g., EBAS VOCs), this approach can be highly sub-optimal.
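To make this concrete, here is a small sketch with hypothetical monthly means: index-based interpolation treats the two values as point samples at their timestamps and draws a straight line in time between them, ignoring that each value represents an average over a reporting period:

```python
import pandas as pd

# Hypothetical monthly means: January averaged 5, March averaged 15,
# with a mid-February entry missing.
means = pd.Series(
    [5.0, None, 15.0],
    index=pd.to_datetime(["2010-01-01", "2010-02-14", "2010-03-01"]),
)

# "index" interpolation is purely linear in elapsed time between the two
# neighbouring valid entries; the averaging periods play no role.
filled = means.interpolate(method="index")
print(filled)
```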
@dulte and I brought this up with @jgriesfeller, and he suggested that the best way to proceed is to resample down to the finest temporal resolution, thereby reducing but not eliminating the errors. A scientific discussion should probably motivate the programming solution in this case, and implementing that solution will likely be challenging.