Doctor Visits adjusted signal AUC does not match the raw signal #2045

Open
nolangormley opened this issue Aug 30, 2024 · 3 comments
Labels: data quality (Missing data, weird data, broken data)

Comments

@nolangormley (Contributor) commented Aug 30, 2024

Actual Behavior:

When looking at the data from the Doctor Visits signal, the area under the curve (AUC) of the day-adjusted signal does not match that of the raw signal: over the window plotted below, the raw signal values sum to 67.70 while the day-adjusted values sum to 56.22.

[Plot: raw (smoothed_cli) vs. day-adjusted (smoothed_adj_cli) Doctor Visits signal, nation-level, 2024-05-29 to 2024-08-29]

Expected Behavior:

@RoniRos and I were looking through this yesterday, and our intuition was that the AUC should match between these two signals.

Context

Here's some code to replicate the plot above

import wget
import pandas as pd

# Download the raw and day-adjusted national signals; wget.download returns the local filename
docvisit = wget.download("https://api.covidcast.cmu.edu/epidata/covidcast/csv?signal=doctor-visits:smoothed_cli&start_day=2024-05-29&end_day=2024-08-29&geo_type=nation")
docvisitadj = wget.download("https://api.covidcast.cmu.edu/epidata/covidcast/csv?signal=doctor-visits:smoothed_adj_cli&start_day=2024-05-29&end_day=2024-08-29&geo_type=nation")

df = pd.read_csv(docvisit)
dfadj = pd.read_csv(docvisitadj)

# Parse dates and keep only the adjusted value, renamed so the two series can sit side by side
df.time_value = pd.to_datetime(df.time_value, utc=True)
dfadj.time_value = pd.to_datetime(dfadj.time_value, utc=True)
dfadj = dfadj[['time_value', 'value']].rename(columns={'value': 'valueadj'})

# Merge on date and plot raw vs. day-adjusted
foo = df[['time_value', 'value']].merge(dfadj, on='time_value', how='left')
foo.plot(x='time_value', y=['value', 'valueadj'])
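Summing the two merged columns reproduces the totals quoted above (a small addition to the snippet, not part of the original):

# "AUC" here is just the sum of daily values over the window
print(foo['value'].sum())       # raw smoothed_cli, ~67.70
print(foo['valueadj'].sum())    # day-adjusted smoothed_adj_cli, ~56.22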
@nolangormley nolangormley added the data quality Missing data, weird data, broken data label Aug 30, 2024
@nolangormley nolangormley self-assigned this Aug 30, 2024
@nolangormley (Contributor, Author)

I believe this was part of @rumackaaron's work. Are we correct in assuming that these should match?

@rumackaaron (Contributor)

Interesting find! Mathematically, they don't have to match, and I think that's the expected behavior in this case. When creating the design matrix in weekday.py, the constraint is that $\sum_{wd=0}^{6} \alpha_{wd} = 0$. After fitting the day-of-week parameters $\alpha$, we take the original signal $y_t$ and multiply it by $\exp(\alpha_{wd})$ to get the weekday-adjusted signal $y'_t$ (where $wd$ is the day of week of $t$).

For simplicity, say that there are only two days in the week. Let $\alpha_0 = -1$ and $\alpha_1 = 1$, and $y_0 = 5$ and $y_1 = 1$. The sum of the raw values $y$ is 6, and the sum of the weekday-adjusted values is $5\exp(-1) + \exp(1) \approx 4.56$. We see something similar here, where the sum of the adjusted signal is lower than the sum of the raw signal.
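A quick numerical check of that two-day example (an added sketch, not part of the original comment):

import numpy as np

alpha = np.array([-1.0, 1.0])   # toy day-of-week effects (they sum to zero)
y = np.array([5.0, 1.0])        # raw values for the two "days"
y_adj = y * np.exp(alpha)       # multiplicative weekday adjustment
print(y.sum(), y_adj.sum())     # 6.0 vs. ~4.56 -- the sums need not agree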

It may be possible to create a different constraint to ensure that, at least on the training data, the sum of the original signal is the same as that of the adjusted signal. I don't think it's possible to ensure that constraint holds over an arbitrary time interval while using multiplicative day-of-week effects.
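One way to realize that alternative constraint on a fixed training window would be to rescale the fitted adjusted series so its sum matches the raw sum. This is only a hypothetical sketch, not what weekday.py actually does:

import numpy as np

def rescale_to_raw_sum(y_raw, y_adj):
    # Scale the adjusted series by a single constant so that, on this training
    # window, sum(adjusted) == sum(raw); other windows still need not match.
    return y_adj * (np.sum(y_raw) / np.sum(y_adj))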

P.S. I find it concerning that the "sawtooth" pattern is still present in the adjusted signal. I don't know what the training period is for fitting the day-of-week effects, but it may be worth experimenting to find an appropriate period that consistently removes the "sawtooth" pattern.

@RoniRos (Member) commented Sep 1, 2024

> I don't think it's possible to ensure that constraint holds over an arbitrary time interval while using multiplicative day-of-week effects.

Indeed. In fact, it's not possible to ensure that with any modification (think of the special case of an interval of one day).

Even if we relax the requirement to all intervals of some fixed length (e.g. 7 days), I think that the only solution is a moving average. But a moving average isn't sufficiently sensitive to the most recent developments.

This suggests an asymmetric kernel, e.g. a triangle or half-Gaussian. I think all kernels satisfy some form of long-term AUC equivalence. But this doesn't address the day-of-week effects.
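As a rough illustration of the kind of one-sided kernel being suggested (a sketch only; the window length and exact shape are arbitrary here):

import numpy as np

def one_sided_triangle_smooth(y, k=7):
    # Triangular weights over the most recent k days, heaviest on the current
    # day, normalized to sum to 1 so the long-run total of the smoothed series
    # roughly matches that of the raw series (edge effects aside).
    w = np.arange(1, k + 1, dtype=float)
    w /= w.sum()
    out = np.full(len(y), np.nan)
    for t in range(k - 1, len(y)):
        out[t] = np.dot(w, y[t - k + 1:t + 1])
    return out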

We need to send this problem for some research TLC.
