Commit 3ce4428
add notebooks
1 parent 48b1f96

6 files changed: +3241 −0 lines changed

@@ -0,0 +1,148 @@
---
title: "Google Symptoms Dataset"
author: "Addison"
output:
  github_document:
    toc: true
---

```{r import_statements, echo = FALSE, message = FALSE}
library(tidyverse)
```

In this notebook, we look at California data in order to obtain a
general impression of the dataset.

```{r base_urls, echo = FALSE, message = FALSE}
BASE_DAILY_URL = paste0(
  'https://raw.githubusercontent.com/google-research/open-covid-19-data/',
  'master/data/exports/search_trends_symptoms_dataset/',
  'United%20States%20of%20America/subregions/{state}/',
  '2020_US_{state}_daily_symptoms_dataset.csv')
BASE_WEEKLY_URL = paste0(
  'https://raw.githubusercontent.com/google-research/open-covid-19-data/',
  'master/data/exports/search_trends_symptoms_dataset/',
  'United%20States%20of%20America/subregions/{state}/',
  '2020_US_{state}_weekly_symptoms_dataset.csv')

CA_DAILY_URL = str_replace_all(BASE_DAILY_URL,
                               fixed('{state}'), 'California')
CA_WEEKLY_URL = str_replace_all(BASE_WEEKLY_URL,
                                fixed('{state}'), 'California')
```
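
The same templates generalize beyond California. A small helper (my addition,
not part of the original notebook) fills the `{state}` placeholder for an
arbitrary state, percent-encoding spaces on the assumption that the repository
follows the same naming scheme for every state:

```{r state_url_helper, echo = TRUE}
# Hypothetical helper: build the daily-file URL for any state. Assumes
# multi-word state names are percent-encoded in the path, as they are for
# 'United%20States%20of%20America' above.
state_daily_url = function(state) {
  str_replace_all(BASE_DAILY_URL, fixed('{state}'),
                  str_replace_all(state, fixed(' '), '%20'))
}
# e.g., state_daily_url('New York')
```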

```{r read_data, echo = FALSE, message = FALSE}
ca_daily_df = read_csv(CA_DAILY_URL)
ca_weekly_df = read_csv(CA_WEEKLY_URL)
```

### Temporal availability
Data is available at a daily resolution, starting on January 1, 2020.
Google also provides weekly rollups every Monday, starting on January 6, 2020.
The weekly rollups are useful because the increase in sample size
diminishes the effect of the noise added for differential privacy; in some
circumstances, they may also allow us to obtain information in regions where
data is too sparse to be reported at a consistent daily resolution.
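
To see the two resolutions side by side, a sketch like the following (my
addition; the column `symptom:Fever` and the state-level region code `US-CA`
are assumptions for illustration) aggregates the daily series into weekly
means for comparison against Google's rollup:

```{r weekly_vs_daily_sketch, echo = TRUE, eval = FALSE}
# Aggregate daily values for one assumed symptom column into weekly means
# (weeks starting Monday, matching the rollup cadence described above).
ca_daily_df %>%
  filter(open_covid_region_code == 'US-CA') %>%
  mutate(week_start = lubridate::floor_date(date, unit = 'week',
                                            week_start = 1)) %>%
  group_by(week_start) %>%
  summarise(fever_daily_mean = mean(`symptom:Fever`, na.rm = TRUE))
```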

Daily availability:
```{r daily_availability, echo = TRUE}
print(head(unique(ca_daily_df$date)))
print(tail(unique(ca_daily_df$date)))
print(length(unique(ca_daily_df$date)))
```

Weekly availability:
```{r weekly_availability, echo = TRUE}
print(unique(ca_weekly_df$date))
```

### Spatial availability
Data is available at the county level, which is an improvement upon our
original Google Health Trends signal. However, there are varying degrees
of missingness in the data, in line with Google's policy of not reporting
data when counts are too small, in order to protect users' privacy.

```{r spatial_availability, echo = TRUE}
print(unique(ca_daily_df$open_covid_region_code))
```
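
To quantify that missingness by region (a quick sketch of my own, not in the
original notebook), we can average the per-row share of non-missing symptom
cells within each region code:

```{r region_density_sketch, echo = TRUE}
# Share of symptom cells that are non-missing, averaged by region code;
# sparsely reported counties will sit near the bottom of this ranking.
sym_mat = as.matrix(ca_daily_df %>% select(contains('symptom:')))
sort(tapply(rowMeans(!is.na(sym_mat)),
            ca_daily_df$open_covid_region_code, mean),
     decreasing = TRUE)
```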

### Symptom availability
The dataset provides normalized search volumes for 422 distinct "symptoms".
Note, however, that a single search may count toward multiple symptoms
[citation needed, but I read this in their documentation]. The normalization
is a linear scaling; the details are in Google's PDF documentation
(TODO: summarize them here).
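
Pending that summary, a quick empirical check (my addition) at least bounds
the scale of the reported values:

```{r value_range_sketch, echo = TRUE}
# Five-number summary (plus NA count) of every normalized symptom value
# in the daily file.
summary(as.vector(as.matrix(ca_daily_df %>% select(contains('symptom:')))))
```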
```{r extract_symptom_columns, echo = TRUE}
# Symptom columns are prefixed with 'symptom:'; strip the prefix for names.
symptom_cols = colnames(ca_daily_df)[
  str_detect(colnames(ca_daily_df), fixed('symptom:'))]
symptom_names = str_replace(symptom_cols, fixed('symptom:'), '')
```

Although there are hundreds of topics included, note that neither
`covid` nor `corona` appears as a substring of any term.

```{r grep_covid_corona, echo = TRUE}
sum(str_detect(symptom_names, fixed('covid', ignore_case = TRUE)))
sum(str_detect(symptom_names, fixed('corona', ignore_case = TRUE)))
```
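
Conventional medical terms do appear, however; for instance (an illustrative
check of my own, with `fever` assumed to match at least one topic):

```{r grep_fever_sketch, echo = TRUE}
# Symptom names containing 'fever', case-insensitively.
symptom_names[str_detect(symptom_names, fixed('fever', ignore_case = TRUE))]
```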

```{r calculate_availability, echo = FALSE}
# Proportion of days on or after 2020-03-15 with non-missing data, per symptom.
data_matrix = ca_daily_df %>%
  filter(date >= '2020-03-15') %>%
  select(contains('symptom:')) %>%
  as.matrix()
availability_vector = apply(!is.na(data_matrix), 2, mean)
names(availability_vector) = symptom_names
```

The large number of symptoms for which data is provided spans those that
are available almost every day to those available on only around 10% of days.
If my understanding is correct, we can think of data availability as roughly
corresponding to whether search incidence exceeds a certain minimum threshold.

```{r plot_availability, echo = FALSE}
plot(sort(availability_vector),
     main = 'Availability of data across symptoms',
     ylab = 'Prop. of days that a symptom reported data')
```
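
As a numeric companion to the plot (my addition), the quartiles of the
availability distribution:

```{r availability_quantiles, echo = TRUE}
quantile(availability_vector, probs = seq(0, 1, 0.25))
```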

### Symptoms by degree of availability
Because 422 topics are too many to parse simultaneously, I organize them
based on their availability (1 - missingness) level, starting on March 15,
2020 (a soft start point for the coronavirus pandemic in the United States,
and the start date of the filter applied above).
```{r print_symptoms_by_availability, echo = TRUE}
for (idx in 9:0) {
  cat(sprintf('Symptoms that reported data on %d%% to %d%% of days:',
              idx*10, (idx+1)*10), '\n')
  # Half-open bins [idx/10, (idx+1)/10) prevent a symptom that sits exactly
  # on a boundary from being printed twice; the top bin is closed at 100%.
  in_bin = (availability_vector >= idx/10) &
    (availability_vector < (idx+1)/10 | (idx == 9 & availability_vector <= 1))
  print(names(availability_vector[in_bin]))
  cat('\n')
}
```

### All symptoms
Here we print the entire symptom list, ordered as the columns appear in
Google's dataset:
```{r print_all_symptoms, echo = TRUE}
print(symptom_names)
```

```{r comment_block, echo = FALSE}
# Further directions
# Filter to counties [states] for each
# Concatenate all counties [states]
# Export

# * Missingness in data by column (symptom type)
# * Need a better way of visualizing, given that there are over 400 symptoms
# * Many of these symptoms are not necessarily related to COVID; most seem like
#   general health terms, such as "Weight gain" and "Wart"
# * Some way of doing imputation
# * Find principal components (maybe on some subset of the columns that are
#   densely populated?), use PC1 as the initial signal
# * Run a correlation analysis against the initial version of the signal
```
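
As a first pass at the PCA direction in the notes above (a sketch under
assumptions of my own: a 0.9 availability cutoff stands in for "densely
populated", and remaining gaps are mean-imputed):

```{r pca_sketch, echo = TRUE}
# Keep densely reported symptom columns, mean-impute remaining NAs, and take
# the first principal component as a candidate initial signal.
dense = availability_vector >= 0.9
X = data_matrix[, dense, drop = FALSE]
for (j in seq_len(ncol(X))) {
  X[is.na(X[, j]), j] = mean(X[, j], na.rm = TRUE)
}
# scale. = TRUE assumes no column is constant after imputation.
pc = prcomp(X, center = TRUE, scale. = TRUE)
signal_pc1 = pc$x[, 1]
```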
